Keeping filename with data in DataLoader

Suppose I have a set of images that I have loaded into a DataLoader object, with the labels (class values). How can I also attach the filename of the image, to retrieve at a later point? So, this set of images could be from the validation set, where we know the labels.

Hi,

I’ll build my answer on a classification scenario: say we have a bunch of image files stored in some directory, '/myImgDir/'. The convention we follow is that each image name also encodes the class. For example, img1456_1.jpg is the image with index 1456 and class label 1, and img125_5.jpg is the image with index 125 and class label 5 (and so on). Say also that we have split the files into train/validation/test sets and stored the file names in data_train.txt, data_val.txt and data_test.txt. Depending on what we are doing (training, validating or testing), our dataset will read the corresponding file.
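
The naming convention above can be parsed programmatically. This is only a minimal sketch, assuming every file name matches img<index>_<class>.jpg. The helper name parse_img_name is hypothetical and is not part of the dataset class that follows:

```python
import os
import re

# Assumed pattern: img<index>_<class>.jpg, e.g. img1456_1.jpg
_NAME_RE = re.compile(r"^img(\d+)_(\d+)\.jpg$")


def parse_img_name(path):
    """Return (index, class_label) parsed from an image path (hypothetical helper)."""
    match = _NAME_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected file name: " + path)
    return int(match.group(1)), int(match.group(2))
```

For example, parse_img_name('/myImgDir/img1456_1.jpg') gives (1456, 1). Because the class is matched as a run of digits, labels with more than one digit are handled too.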

import glob

import numpy as np
import pandas as pd
import mxnet as mx
from mxnet import gluon
from matplotlib.pyplot import imread
class MyDataSet(gluon.data.Dataset):
    """
        INPUT: root directory, mode (train, test, val), and transform of data (optional) and normalization of data (optional)
    """

    def __init__(self, root, mode='train', transform = None, norm = None, channels_first = True):


        self.transform = transform  # data transformation function
        self.norm = norm  # normalization (and restore) of data
        self.channels_first  = channels_first

        # Some times we may give root directory as "/myroot/" and some others as "/myroot"
        # so let's be able to read in all possible scenarios. 
        if (root[-1] == '/'):
            self.root = root
        else :
            self.root = root + r'/'

        # Let's decide what this dataset is going to be, for training or testing/validation?
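        # Read the corresponding file names.
        # header=None assumes the txt files hold only image names, with no header row;
        # without it pandas would treat the first file name as a column header and drop it.
        if mode == 'train':
            self.df = pd.read_csv(self.root + r'data_train.txt', header=None)
        elif mode == 'val':
            self.df = pd.read_csv(self.root + r'data_val.txt', header=None)
        elif mode == 'test':
            self.df = pd.read_csv(self.root + r'data_test.txt', header=None)
        else:
            raise ValueError("mode is not train, val or test, aborting")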

        # Now let's read ALL images
        self.img_names = glob.glob(self.root+'*.jpg')

        # And let's keep the ones that are the intersection with the corresponding 
        # txt file that has the filenames for the particular mode we are interested 
        # (e.g., from all images, some are train, some are test etc) 
        names = []
        for name in self.img_names:
            names += [name.replace(self.root,"")]

        self.img_names = names
        self.img_names = set(self.img_names).intersection(set(np.ravel(self.df.values)))
        self.img_names = list(self.img_names)

        # Restore back the filename with the root prefix 
        # I was lazy, in the data_train.txt it's just the img names, without the root directory path. 
        names = []
        for name in self.img_names:
            names += [self.root + name]  # self.root always ends with '/', the raw root may not

        self.img_names = names

    # Convenience function to read the image and corresponding label. Notice that it returns a TRIPLET
    # img, label, image name
    def read_img(self,name):
        img = imread(name)

        if (self.channels_first):
            img = img.transpose([2,0,1])

        # The class label is the last character before '.jpg' (single-digit classes only)
        label = int(name[-5])

        if self.norm is not None:
            img = self.norm(img)

        if self.transform is not None:
            img = self.transform(img)

        return img,label, name

    def __len__(self):
        return len(self.img_names)


    def __getitem__(self, idx):

        name=self.img_names[idx]

        return self.read_img(name) # Returns a triplet of values, img, label, name of image
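
The filtering step in __init__ (keep only the images on disk that are listed in the split file) can also be written more compactly. This is only a minimal sketch, assuming the split file holds one bare file name per line with no header row. select_split_files is a hypothetical helper, not part of gluon:

```python
import glob
import os


def select_split_files(root, split_file):
    """Return sorted full paths of the .jpg files in root that are listed in split_file."""
    # Assumption: one bare file name per line, no header row.
    with open(os.path.join(root, split_file)) as fh:
        wanted = {line.strip() for line in fh if line.strip()}
    # os.path.join works whether or not root ends with a slash.
    on_disk = glob.glob(os.path.join(root, "*.jpg"))
    return sorted(p for p in on_disk if os.path.basename(p) in wanted)
```

Sorting makes the order of the file names reproducible across runs, which a set intersection does not guarantee.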

So, how do we use this?

root = '/myImgDir/'
dataset_train = MyDataSet(root)
dataloader = mx.gluon.data.DataLoader(dataset_train, batch_size=32, num_workers=24) # am greedy on cpus

for i, data in enumerate(dataloader):
    img_batch, label_batch, names_batch = data
    break 

So now img_batch is a batch of 32 images, label_batch has the corresponding labels, and names_batch has the corresponding file names.
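
One thing the names are handy for is tracing predictions back to files, e.g. on the validation set. This is only a minimal sketch, assuming the labels and predictions of a batch have already been converted to plain sequences (for NDArrays, via .asnumpy()). misclassified_names and the sample values below are hypothetical:

```python
def misclassified_names(names, labels, preds):
    """Return the file names whose predicted class differs from the true label."""
    return [n for n, y, p in zip(names, labels, preds) if int(y) != int(p)]


# Hypothetical values standing in for one batch and a model's predictions.
names_batch = ["/myImgDir/img1456_1.jpg", "/myImgDir/img125_5.jpg"]
label_batch = [1, 5]
pred_batch = [1, 3]
wrong = misclassified_names(names_batch, label_batch, pred_batch)
print(wrong)  # the second image: true class 5, predicted 3
```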

Hope this helps?
Cheers


Exactly what I was after! Thanks for taking the time out to answer. :grin:
