Suppose I have a set of images that I have loaded into a DataLoader object, with the labels (class values). How can I also attach the filename of the image, to retrieve at a later point? So, this set of images could be from the validation set, where we know the labels.
Hi,
I’ll build my answer on a classification scenario: say we have a bunch of image files stored in some directory, '/myImgDir/'. The convention we follow is that each image name also carries the class information. For example, the file img1456_1.jpg is the image with index 1456 and class value 1, and img125_5.jpg is the image with index 125 and class label 5 (and so on). Say also that we read them into some Python program and split them into train/test/validation sets, storing the file names of each split in data_train.txt, data_val.txt and data_test.txt. Depending on what we are doing (training or testing), our dataset will read the corresponding file.
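The naming convention itself can be checked in isolation. Here is a minimal sketch, assuming every file name has exactly the form img<index>_<label>.jpg; parse_img_name is a hypothetical helper for illustration only and is not part of the dataset class. Using a regular expression also copes with class labels of more than one digit.

```python
import os
import re

# Assumed convention: "img<index>_<label>.jpg", e.g. "img1456_1.jpg".
_NAME_PATTERN = re.compile(r"^img(\d+)_(\d+)\.jpg$")


def parse_img_name(path):
    """Return (index, label) parsed from an image path that follows the convention."""
    match = _NAME_PATTERN.match(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected file name: " + path)
    return int(match.group(1)), int(match.group(2))


print(parse_img_name("/myImgDir/img1456_1.jpg"))  # (1456, 1)
print(parse_img_name("/myImgDir/img125_5.jpg"))   # (125, 5)
```

Raising on names that do not match makes a stray file in the directory show up immediately instead of silently producing a wrong label.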
import glob

import mxnet as mx
import numpy as np
import pandas as pd
from matplotlib.pyplot import imread
from mxnet import gluon


class MyDataSet(gluon.data.Dataset):
    """
    INPUT: root directory, mode (train, test, val), transform of the data (optional)
    and normalization of the data (optional).
    """
    def __init__(self, root, mode='train', transform=None, norm=None, channels_first=True):
        self.transform = transform  # data transformation function
        self.norm = norm  # normalization of the data
        self.channels_first = channels_first

        # Sometimes we may give the root directory as "/myroot/" and sometimes as "/myroot",
        # so let's be able to handle both.
        if root[-1] == '/':
            self.root = root
        else:
            self.root = root + '/'

        # Let's decide what this dataset is going to be: for training or testing/validation?
        # Read the corresponding file names.
        # Note: pd.read_csv treats the first line as a header by default;
        # pass header=None if the txt files have no header line.
        if mode == 'train':
            self.df = pd.read_csv(self.root + 'data_train.txt')
        elif mode == 'val':
            self.df = pd.read_csv(self.root + 'data_val.txt')
        elif mode == 'test':
            self.df = pd.read_csv(self.root + 'data_test.txt')
        else:
            raise ValueError("mode is not train, val or test, aborting")

        # Now let's list ALL images
        self.img_names = glob.glob(self.root + '*.jpg')

        # And let's keep the ones that are the intersection with the corresponding
        # txt file that has the filenames for the particular mode we are interested in
        # (e.g., from all images, some are train, some are test etc.)
        names = []
        for name in self.img_names:
            names += [name.replace(self.root, "")]
        self.img_names = names
        self.img_names = set(self.img_names).intersection(set(np.ravel(self.df.values)))
        self.img_names = list(self.img_names)

        # Restore the filename with the root prefix.
        # I was lazy: in data_train.txt it's just the img names, without the root directory path.
        names = []
        for name in self.img_names:
            names += [self.root + name]  # self.root always ends with '/'
        self.img_names = names

    # Convenience function to read the image and corresponding label. Notice that it returns a TRIPLET:
    # img, label, image name
    def read_img(self, name):
        img = imread(name)
        if self.channels_first:
            img = img.transpose([2, 0, 1])  # (H, W, C) -> (C, H, W)

        # The class is the single digit right before ".jpg", e.g. img1456_1.jpg -> 1
        label = int(name[-5])

        if self.norm is not None:
            img = self.norm(img)
        if self.transform is not None:
            img = self.transform(img)

        return img, label, name

    def __len__(self):
        return len(self.img_names)

    def __getitem__(self, idx):
        name = self.img_names[idx]
        return self.read_img(name)  # Returns a triplet of values: img, label, name of image
So how do we use this?
root = '/myImgDir/'
dataset_train = MyDataSet(root)
dataloader = mx.gluon.data.DataLoader(dataset_train, batch_size=32, num_workers=24)  # I am greedy on CPUs
for i, data in enumerate(dataloader):
    img_batch, label_batch, names_batch = data
    break
So now img_batch is a batch of 32 images, label_batch has the corresponding labels, and names_batch has the corresponding file names.
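Once the names travel with each batch, retrieving them at a later point is a matter of pairing them with whatever you compute per image. Here is a minimal sketch for listing the files a model got wrong, assuming the labels and predictions have already been converted to plain Python ints or NumPy arrays (for example with asnumpy()); misclassified_names and the toy values are hypothetical and only for illustration.

```python
def misclassified_names(names_batch, label_batch, pred_batch):
    """Return the file names whose predicted class differs from the true label."""
    return [name
            for name, label, pred in zip(names_batch, label_batch, pred_batch)
            if int(label) != int(pred)]


# Toy values standing in for one validation batch; in practice the names and labels
# would come from the dataloader and the predictions from your model.
names = ['/myImgDir/img1456_1.jpg', '/myImgDir/img125_5.jpg', '/myImgDir/img7_0.jpg']
labels = [1, 5, 0]
preds = [1, 2, 0]

print(misclassified_names(names, labels, preds))  # ['/myImgDir/img125_5.jpg']
```

Accumulating the returned lists over all batches gives the full set of validation images worth inspecting by eye.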
Hope this helps.
Cheers
Exactly what I was after! Thanks for taking the time out to answer.