Handling data too big to fit in memory - what is MXNet's Keras generator analogue?

I’m training on a dataset that won’t fit in a single machine’s memory.

When I had a similar problem in Keras I just used generators, for example the ones from keras.image.

What is the analogue in MXNet? I was looking into the documentation on IO and on Gluon, but I didn’t find anything (or maybe I’m wrong; for example, datasets seem to have a defined length and a getter, so it looks like they’re stored in memory).

I think you want mxnet.recordio.MXIndexedRecordIO.

It lets you randomly pull individual records from a file into a batch, rather than loading the whole data file at once. It uses two files to do this: an index file and the data file.

There are other similar iterators too, but I think that is the most flexible.
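For reference, here is a minimal sketch of how MXIndexedRecordIO can be used. The file names and the dummy array data are made up for illustration; for real JPEG images you would typically use mx.recordio.pack_img / unpack_img instead of raw bytes.

import mxnet as mx
import numpy as np

# Write phase: each record gets an integer key in the .idx file
# ('data.idx' / 'data.rec' are hypothetical file names).
writer = mx.recordio.MXIndexedRecordIO('data.idx', 'data.rec', 'w')
for i in range(100):
    header = mx.recordio.IRHeader(flag=0, label=float(i % 10), id=i, id2=0)
    img = np.random.randint(0, 255, (64, 64, 3)).astype(np.uint8)  # dummy image
    writer.write_idx(i, mx.recordio.pack(header, img.tobytes()))
writer.close()

# Read phase: pull record 42 directly by its key, without reading the whole file.
reader = mx.recordio.MXIndexedRecordIO('data.idx', 'data.rec', 'r')
header, buf = mx.recordio.unpack(reader.read_idx(42))
img = np.frombuffer(buf, dtype=np.uint8).reshape(64, 64, 3)
print(header.label, img.shape)
reader.close()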

You need gluon.data.DataLoader and gluon.data.Dataset. Some tutorials are: 1, 2.

For example, in my case, for semantic segmentation problems, I have a bunch of imgs-UniqueID.npy and imgs-UniqueID-mask.npy files in a directory with the subdirectories training/imgs, training/masks, validation/imgs, validation/masks, and so on for test (see code below). Then I use the following dataset wrapper to load a single image:


import os
import numpy as np

from mxnet.gluon.data import dataset


class SemSegDataset(dataset.Dataset):
    """
    Usage: the user needs to provide a root directory that has the following structure: 

        root:
            training:
                imgs/
                masks/
            validation:
                imgs/
                masks/
            test:
                imgs/
                masks/


    Each of the corresponding imgs/ and masks/ directories must contain images (numpy format *.npy) where the mask has the same name component as the corresponding image. 
    E.g. img1 = 'img-2345-sdgh.npy'
         mask1= 'img-2345-sdgh-mask.npy'

    This is necessary so that the sorted file lists that are constructed have the correct correspondence between images and masks. 
    """

    def __init__(self, root, mode='train', transform=None, norm=None):

        # Transformation of augmented data
        self._mode = mode
        self._transform = transform
        self._norm = norm # Normalization of img

        # Take into account how root directory is entered
        if (root[-1]=='/'):
            self._root = root
        else :
            self._root = root + '/'


        if (self._mode == 'train'):
            self._root_img = self._root + 'training/imgs/'
            self._root_mask = self._root + 'training/masks/'

        elif (self._mode == 'val'):
            self._root_img = self._root + 'validation/imgs/'
            self._root_mask = self._root + 'validation/masks/'

        elif (self._mode == 'test'):
            self._root_img = self._root + 'test/imgs/'
            self._root_mask = self._root + 'test/masks/'

        else :
            raise Exception('Inconsistent mode given, available choices: {train, val, test}')



        # Read images and masks list - sorted so they are in correspondence. 
        self._image_list = sorted(os.listdir(self._root_img))
        self._mask_list = sorted(os.listdir(self._root_mask))


        assert len(self._image_list) == len(self._mask_list), "Number of images and masks differ"


    def __getitem__(self, idx):

        base_filepath = os.path.join(self._root_img, self._image_list[idx])
        mask_filepath = os.path.join(self._root_mask, self._mask_list[idx])


        # load in float32
        base = np.load(base_filepath)
        base = base.astype(np.float32)

        mask = np.load(mask_filepath)
        mask = mask.astype(np.float32)


        # Optional augmentation (applied jointly to image and mask)
        if self._transform is not None:
            base, mask = self._transform(base, mask)

        # Optional normalization (applied to the image only)
        if self._norm is not None:
            base = self._norm(base.astype(np.float32))

        return base.astype(np.float32), mask.astype(np.float32)

    def __len__(self):
        return len(self._image_list)

This is a bit more complicated because the user can optionally provide a normalization function (applied to each image, e.g. standardization) and a transform (for data augmentation, see the Gluon tutorial). Then you can use this dataset (with a gluon.data.DataLoader) in a for loop to train your network in the following way.
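For illustration, here is a minimal sketch of what such callables could look like; these are hypothetical stand-ins, not the ISPRSNormal / SemSegAugmentor classes mentioned in the code below.

import numpy as np

def tnorm_example(img):
    # per-image standardization: zero mean, unit variance per channel
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True) + 1e-8
    return (img - mean) / std

def ttransform_example(img, mask):
    # joint augmentation: apply the same random horizontal flip to image and mask
    if np.random.rand() < 0.5:
        img = img[:, ::-1].copy()
        mask = mask[:, ::-1].copy()
    return img, mask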

# This is how you define it, with optional normalization and augmentation functions.
from mxnet import gluon

Nbatch = 32

# tnorm = ISPRSNormal()           # some normalization function
tnorm = None
# ttransform = SemSegAugmentor()  # some data augmentation function
ttransform = None
root = r'/home/foivos/Data/'
dataset = SemSegDataset(root, mode='train', norm=tnorm, transform=ttransform)
datagen = gluon.data.DataLoader(dataset, batch_size=Nbatch, last_batch='rollover',
                                shuffle=True, num_workers=8)

and this is an example of a for loop that uses it:

for i, data in enumerate(datagen):
    imgs, masks = data
    # do stuff 

    break # this stops the iteration after first batch is loaded
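If it helps, here is a rough sketch of what the “do stuff” part of that loop could look like for Gluon training; net, loss_fn, trainer, and ctx are assumed to be defined elsewhere and are not part of the original post.

from mxnet import autograd

for i, (imgs, masks) in enumerate(datagen):
    # move the batch to the target device (ctx = mx.cpu() or mx.gpu())
    imgs = imgs.as_in_context(ctx)
    masks = masks.as_in_context(ctx)

    with autograd.record():
        preds = net(imgs)             # forward pass
        loss = loss_fn(preds, masks)  # e.g. a pixel-wise softmax cross-entropy
    loss.backward()
    trainer.step(Nbatch)              # update parameters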

Hope this helps. By the way, I don’t think numpy arrays are the most efficient way to go, but it works for me (I haven’t done thorough code profiling).


Hi Foivos, what is the reason you stayed with one file per image? Wouldn’t it be more efficient to pack multiple records into .rec files and batch from those?


Hi @olivcruche, when I started writing my code for semantic segmentation problems I needed custom image+mask transformations that, at least back then (1.5y ago), didn’t exist in mxnet (or I couldn’t find them). So I coded up everything using opencv (in fact I started my deep learning journey with TF). Then, since this isn’t a bottleneck in my experiments, I stuck with it. I want to explore the option of .rec files too; it is on my TODO list (no time for now).

Cheers,
F.