Handling data too big to fit in memory - what is MXNet's Keras generator analogue?

I’m training on a dataset that won’t fit in a single machine’s memory.

When I had a similar problem in Keras I just used generators, for example the ones from keras.image.

What is the analogue in MXNet? I was looking into the documentation on IO and on Gluon, but I didn’t find anything (or maybe I’m wrong; for example, datasets seem to have a defined length and a getter, so it looks like they’re stored in memory).

I think you want mxnet.recordio.MXIndexedRecordIO.

It lets you randomly pull individual records from a file into a batch, rather than loading the whole data file at once. It uses two files to do this: an index file and the data file.

There are other similar iterators too, but I think that is the most flexible.
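For reference, here is a minimal sketch of how MXIndexedRecordIO can be used. The file names and the dummy array data are made up for illustration; for real JPEG images you would typically use mx.recordio.pack_img / unpack_img instead of raw bytes.

import mxnet as mx
import numpy as np

# Write phase: each record gets an integer key in the .idx file
# ('data.idx' / 'data.rec' are hypothetical file names).
writer = mx.recordio.MXIndexedRecordIO('data.idx', 'data.rec', 'w')
for i in range(100):
    header = mx.recordio.IRHeader(flag=0, label=float(i % 10), id=i, id2=0)
    img = np.random.randint(0, 255, (64, 64, 3)).astype(np.uint8)  # dummy image
    writer.write_idx(i, mx.recordio.pack(header, img.tobytes()))
writer.close()

# Read phase: pull record 42 directly by its key, without reading the whole file.
reader = mx.recordio.MXIndexedRecordIO('data.idx', 'data.rec', 'r')
header, buf = mx.recordio.unpack(reader.read_idx(42))
img = np.frombuffer(buf, dtype=np.uint8).reshape(64, 64, 3)
print(header.label, img.shape)
reader.close()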

You need gluon.data.DataLoader and gluon.data.Dataset. Some tutorials are: 1, 2.

For example, in my case, for semantic segmentation problems, I have a bunch of imgs-UniqueID.npy and imgs-UniqueID-mask.npy files in a directory with the subdirectories training/imgs, training/masks, validation/imgs, validation/masks, and so on for test (see code below). Then I use the following dataset wrapper to load a single image:


import os
import numpy as np

from mxnet.gluon.data import dataset


class SemSegDataset(dataset.Dataset):
    """
    Usage: the user needs to provide a root directory that has the following structure: 

        root:
            training:
                imgs/
                masks/
            validation:
                imgs/
                masks/
            test:
                imgs/
                masks/


    Each of the corresponding imgs/ and masks/ directories must contain images (numpy format *.npy) where the mask has the same name component as the corresponding image. 
    E.g. img1 = 'img-2345-sdgh.npy'
         mask1= 'img-2345-sdgh-mask.npy'

    This is necessary so that the sorted file lists that are constructed have the correct correspondence between images and masks. 
    """

    def __init__(self, root, mode='train', transform=None, norm=None):

        # Transformation of augmented data
        self._mode = mode
        self._transform = transform
        self._norm = norm # Normalization of img

        # Take into account how root directory is entered
        if (root[-1]=='/'):
            self._root = root
        else :
            self._root = root + '/'


        if (self._mode == 'train'):
            self._root_img = self._root + 'training/imgs/'
            self._root_mask = self._root + 'training/masks/'

        elif (self._mode == 'val'):
            self._root_img = self._root + 'validation/imgs/'
            self._root_mask = self._root + 'validation/masks/'

        elif (self._mode == 'test'):
            self._root_img = self._root + 'test/imgs/'
            self._root_mask = self._root + 'test/masks/'

        else :
            raise Exception('Inconsistent mode given, available choices: {train, val, test}')



        # Read images and masks list - sorted so they are in correspondence. 
        self._image_list = sorted(os.listdir(self._root_img))
        self._mask_list = sorted(os.listdir(self._root_mask))


        assert len(self._image_list) == len(self._mask_list), "Number of images and masks differ"


    def __getitem__(self, idx):

        base_filepath = os.path.join(self._root_img, self._image_list[idx])
        mask_filepath = os.path.join(self._root_mask, self._mask_list[idx])


        # load in float32
        base = np.load(base_filepath)
        base = base.astype(np.float32)

        mask = np.load(mask_filepath)
        mask = mask.astype(np.float32)


        # Optional augmentation (applied jointly to image and mask)
        if self._transform is not None:
            base, mask = self._transform(base, mask)

        # Optional normalization (applied to the image only)
        if self._norm is not None:
            base = self._norm(base.astype(np.float32))

        return base.astype(np.float32), mask.astype(np.float32)

    def __len__(self):
        return len(self._image_list)

This is a bit more complicated because the user can optionally provide a normalization function (applied to each image, e.g. standardization) and a transform (for data augmentation, see the Gluon tutorial). Then you can use this dataset (with a gluon.data.DataLoader) in a for loop to train your network in the following way.
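For illustration, here is a minimal sketch of what such callables could look like; these are hypothetical stand-ins, not the ISPRSNormal / SemSegAugmentor classes mentioned in the code below.

import numpy as np

def tnorm_example(img):
    # per-image standardization: zero mean, unit variance per channel
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True) + 1e-8
    return (img - mean) / std

def ttransform_example(img, mask):
    # joint augmentation: apply the same random horizontal flip to image and mask
    if np.random.rand() < 0.5:
        img = img[:, ::-1].copy()
        mask = mask[:, ::-1].copy()
    return img, mask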

# This is how you define it, with optional normalization and augmentation functions.
from mxnet import gluon

Nbatch = 32

# tnorm = ISPRSNormal()           # some normalization function
tnorm = None
# ttransform = SemSegAugmentor()  # some data augmentation function
ttransform = None
root = r'/home/foivos/Data/'
dataset = SemSegDataset(root, mode='train', norm=tnorm, transform=ttransform)
datagen = gluon.data.DataLoader(dataset, batch_size=Nbatch, last_batch='rollover',
                                shuffle=True, num_workers=8)

and this is an example of a for loop that uses it:

for i, data in enumerate(datagen):
    imgs, masks = data
    # do stuff 

    break # this stops the iteration after first batch is loaded
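If it helps, here is a rough sketch of what the “do stuff” part of that loop could look like for Gluon training; net, loss_fn, trainer, and ctx are assumed to be defined elsewhere and are not part of the original post.

from mxnet import autograd

for i, (imgs, masks) in enumerate(datagen):
    # move the batch to the target device (ctx = mx.cpu() or mx.gpu())
    imgs = imgs.as_in_context(ctx)
    masks = masks.as_in_context(ctx)

    with autograd.record():
        preds = net(imgs)             # forward pass
        loss = loss_fn(preds, masks)  # e.g. a pixel-wise softmax cross-entropy
    loss.backward()
    trainer.step(Nbatch)              # update parameters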

Hope this helps. By the way, I don’t think numpy arrays are the most efficient way to go, but it works for me (I haven’t done thorough code profiling).


Hi Foivos, what is the reason you stayed with one file per image? Wouldn’t it be more efficient to pack multiple records into .rec files and batch from those?


Hi @olivcruche, when I started writing my code for semantic segmentation problems I needed custom image+mask transformations that, at least back then (1.5y ago), didn’t exist in mxnet (or I couldn’t find them). So I coded up everything using opencv (in fact I started my deep learning journey with TF). Then, since this isn’t a bottleneck in my experiments, I stuck with it. I want to explore the option of .rec files too; it is on my TODO list (no time for now).

Cheers,
F.