Guidance for big data loading with MXNet

I am designing a recommender system that will train on user-to-item implicit interaction data. The data is too large to fit in memory. The label is binary and the initial features will be categorical and continuous; however, in the future the network should also ingest images, text, sequential data, etc.

It is critical that I can train the model very quickly, which may necessitate training on a GPU cluster, although initially I expect to get away with a large multi-GPU instance.

I’m looking for guidance/links to examples on:

  1. where to store my data
  2. what format to store it in
  3. how to best feed my network

My research suggests RecordIO is the best-practice storage format. This thread agrees; however, I’ve seen other threads mention using CSV iterators or NumPy memory maps. Furthermore, every use case I see involves images only.

Here are some ideas for each of your questions.

  1. where to store my data?
    This depends on how large the data is. Will it fit on disk on a large multi-GPU instance? If so, that should be the preferred solution. Otherwise, you might want to consider S3 or another object storage mechanism.

  2. what format to store it in?
    Like you said, for image data ImageRecordIO is a good idea. There may be similar compressed/optimized storage formats for text and sequential data, but you need to factor in the cost of converting from the default format your data comes in. In terms of performance, your best bet is to load data in parallel using multiprocessing, which you get (see the answer to question 3) by using a DataLoader and setting num_workers to the number of CPUs on the machine. A sketch of writing tabular records to a RecordIO file appears after this list.

  3. how to best feed my network?
    You should probably use gluon.data.Dataset and gluon.data.DataLoader. See the tutorial here for more details.
    You can either define a custom dataset that extends gluon.data.Dataset, or use one of the provided datasets such as gluon.data.vision.datasets.ImageFolderDataset for raw images or gluon.data.vision.datasets.ImageRecordDataset for ImageRecordIO files, and then define a DataLoader over the dataset you choose to feed your network (see the second sketch below). You could also implement your own data loader, where you only need to implement an __iter__ method that yields a batch of data. For text and sequential datasets you can take a look at gluonnlp.data.SimpleDatasetStream.
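For question 2, here is a minimal sketch of writing tabular interaction rows to a RecordIO file with mx.recordio.MXRecordIO. The payload layout (a float label followed by float features, packed with struct) is my own assumption for illustration, not a fixed MXNet convention; any serialization works as long as the reader mirrors it.

```python
# A sketch, assuming rows of (binary label, dense float features).
import struct
import mxnet as mx

rows = [(1, [0.3, 2.0, 5.0]), (0, [1.1, 0.0, 3.0])]  # toy data

record = mx.recordio.MXRecordIO('interactions.rec', 'w')
for label, features in rows:
    # Pack label + features as raw little-endian floats.
    payload = struct.pack('<%df' % (1 + len(features)), float(label), *features)
    record.write(payload)
record.close()

# Reading the records back; read() returns None at end of file.
reader = mx.recordio.MXRecordIO('interactions.rec', 'r')
while True:
    item = reader.read()
    if item is None:
        break
    values = struct.unpack('<%df' % (len(item) // 4), item)
    label, features = values[0], values[1:]
reader.close()
```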
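And for question 3, a minimal sketch of a custom gluon.data.Dataset fed through a DataLoader with multiprocessing workers. The class name, array shapes, and random data here are placeholders; in practice __getitem__ would read a record from disk instead of indexing an in-memory array.

```python
import numpy as np
from mxnet import gluon

class InteractionDataset(gluon.data.Dataset):
    """gluon.data.Dataset only requires __getitem__ and __len__."""
    def __init__(self, features, labels):
        self._features = features
        self._labels = labels

    def __getitem__(self, idx):
        return self._features[idx], self._labels[idx]

    def __len__(self):
        return len(self._labels)

# Toy in-memory data standing in for your real feature/label source.
features = np.random.uniform(size=(1000, 8)).astype('float32')
labels = np.random.randint(0, 2, size=(1000,)).astype('float32')

dataset = InteractionDataset(features, labels)
# num_workers > 0 fetches batches in parallel worker processes.
loader = gluon.data.DataLoader(dataset, batch_size=64, shuffle=True,
                               num_workers=4)

for data, label in loader:
    pass  # forward/backward pass goes here
```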
