Single-node low-disk footprint data loading

Hi, I’m researching single-node data loading approaches with a minimal disk footprint, so I can train on a single node on sharded datasets much larger than local disk. I read in the docs that any iterator reading from a file can also read from S3. Does that mean that the io iterators can form batches of N records in memory, from S3 files containing arbitrary numbers of records, without having to load the whole training set onto local disk? Would Gluon Datasets have similar capabilities?

Hi @olivcruche,

Gluon Datasets don’t currently have this feature, but you can still use DataIters for Gluon training if you need to. Check out the last section of this tutorial for a simple method of converting the schema of the data, if you want to keep the Gluon training code unchanged.

With regard to partitioning across S3 objects, you could try splitting your data into multiple record files and then referencing the common prefix of the S3 objects, but I haven’t tried this and I can’t find any documentation for it! I see that the method for listing objects in a bucket (for a given prefix) has been implemented in dmlc-core:
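As a rough illustration of the prefix idea (this is not dmlc-core's actual API; `shards_for_prefix` is a made-up helper operating on an already-listed set of keys), selecting and chaining record shards by key prefix could look like:

```python
# Hypothetical sketch of prefix-based sharding: given the object keys in
# a bucket, select every record shard under a prefix and process them
# one at a time, so the whole dataset never sits on local disk.
from itertools import chain

def shards_for_prefix(keys, prefix):
    """Return the shard keys under `prefix`, in a stable order."""
    return sorted(k for k in keys if k.startswith(prefix))

keys = ["train/part-000.rec", "train/part-001.rec", "val/part-000.rec"]
print(shards_for_prefix(keys, "train/"))
# ['train/part-000.rec', 'train/part-001.rec']

# Streaming the records would then be a chain over per-shard readers,
# e.g. (read_records is an imaginary per-shard reader, shown commented):
# records = chain.from_iterable(read_records(k) for k in shards)
```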

Would be keen to hear if you make any progress on this, and I will try to implement an S3Dataset when I get a chance.

Alrighty, thanks Thom; but can you confirm that the io iterators do have this property of being able to batch directly in memory from S3, without having to store everything on local disk?

It seems to be the case from looking at the dmlc-core source, but I haven’t tried this personally and couldn’t find documentation that confirms it either way. @tqchen, are you familiar with how this is handled in dmlc-core and how it would apply to MXNet?
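For concreteness, the property being asked about, forming fixed-size batches in memory from a stream of shards of arbitrary size, can be sketched in plain Python (illustrative only, not MXNet's or dmlc-core's implementation):

```python
# Sketch of the batching behaviour in question: form fixed-size batches
# in memory from a stream of shards, where each shard may hold an
# arbitrary number of records and the full dataset never touches disk.
from itertools import chain, islice

def batched(record_stream, batch_size):
    """Yield lists of `batch_size` records; the last batch may be short."""
    it = iter(record_stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

shards = [["a", "b", "c"], ["d"], ["e", "f"]]   # uneven shard sizes
stream = chain.from_iterable(shards)            # e.g. records read from S3
print(list(batched(stream, 4)))
# [['a', 'b', 'c', 'd'], ['e', 'f']]
```

Note that batch boundaries are independent of shard boundaries, which is exactly what lets shards contain arbitrary numbers of records.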

Please try the DatasetLoader and DatasetStream in GluonNLP: