Contradiction in .rec documentation

olivcruche · January 18, 2019, 12:31pm

Hi, there are two contradicting statements in the recordIO documentation https://mxnet.incubator.apache.org/architecture/note_data_loading.html:

“Do the packing once. We don’t want to repack data every time run-time settings, like the number of machines, are changed”
“We don’t need to consider distributed loading issue at the preparation time, just select the most efficient physical file number according to the dataset size and computing resources available.”

Consequently, the proper usage of .rec is not clear: how config-specific should a .rec dataset be? Should we treat the number of files as a hyperparameter and cross-validate it every time run-time settings change?

Related question: Batch formation from .rec files

Cheers

VishaalKapoor · January 19, 2019, 3:10am

Hi @olivcruche

Ultimately you’d write your .rec files once, rather than once per fold of a k-fold cross-validation.

I believe you’re asking a question similar to https://github.com/apache/incubator-mxnet/issues/1252, where someone mentioned chunking in combination with k-fold cross-validation. Note that I don’t believe using chunking this way is correct, as chunking is a performance parameter for pre-loading (for example, it’s limited to between 4MB and 4096MB for ImageRecordIO). Instead, it sounds like you’d like to implement a k-fold iterator like GroupKFold in sklearn.

I see several options, in theory: you could re-create your RecordIO files for each split (not recommended); create several ImageRecordIO iterators, one per split, as described in issue 1252 (these are not guaranteed to be non-overlapping); do something non-overlapping like the GroupKFold example; or use a random-access RecordIO interface like MXIndexedRecordIO with KFold from sklearn, which returns the partitions.
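To make the last option concrete, here is a rough sketch of splitting record keys into non-overlapping folds and fetching records by key. It uses a small pure-Python stand-in for sklearn's KFold so it runs without extra dependencies. The names `kfold_keys` and `read_records` are made up for illustration, and `reader` is assumed to be any object exposing `read_idx(key)`, as MXIndexedRecordIO does.

```python
import random


def kfold_keys(keys, n_splits, seed=0):
    """Yield (train_keys, val_keys) pairs over n_splits non-overlapping folds.

    Hypothetical helper: a minimal stand-in for sklearn.model_selection.KFold
    with shuffling, operating directly on record keys.
    """
    keys = list(keys)
    if not 2 <= n_splits <= len(keys):
        raise ValueError("n_splits must be between 2 and the number of keys")
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(keys)
    # Strided slices give folds whose sizes differ by at most one.
    folds = [keys[i::n_splits] for i in range(n_splits)]
    for i, val_keys in enumerate(folds):
        train_keys = [k for j, fold in enumerate(folds) if j != i for k in fold]
        yield train_keys, val_keys


def read_records(reader, keys):
    """Fetch raw records by key from any object exposing read_idx(key)."""
    return [reader.read_idx(k) for k in keys]
```

With MXNet, the keys could come from the `keys` attribute of an `mx.recordio.MXIndexedRecordIO` reader opened on an existing .idx/.rec pair, and that same reader could be passed to `read_records`. The .rec file is written once and only the key lists change between folds.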

Vishaal

olivcruche · January 21, 2019, 8:27am

Thanks Vishaal, I’m not specifically interested in k-fold CV. I’m just wondering how one would decide how many .rec files (and of what size) a dataset should be split into.

VishaalKapoor · January 21, 2019, 6:56pm

I misunderstood the question. I thought you were asking whether you would re-split for every (k-fold) cross-validation.

It’s difficult to make a very specific recommendation, as ultimately you’re dealing with a hyper (hyper) parameter as you mention. But see below:

The main benefit of splitting into multiple files is for distributed training, so that you can read data in parallel. So if you have n workers, splitting your data into k · n files with k = 1 is a reasonable idea. You may treat k as a hyperparameter if you’re dealing with very large files, because there are OS and hardware constraints and potential speed limitations with huge files. Additionally, transferring one 10 GB file to S3 may be slower than transferring ten 1 GB files in parallel. In the other direction, many small files will not allow for contiguous reads and will make inefficient use of your hard drive (e.g. not filling your page size, having long seek distances for platters).
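As a rough illustration of the k · n idea, the sketch below builds a list of k · n shard names and hands them out to n workers round-robin. The function name `shard_plan`, the `train` prefix and the file naming scheme are all made up for the example and are not something MXNet prescribes.

```python
def shard_plan(num_workers, files_per_worker=1, prefix="train"):
    """Return {worker_rank: [shard file names]} for k * n shards.

    num_workers is n and files_per_worker is k; the naming scheme is
    hypothetical and only serves the illustration.
    """
    if num_workers < 1 or files_per_worker < 1:
        raise ValueError("num_workers and files_per_worker must be >= 1")
    total = num_workers * files_per_worker
    files = [f"{prefix}-{i:05d}.rec" for i in range(total)]
    # Round-robin assignment: worker r gets shards r, r + n, r + 2n, ...
    return {rank: files[rank::num_workers] for rank in range(num_workers)}


if __name__ == "__main__":
    # 4 workers, k = 2: each worker reads two of the eight shards.
    for rank, shards in shard_plan(4, files_per_worker=2).items():
        print(rank, shards)
```

Keeping k as an explicit argument makes it easy to try k = 1 first and raise it only if individual files turn out to be unwieldy.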

Vishaal
