Hi,
some tutorials advise to always use DataLoader over the older DataIter API.
I did some measurements using the MNIST dataset.
The code using DataIter is twice as fast as with DataLoader.
So do the tutorials give bad advice?
And if I have to create a custom data source, should I use the DataIter approach?
A data loader is generally expected to be slower than a data iterator.
We use a data iterator when the dataset is small enough to fit into the available memory (RAM or VRAM). The whole dataset is loaded into memory up front and then simply iterated over according to the given batch size.
When the dataset is too big to fit into memory, we are forced to use a data loader instead. It doesn't try to load the whole dataset into memory; it loads only the current batch, and releases the previous batch from memory before loading the next one.
Doing this makes a data loader slower than a data iterator, since it has to continuously allocate and deallocate memory.
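The batch-at-a-time strategy described above can be sketched in plain Python. The chunked file layout and the `load_chunk` helper are hypothetical stand-ins for reading files from disk:

```python
import numpy as np

def load_chunk(seed):
    # Hypothetical stand-in for reading one file from disk.
    rng = np.random.default_rng(seed)
    return rng.random((1000, 32), dtype=np.float32)

def batches(chunk_ids, batch_size):
    """Yield batches while keeping at most one chunk in memory."""
    for cid in chunk_ids:
        chunk = load_chunk(cid)          # allocate the current chunk
        for i in range(0, len(chunk), batch_size):
            yield chunk[i:i + batch_size]
        del chunk                        # release before loading the next chunk

total = sum(len(b) for b in batches(range(3), 256))
print(total)  # 3000 samples streamed, never all resident at once
```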
But couldn't a custom DataIter do this too (loading the data on demand)?
DataLoader itself creates and returns a _MultiWorkerIter that implements some kind of parallel batch creation from the dataset.
But if the next batches are created in parallel with the consumption of the current batch, the performance shouldn't be that much worse.
The data I'm working with consists of 200,000 recordio files (240 GB) containing numpy arrays (offline processing). The number of arrays differs from file to file.
I implemented an iterator (DataIter) that extracts and batches the numpy arrays from the recordio files in parallel.
The iterator is capable of saturating both GPUs to 93%.
But I'm searching for a more efficient solution that is more 'MXNet compliant'.