MNIST: DataLoader very slow compared to DataIter?

olk · January 16, 2020, 7:07pm

Hi,
Some tutorials advise always using DataLoader over the older DataIter API.
I took some measurements using the MNIST dataset.
The code using DataIter is roughly twice as fast as the code using DataLoader (about 83,000 vs. 46,000 samples/sec, and 7.5 vs. 15.2 elapsed).
So is the tutorials' advice wrong?
If I have to create or use a custom data source, should I use the DataIter approach instead?

DataIter version: https://pastebin.com/xUQzi5G4

Epoch [1], Accuracy 0.7693 ~Samples/Sec 55691.5675
Epoch [2], Accuracy 0.9245 ~Samples/Sec 83681.7507
Epoch [3], Accuracy 0.9554 ~Samples/Sec 83172.9648
Epoch [4], Accuracy 0.9657 ~Samples/Sec 83173.5480
Epoch [5], Accuracy 0.9713 ~Samples/Sec 82785.9656
Epoch [6], Accuracy 0.9767 ~Samples/Sec 83346.3328
Epoch [7], Accuracy 0.9791 ~Samples/Sec 83410.4360
Epoch [8], Accuracy 0.9812 ~Samples/Sec 83240.0001
Epoch [9], Accuracy 0.9825 ~Samples/Sec 83247.2604
Epoch [10], Accuracy 0.9833 ~Samples/Sec 82463.6815
elapsed: 7.509
validation accuracy=0.988498

DataLoader version: https://pastebin.com/kgpqiRYc

Epoch [0], Accuracy 0.8295 ~Samples/Sec 39718.0944
Epoch [1], Accuracy 0.9523 ~Samples/Sec 45847.0761
Epoch [2], Accuracy 0.9683 ~Samples/Sec 47455.3581
Epoch [3], Accuracy 0.9731 ~Samples/Sec 46995.5257
Epoch [4], Accuracy 0.9784 ~Samples/Sec 43869.2088
Epoch [5], Accuracy 0.9809 ~Samples/Sec 46773.8558
Epoch [6], Accuracy 0.9831 ~Samples/Sec 46672.8191
Epoch [7], Accuracy 0.9848 ~Samples/Sec 44058.3605
Epoch [8], Accuracy 0.9856 ~Samples/Sec 45939.9594
Epoch [9], Accuracy 0.9866 ~Samples/Sec 46450.3611
elapsed: 15.238
validation accuracy=0.988700

Oliver

mouryarishik · January 17, 2020, 9:11am

A data loader is generally expected to be slower than a data iterator.

We use a data iterator when the dataset is small enough to fit into the available memory (RAM or VRAM). The dataset is then already loaded into memory and is simply iterated over in batches of the given batch size.

When the dataset is too big to fit into memory, we are forced to use a data loader, which doesn’t try to load the whole dataset at once. Instead, it loads only the current batch into memory and releases the previous batch before loading the next one.

This makes a data loader slower than a data iterator, because a data loader has to continuously allocate and deallocate memory.
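
To make the distinction concrete, here is a minimal pure-Python sketch of the two access patterns described above. It is not MXNet code and does not reproduce how DataIter or DataLoader are implemented; `in_memory_batches`, `on_demand_batches` and `load_sample` are hypothetical names made up for illustration.

```python
def in_memory_batches(data, batch_size):
    """Slice batches out of a dataset that is already fully in memory."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def on_demand_batches(load_sample, num_samples, batch_size):
    """Build each batch by loading its samples one at a time when needed."""
    for start in range(0, num_samples, batch_size):
        stop = min(start + batch_size, num_samples)
        # A fresh list is allocated per batch; the previous one can be freed.
        yield [load_sample(i) for i in range(start, stop)]


def load_sample(i):
    """Hypothetical stand-in for reading and decoding one sample."""
    return i * i


# Both patterns produce the same batches; only the memory behaviour differs.
data = [load_sample(i) for i in range(10)]
```

Both generators yield identical batches here. The second one trades per-batch work for a smaller memory footprint, which is the cost the post above is pointing at.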

olk · January 17, 2020, 10:52am

But couldn’t this be done by a custom DataIter too (loading the data on demand)?

DataLoader itself creates and returns a _MultiWorkerIter that implements some kind of parallel batch creation from the dataset.
But if the next batches are created in parallel with the consumption of the current batch, the performance shouldn’t be that much worse.
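
What I mean by creating the next batches while the current one is consumed could be sketched like this. It is a generic pure-Python prefetcher using a background thread and a bounded queue, written only to illustrate the idea; it is not how `_MultiWorkerIter` works internally, and `prefetch` and `depth` are names I made up.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the batch stream


def prefetch(batches, depth=2):
    """Yield items from `batches`, producing up to `depth` ahead in a thread."""
    q = queue.Queue(maxsize=depth)

    def producer():
        try:
            for batch in batches:
                q.put(batch)  # blocks while the queue is full
        finally:
            q.put(_SENTINEL)  # always signal the end, even after an error

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item


# Order is preserved because there is a single producer and a FIFO queue.
result = list(prefetch(iter([[1, 2], [3, 4], [5]])))
```

With a wrapper like this, batch creation only shows up in the epoch time when it is slower than the training step itself.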

The data I’m working with consists of 200,000 RecordIO files (240 GB) containing NumPy arrays (offline processing). The number of arrays differs from file to file.
I’ve implemented an iterator (DataIter) that extracts and batches the NumPy arrays from the RecordIO files in parallel.
The iterator is able to keep both GPUs at 93% utilization.
But I’m looking for a more efficient solution that is more ‘MXNet compliant’.
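
For reference, the general shape of that approach can be sketched in plain Python as below. This is not my actual iterator and it uses no MXNet or RecordIO calls: `fixed_size_batches`, `read_arrays` and `fake_read` are hypothetical names, and `read_arrays` stands in for whatever extracts the arrays from one file.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def fixed_size_batches(sources, read_arrays, batch_size, workers=4):
    """Read sources in parallel and regroup their arrays into fixed-size batches."""
    sources = iter(sources)
    buffer = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            # Only a small window of files is in flight, so memory stays bounded.
            window = list(islice(sources, workers * 2))
            if not window:
                break
            # pool.map returns results in source order.
            for arrays in pool.map(read_arrays, window):
                buffer.extend(arrays)
                while len(buffer) >= batch_size:
                    yield buffer[:batch_size]
                    buffer = buffer[batch_size:]
    if buffer:
        yield buffer  # final, possibly smaller batch


def fake_read(source):
    """Hypothetical reader: returns `count` placeholder items for one file."""
    name, count = source
    return [f"{name}:{i}" for i in range(count)]


# Files with 3, 1 and 4 arrays regrouped into batches of 3.
batches = list(fixed_size_batches([("a", 3), ("b", 1), ("c", 4)], fake_read, 3))
```

The buffer is what absorbs the varying number of arrays per file, so batches can span file boundaries and every batch except possibly the last has the same size.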
