Some samples were skipped in the training loop

While training a model with the Module API, i.e. module.fit() with an NDArrayIter, I noticed that close to half the samples were skipped in the first epoch. In the subsequent epochs, it looks like all the samples were used.

This is from a multi-label classification problem with about 3.5 million samples, encoded into sparse matrices. Training was performed on a GPU, using a batch size of 128.

From the Speedometer logs during training:

06:29:01 WARNING:Already bound, ignoring bind()
06:29:01 WARNING:optimizer already initialized, ignoring...
06:29:29 INFO:Epoch[0] Batch [5000]	Speed: 23131.93 samples/sec	loss=0.000037
06:29:56 INFO:Epoch[0] Batch [10000]	Speed: 23457.39 samples/sec	loss=0.000037
06:30:23 INFO:Epoch[0] Train-loss=0.000068
06:30:23 INFO:Epoch[0] Time cost=81.726
06:30:51 INFO:Epoch[1] Batch [5000]	Speed: 23192.88 samples/sec	loss=0.000110
06:31:18 INFO:Epoch[1] Batch [10000]	Speed: 23508.18 samples/sec	loss=0.000113
06:31:45 INFO:Epoch[1] Batch [15000]	Speed: 23572.26 samples/sec	loss=0.000113
06:32:12 INFO:Epoch[1] Batch [20000]	Speed: 23616.47 samples/sec	loss=0.000113
06:32:40 INFO:Epoch[1] Batch [25000]	Speed: 23192.46 samples/sec	loss=0.000112
06:32:48 INFO:Epoch[1] Train-loss=0.000111
06:32:48 INFO:Epoch[1] Time cost=145.542
06:33:16 INFO:Epoch[2] Batch [5000]	Speed: 23121.89 samples/sec	loss=0.000110
06:33:44 INFO:Epoch[2] Batch [10000]	Speed: 23425.52 samples/sec	loss=0.000109
06:34:11 INFO:Epoch[2] Batch [15000]	Speed: 23443.41 samples/sec	loss=0.000108
06:34:38 INFO:Epoch[2] Batch [20000]	Speed: 23487.69 samples/sec	loss=0.000107
06:35:05 INFO:Epoch[2] Batch [25000]	Speed: 23414.13 samples/sec	loss=0.000106
06:35:14 INFO:Epoch[2] Train-loss=0.000106
06:35:14 INFO:Epoch[2] Time cost=145.740
06:35:42 INFO:Epoch[3] Batch [5000]	Speed: 23379.98 samples/sec	loss=0.000105
06:36:09 INFO:Epoch[3] Batch [10000]	Speed: 23363.26 samples/sec	loss=0.000104
06:36:36 INFO:Epoch[3] Batch [15000]	Speed: 23450.45 samples/sec	loss=0.000103
06:37:04 INFO:Epoch[3] Batch [20000]	Speed: 23494.26 samples/sec	loss=0.000102
06:37:31 INFO:Epoch[3] Batch [25000]	Speed: 23388.84 samples/sec	loss=0.000102
06:37:39 INFO:Epoch[3] Train-loss=0.000102
06:37:39 INFO:Epoch[3] Time cost=145.431
   ...

With about 3.5 million samples and a batch size of 128, a full epoch should be roughly 27,000 batches; epochs 1 through 3 log past batch 25000, while epoch 0 never reaches batch 15000. The time cost of the first epoch is also much lower (roughly 82 s vs. 145 s for the rest), so this does not seem to be an issue with the Speedometer logging itself.

The training loop is fairly uncomplicated:

    # train_data / train_labels hold the (sparse) feature and label matrices;
    # batch_size is 128, and 'discard' drops the final incomplete batch.
    train_iter = mx.io.NDArrayIter(train_data, train_labels,
                                   batch_size=batch_size,
                                   last_batch_handle='discard',
                                   data_name='X', label_name='Y')

    # Report throughput and the running loss every `speedometer_frequency` batches.
    speedometer = mx.callback.Speedometer(batch_size, speedometer_frequency)

    module.fit(train_iter,
               eval_metric='loss',
               batch_end_callback=speedometer,
               num_epoch=num_epoch,
               kvstore=mx.kvstore.create('device'))

Any suggestions on tracking this down?

Can you try resetting the iterator, i.e. calling train_iter.reset(), before every epoch? That will hopefully fix the problem.
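
If I remember correctly, fit() resets the data iterator at the end of each epoch (but not before the first one), which would explain why only epoch 0 came up short. So a single reset right before the call should be enough, e.g.:

    # Start from the beginning of the data, in case the iterator was already
    # partially consumed earlier in the interactive session.
    train_iter.reset()

    module.fit(train_iter,
               eval_metric='loss',
               batch_end_callback=speedometer,
               num_epoch=num_epoch,
               kvstore=mx.kvstore.create('device'))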

D’oh. That may just be it. I have been running this in an interactive session, and likely missed resetting the iterator.
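
For anyone who runs into the same thing: the behaviour is easy to reproduce with a toy NDArrayIter (unrelated to the real dataset above). Pulling a few batches interactively and then iterating again without a reset resumes mid-stream:

    import numpy as np
    import mxnet as mx

    # Toy iterator: 10 samples, batch size 2 -> 5 batches per full pass.
    data = np.arange(20, dtype='float32').reshape(10, 2)
    labels = np.zeros(10, dtype='float32')
    it = mx.io.NDArrayIter(data, labels, batch_size=2)

    # Peek at a couple of batches interactively (e.g. to check shapes).
    next(it)
    next(it)

    # Without a reset, iteration picks up where it left off: 3 of 5 batches remain.
    print(sum(1 for _ in it))   # 3

    it.reset()
    print(sum(1 for _ in it))   # 5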