Kvstore for distributed multi-gpu training

I am running a distributed training job (ResNet-34 on CIFAR-10) on a 2-4 node cluster of p2.8xlarge instances (8 GPUs each). The multi-devices how-to doc suggests dist_device_sync is a good choice for kvstore, but when I run the job this way the model never learns: accuracy never gets better than random (about 10% for the 10 CIFAR-10 classes), no matter how long I train.

When I use kvstore = dist_sync the training job works fine.

Am I doing something wrong, or is there a problem with dist_device_sync?

I am using mxnet-cu80 0.12 (the release version) on AWS p2.8xlarge instances.

Hi,

Are you using the Module or the Gluon API for training? Would you mind pasting the training log and setup (batch_size, learning_rate) for both dist_sync and dist_device_sync?

Here’s the correct log.

kvstore was dist_device_sync
gpus = 16
nodes = 2
batch size was 128 * gpus
learning rate was 0.1 * gpus <-- maybe the problem?
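
To make the suspicion about the learning rate concrete, here is a minimal sketch of the arithmetic behind the settings above. It assumes that `gpus = 16` is the per-node GPU count and that each of the 2 workers contributes one batch of `128 * gpus` samples to every synchronous update; `scaled_hyperparams` is a hypothetical helper, not part of MXNet.

```python
def scaled_hyperparams(base_batch, base_lr, gpus_per_node, nodes):
    """Return (per-worker batch, effective global batch, learning rate).

    Assumption: batch size and learning rate are both multiplied by the
    per-node GPU count, as in the settings listed above.
    """
    per_worker_batch = base_batch * gpus_per_node
    learning_rate = base_lr * gpus_per_node
    # Assumption: with synchronous updates, every worker contributes one
    # batch per step, so the effective batch grows with the node count.
    global_batch = per_worker_batch * nodes
    return per_worker_batch, global_batch, learning_rate


per_worker, global_batch, lr = scaled_hyperparams(128, 0.1, 16, 2)
print(per_worker, global_batch, lr)
```

Under these assumptions the job runs with 2048 samples per worker, 4096 samples per update, and a learning rate of 1.6, which is large enough that it seems worth ruling out before blaming the kvstore mode.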

Training log for node 1 (node 2 was similar):

[12:46:43] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/train.rec, use 31 threads for decoding..
 [12:47:27] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/test.rec, use 31 threads for decoding..
 [12:47:45] src/operator/././cudnn_algoreg-inl.h:106: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
 [12:48:00] src/kvstore/././comm.h:579: only 114 out of 240 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
 [12:48:00] src/kvstore/././comm.h:588: .vvvvvvvv.......
 [12:48:00] src/kvstore/././comm.h:588: v.vvvvvvv.......
 [12:48:00] src/kvstore/././comm.h:588: vv.vvvvvv.......
 [12:48:00] src/kvstore/././comm.h:588: vvv.vvvvv.......
 [12:48:00] src/kvstore/././comm.h:588: vvvv.vvvv.......
 [12:48:00] src/kvstore/././comm.h:588: vvvvv.vvv.......
 [12:48:00] src/kvstore/././comm.h:588: vvvvvv.vv.......
 [12:48:00] src/kvstore/././comm.h:588: vvvvvvv.v.......
 [12:48:00] src/kvstore/././comm.h:588: vvvvvvvv........
 [12:48:00] src/kvstore/././comm.h:588: ..........vvvvvv
 [12:48:00] src/kvstore/././comm.h:588: .........v.vvvvv
 [12:48:00] src/kvstore/././comm.h:588: .........vv.vvvv
 [12:48:00] src/kvstore/././comm.h:588: .........vvv.vvv
 [12:48:00] src/kvstore/././comm.h:588: .........vvvv.vv
 [12:48:00] src/kvstore/././comm.h:588: .........vvvvv.v
 [12:48:00] src/kvstore/././comm.h:588: .........vvvvvv.
 2017-11-03 12:48:06,741 INFO - root - Epoch [0] Batch [1]#011Speed: 2559.744593 samples/sec#011accuracy=0.095459
 2017-11-03 12:48:07,441 INFO - root - Epoch [0] Batch [2]#011Speed: 2928.739573 samples/sec#011accuracy=0.098796
 2017-11-03 12:48:08,039 INFO - root - Epoch [0] Batch [3]#011Speed: 3425.923547 samples/sec#011accuracy=0.097412
 2017-11-03 12:48:08,536 INFO - root - Epoch [0] Batch [4]#011Speed: 4120.493250 samples/sec#011accuracy=0.097949
 2017-11-03 12:48:09,015 INFO - root - Epoch [0] Batch [5]#011Speed: 4280.194206 samples/sec#011accuracy=0.098307
 2017-11-03 12:48:09,497 INFO - root - Epoch [0] Batch [6]#011Speed: 4250.700501 samples/sec#011accuracy=0.100307
 2017-11-03 12:48:09,995 INFO - root - Epoch [0] Batch [7]#011Speed: 4108.029366 samples/sec#011accuracy=0.099854
 2017-11-03 12:48:10,479 INFO - root - Epoch [0] Batch [8]#011Speed: 4233.949729 samples/sec#011accuracy=0.100694
 2017-11-03 12:48:10,951 INFO - root - Epoch [0] Batch [9]#011Speed: 4346.664348 samples/sec#011accuracy=0.100049
 2017-11-03 12:48:11,458 INFO - root - Epoch [0] Batch [10]#011Speed: 4039.598046 samples/sec#011accuracy=0.099831
 2017-11-03 12:48:11,952 INFO - root - Epoch [0] Batch [11]#011Speed: 4148.923127 samples/sec#011accuracy=0.100505
 2017-11-03 12:48:12,467 INFO - root - Epoch [0] Batch [12]#011Speed: 3977.285587 samples/sec#011accuracy=0.100586
 2017-11-03 12:48:12,963 INFO - root - Epoch [0] Batch [13]#011Speed: 4131.673240 samples/sec#011accuracy=0.100586
 2017-11-03 12:48:13,472 INFO - root - Epoch [0] Batch [14]#011Speed: 4019.760428 samples/sec#011accuracy=0.100944
 2017-11-03 12:48:13,922 INFO - root - Epoch [0] Batch [15]#011Speed: 4554.551692 samples/sec#011accuracy=0.100983
 2017-11-03 12:48:14,436 INFO - root - Epoch [0] Batch [16]#011Speed: 3985.312500 samples/sec#011accuracy=0.100730
 2017-11-03 12:48:14,947 INFO - root - Epoch [0] Batch [17]#011Speed: 4010.873141 samples/sec#011accuracy=0.100260
 2017-11-03 12:48:15,465 INFO - root - Epoch [0] Batch [18]#011Speed: 3955.196169 samples/sec#011accuracy=0.100560
 2017-11-03 12:48:15,971 INFO - root - Epoch [0] Batch [19]#011Speed: 4050.497494 samples/sec#011accuracy=0.100757
 2017-11-03 12:48:16,481 INFO - root - Epoch [0] Batch [20]#011Speed: 4018.293710 samples/sec#011accuracy=0.100237
 2017-11-03 12:48:16,957 INFO - root - Epoch [0] Batch [21]#011Speed: 4302.304877 samples/sec#011accuracy=0.100519
 2017-11-03 12:48:17,434 INFO - root - Epoch [0] Batch [22]#011Speed: 4291.782793 samples/sec#011accuracy=0.100034
 2017-11-03 12:48:17,932 INFO - root - Epoch [0] Batch [23]#011Speed: 4120.916276 samples/sec#011accuracy=0.100179
 2017-11-03 12:48:18,400 INFO - root - Epoch [0] Batch [24]#011Speed: 4373.459969 samples/sec#011accuracy=0.100234
 2017-11-03 12:48:18,400 INFO - root - [Epoch 0] training: accuracy=0.100234
 2017-11-03 12:48:18,401 INFO - root - [Epoch 0] time cost: 50.791865
 2017-11-03 12:48:19,618 INFO - root - [Epoch 0] validation: accuracy=0.100586
 2017-11-03 12:48:21,094 INFO - root - Epoch [1] Batch [1]#011Speed: 4031.099580 samples/sec#011accuracy=0.093506
 2017-11-03 12:48:21,600 INFO - root - Epoch [1] Batch [2]#011Speed: 4050.359981 samples/sec#011accuracy=0.098796
 2017-11-03 12:48:22,083 INFO - root - Epoch [1] Batch [3]#011Speed: 4242.299152 samples/sec#011accuracy=0.099121
 2017-11-03 12:48:22,576 INFO - root - Epoch [1] Batch [4]#011Speed: 4154.157958 samples/sec#011accuracy=0.099902
 2017-11-03 12:48:23,062 INFO - root - Epoch [1] Batch [5]#011Speed: 4216.862568 samples/sec#011accuracy=0.101237
 2017-11-03 12:48:23,542 INFO - root - Epoch [1] Batch [6]#011Speed: 4271.801317 samples/sec#011accuracy=0.101214
 2017-11-03 12:48:24,046 INFO - root - Epoch [1] Batch [7]#011Speed: 4064.048084 samples/sec#011accuracy=0.101440
 2017-11-03 12:48:24,541 INFO - root - Epoch [1] Batch [8]#011Speed: 4137.506222 samples/sec#011accuracy=0.100857
 2017-11-03 12:48:25,034 INFO - root - Epoch [1] Batch [9]#011Speed: 4155.994980 samples/sec#011accuracy=0.100879
 2017-11-03 12:48:25,538 INFO - root - Epoch [1] Batch [10]#011Speed: 4066.210430 samples/sec#011accuracy=0.099565
 2017-11-03 12:48:26,030 INFO - root - Epoch [1] Batch [11]#011Speed: 4165.489309 samples/sec#011accuracy=0.099528
 2017-11-03 12:48:26,526 INFO - root - Epoch [1] Batch [12]#011Speed: 4131.281781 samples/sec#011accuracy=0.099835
 2017-11-03 12:48:27,021 INFO - root - Epoch [1] Batch [13]#011Speed: 4139.398357 samples/sec#011accuracy=0.099854
 2017-11-03 12:48:27,499 INFO - root - Epoch [1] Batch [14]#011Speed: 4286.133575 samples/sec#011accuracy=0.099902
 2017-11-03 12:48:27,990 INFO - root - Epoch [1] Batch [15]#011Speed: 4169.067705 samples/sec#011accuracy=0.100433
 2017-11-03 12:48:28,483 INFO - root - Epoch [1] Batch [16]#011Speed: 4156.908065 samples/sec#011accuracy=0.100327
 2017-11-03 12:48:28,984 INFO - root - Epoch [1] Batch [17]#011Speed: 4090.265851 samples/sec#011accuracy=0.100505
 2017-11-03 12:48:29,473 INFO - root - Epoch [1] Batch [18]#011Speed: 4187.658546 samples/sec#011accuracy=0.100792
 2017-11-03 12:48:29,971 INFO - root - Epoch [1] Batch [19]#011Speed: 4117.775307 samples/sec#011accuracy=0.100757
 2017-11-03 12:48:30,444 INFO - root - Epoch [1] Batch [20]#011Speed: 4333.603538 samples/sec#011accuracy=0.100818
 2017-11-03 12:48:30,943 INFO - root - Epoch [1] Batch [21]#011Speed: 4099.814095 samples/sec#011accuracy=0.101185
 2017-11-03 12:48:31,428 INFO - root - Epoch [1] Batch [22]#011Speed: 4231.692908 samples/sec#011accuracy=0.101159
 2017-11-03 12:48:31,905 INFO - root - Epoch [1] Batch [23]#011Speed: 4293.005393 samples/sec#011accuracy=0.100993
 2017-11-03 12:48:31,905 INFO - root - [Epoch 1] training: accuracy=0.100993
 2017-11-03 12:48:31,905 INFO - root - [Epoch 1] time cost: 11.829044
 2017-11-03 12:48:32,945 INFO - root - [Epoch 1] validation: accuracy=0.100195
 2017-11-03 12:48:33,879 INFO - root - Epoch [2] Batch [1]#011Speed: 4462.542459 samples/sec#011accuracy=0.093994
 2017-11-03 12:48:34,367 INFO - root - Epoch [2] Batch [2]#011Speed: 4195.029117 samples/sec#011accuracy=0.095378
 2017-11-03 12:48:34,854 INFO - root - Epoch [2] Batch [3]#011Speed: 4209.722520 samples/sec#011accuracy=0.099609
 2017-11-03 12:48:35,331 INFO - root - Epoch [2] Batch [4]#011Speed: 4297.195392 samples/sec#011accuracy=0.097461
 2017-11-03 12:48:35,824 INFO - root - Epoch [2] Batch [5]#011Speed: 4154.563812 samples/sec#011accuracy=0.097493
 2017-11-03 12:48:36,336 INFO - root - Epoch [2] Batch [6]#011Speed: 3997.298472 samples/sec#011accuracy=0.096959
 2017-11-03 12:48:36,837 INFO - root - Epoch [2] Batch [7]#011Speed: 4092.154005 samples/sec#011accuracy=0.097839
 2017-11-03 12:48:37,345 INFO - root - Epoch [2] Batch [8]#011Speed: 4033.433423 samples/sec#011accuracy=0.098579
 
 ...

 2017-11-03 12:59:07,585 INFO - root - Epoch [49] Batch [20]#011Speed: 4135.936405 samples/sec#011accuracy=0.102121
 2017-11-03 12:59:08,084 INFO - root - Epoch [49] Batch [21]#011Speed: 4108.638486 samples/sec#011accuracy=0.102539
 2017-11-03 12:59:08,565 INFO - root - Epoch [49] Batch [22]#011Speed: 4260.070925 samples/sec#011accuracy=0.102475
 2017-11-03 12:59:09,062 INFO - root - Epoch [49] Batch [23]#011Speed: 4117.798994 samples/sec#011accuracy=0.102376
 2017-11-03 12:59:09,063 INFO - root - [Epoch 49] training: accuracy=0.102376
 2017-11-03 12:59:09,063 INFO - root - [Epoch 49] time cost: 11.872868
 2017-11-03 12:59:10,256 INFO - root - [Epoch 49] validation: accuracy=0.100098
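
To compare runs without scrolling through every batch line, the per-epoch validation accuracy can be pulled out of a log like the one above. This is a minimal sketch that relies only on the `[Epoch N] validation: accuracy=X` format shown in the log; the function names and the 0.12 threshold are arbitrary choices for illustration.

```python
import re

# Matches lines such as "[Epoch 0] validation: accuracy=0.100586".
VALIDATION_RE = re.compile(r"\[Epoch (\d+)\] validation: accuracy=([0-9.]+)")


def validation_by_epoch(log_text):
    """Map each epoch to the list of validation accuracies logged for it.

    A list is used because logs from several workers may be interleaved,
    so the same epoch can appear more than once.
    """
    result = {}
    for match in VALIDATION_RE.finditer(log_text):
        epoch = int(match.group(1))
        result.setdefault(epoch, []).append(float(match.group(2)))
    return result


def first_epoch_above(acc_by_epoch, threshold):
    """Return the first epoch whose best validation accuracy exceeds threshold."""
    for epoch in sorted(acc_by_epoch):
        if max(acc_by_epoch[epoch]) > threshold:
            return epoch
    return None


# Three validation lines copied from the log above.
sample = """\
2017-11-03 12:48:19,618 INFO - root - [Epoch 0] validation: accuracy=0.100586
2017-11-03 12:48:32,945 INFO - root - [Epoch 1] validation: accuracy=0.100195
2017-11-03 12:59:10,256 INFO - root - [Epoch 49] validation: accuracy=0.100098
"""
accs = validation_by_epoch(sample)
print(first_epoch_above(accs, 0.12))  # None: accuracy never leaves ~10%
```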

What about dist_sync? Does it use the same batch size and learning rate?

Yes, we use exactly the same batch size and learning rate. The only thing that differs is the mode we use to create the kv store.

We’ve experienced this problem with both Gluon and Module.
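
For reference, switching between the two modes comes down to the string passed to `mx.kv.create`. The sketch below assumes MXNet is installed and that the job is started through a distributed launcher; `make_kvstore` and `THREAD_MODES` are hypothetical names used only for illustration, not part of MXNet or of our actual training script.

```python
# The two kvstore modes compared in this thread.
THREAD_MODES = ("dist_sync", "dist_device_sync")


def make_kvstore(mode):
    """Create a kvstore for one of the two modes under comparison.

    MXNet is imported lazily so the mode check can run on a machine
    without MXNet or without a distributed launcher.
    """
    if mode not in THREAD_MODES:
        raise ValueError("unexpected kvstore mode: %r" % (mode,))
    import mxnet as mx  # assumption: MXNet is installed on the workers
    return mx.kv.create(mode)


# Usage inside a launched worker (not executed here):
#   kv = make_kvstore("dist_device_sync")
#   print(kv.rank, kv.num_workers)
```

Everything else (data iterators, batch size, learning rate, optimizer) stays identical between the two runs.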

Here is the training log from the same job run on 2 x p2.16xl with dist_sync. The job seems to learn nothing for the first 27 epochs, then starts learning for the rest of the job, but still ends up at a much lower accuracy than we would expect.

batch_size and learning rate are still (128, 0.1) * 16 GPUs


 2017-11-03 22:25:25,661 INFO - mxnet_container.train - Starting distributed training task
 [22:25:25] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/train.rec, use 31 threads for decoding..
 2017-11-03 22:25:28,741 INFO - mxnet_container.train - Starting distributed training task
 [22:25:28] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/train.rec, use 31 threads for decoding..
 [22:26:10] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/test.rec, use 31 threads for decoding..
 [22:26:13] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/test.rec, use 31 threads for decoding..
 [22:26:28] src/operator/././cudnn_algoreg-inl.h:106: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
 [22:26:31] src/operator/././cudnn_algoreg-inl.h:106: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
 2017-11-03 22:26:45,552 INFO - root - Epoch [0] Batch [1]#011Speed: 4045.751054 samples/sec#011accuracy=0.095703
 2017-11-03 22:26:45,533 INFO - root - Epoch [0] Batch [1]#011Speed: 4085.256768 samples/sec#011accuracy=0.093750
 2017-11-03 22:26:46,138 INFO - root - Epoch [0] Batch [2]#011Speed: 3502.770265 samples/sec#011accuracy=0.097331
 2017-11-03 22:26:46,690 INFO - root - Epoch [0] Batch [3]#011Speed: 3710.615808 samples/sec#011accuracy=0.094116
 2017-11-03 22:26:46,119 INFO - root - Epoch [0] Batch [2]#011Speed: 3495.596123 samples/sec#011accuracy=0.095703
 2017-11-03 22:26:46,644 INFO - root - Epoch [0] Batch [3]#011Speed: 3900.186517 samples/sec#011accuracy=0.092529
 2017-11-03 22:26:47,210 INFO - root - Epoch [0] Batch [4]#011Speed: 3935.812590 samples/sec#011accuracy=0.096289
 2017-11-03 22:26:47,737 INFO - root - Epoch [0] Batch [5]#011Speed: 3890.555702 samples/sec#011accuracy=0.095947
 2017-11-03 22:26:47,168 INFO - root - Epoch [0] Batch [4]#011Speed: 3914.267588 samples/sec#011accuracy=0.094531
 2017-11-03 22:26:47,685 INFO - root - Epoch [0] Batch [5]#011Speed: 3966.647853 samples/sec#011accuracy=0.093424
 2017-11-03 22:26:48,256 INFO - root - Epoch [0] Batch [6]#011Speed: 3944.484105 samples/sec#011accuracy=0.098424
 2017-11-03 22:26:48,778 INFO - root - Epoch [0] Batch [7]#011Speed: 3925.725347 samples/sec#011accuracy=0.098083
 2017-11-03 22:26:48,237 INFO - root - Epoch [0] Batch [6]#011Speed: 3710.952444 samples/sec#011accuracy=0.095843
 2017-11-03 22:26:48,727 INFO - root - Epoch [0] Batch [7]#011Speed: 4178.312976 samples/sec#011accuracy=0.095703
 2017-11-03 22:26:49,322 INFO - root - Epoch [0] Batch [8]#011Speed: 3765.308833 samples/sec#011accuracy=0.100043
 2017-11-03 22:26:49,848 INFO - root - Epoch [0] Batch [9]#011Speed: 3893.299482 samples/sec#011accuracy=0.100586
 2017-11-03 22:26:49,259 INFO - root - Epoch [0] Batch [8]#011Speed: 3855.778681 samples/sec#011accuracy=0.097222
 2017-11-03 22:26:49,762 INFO - root - Epoch [0] Batch [9]#011Speed: 4072.622472 samples/sec#011accuracy=0.098193
 2017-11-03 22:26:50,355 INFO - root - Epoch [0] Batch [10]#011Speed: 4046.526740 samples/sec#011accuracy=0.102184
 2017-11-03 22:26:50,879 INFO - root - Epoch [0] Batch [11]#011Speed: 3908.546857 samples/sec#011accuracy=0.102865
 2017-11-03 22:26:50,282 INFO - root - Epoch [0] Batch [10]#011Speed: 3942.100064 samples/sec#011accuracy=0.100186
 2017-11-03 22:26:50,807 INFO - root - Epoch [0] Batch [11]#011Speed: 3901.570038 samples/sec#011accuracy=0.101115
 2017-11-03 22:26:51,392 INFO - root - Epoch [0] Batch [12]#011Speed: 3994.211182 samples/sec#011accuracy=0.102314
 2017-11-03 22:26:51,913 INFO - root - Epoch [0] Batch [13]#011Speed: 3928.594436 samples/sec#011accuracy=0.102295
 2017-11-03 22:26:51,318 INFO - root - Epoch [0] Batch [12]#011Speed: 4009.577592 samples/sec#011accuracy=0.100548
 2017-11-03 22:26:51,834 INFO - root - Epoch [0] Batch [13]#011Speed: 3967.047207 samples/sec#011accuracy=0.100516
 2017-11-03 22:26:52,357 INFO - root - Epoch [0] Batch [14]#011Speed: 3918.576288 samples/sec#011accuracy=0.101237
 2017-11-03 22:26:52,893 INFO - root - Epoch [0] Batch [15]#011Speed: 3823.550308 samples/sec#011accuracy=0.101440
 2017-11-03 22:26:52,438 INFO - root - Epoch [0] Batch [14]#011Speed: 3908.404586 samples/sec#011accuracy=0.103027
 2017-11-03 22:26:53,376 INFO - root - Epoch [0] Batch [16]#011Speed: 4243.581764 samples/sec#011accuracy=0.101534
 2017-11-03 22:26:53,885 INFO - root - Epoch [0] Batch [17]#011Speed: 4024.611974 samples/sec#011accuracy=0.101237
 2017-11-03 22:26:52,956 INFO - root - Epoch [0] Batch [15]#011Speed: 3949.549565 samples/sec#011accuracy=0.103180
 2017-11-03 22:26:53,490 INFO - root - Epoch [0] Batch [16]#011Speed: 3837.368748 samples/sec#011accuracy=0.103257
 2017-11-03 22:26:54,412 INFO - root - Epoch [0] Batch [18]#011Speed: 3884.600085 samples/sec#011accuracy=0.102128
 2017-11-03 22:26:54,932 INFO - root - Epoch [0] Batch [19]#011Speed: 3946.482982 samples/sec#011accuracy=0.101709
 2017-11-03 22:26:53,995 INFO - root - Epoch [0] Batch [17]#011Speed: 4064.411520 samples/sec#011accuracy=0.102865
 2017-11-03 22:26:54,522 INFO - root - Epoch [0] Batch [18]#011Speed: 3883.216282 samples/sec#011accuracy=0.103850
 2017-11-03 22:26:55,444 INFO - root - Epoch [0] Batch [20]#011Speed: 3995.560007 samples/sec#011accuracy=0.101609
 2017-11-03 22:26:55,063 INFO - root - Epoch [0] Batch [19]#011Speed: 3787.262089 samples/sec#011accuracy=0.103394
 2017-11-03 22:26:55,592 INFO - root - Epoch [0] Batch [20]#011Speed: 3874.670196 samples/sec#011accuracy=0.103260
 2017-11-03 22:26:56,104 INFO - root - Epoch [0] Batch [21]#011Speed: 3999.109198 samples/sec#011accuracy=0.103671
 2017-11-03 22:26:56,613 INFO - root - Epoch [0] Batch [22]#011Speed: 4027.563274 samples/sec#011accuracy=0.103070
 2017-11-03 22:26:56,001 INFO - root - Epoch [0] Batch [21]#011Speed: 3682.970011 samples/sec#011accuracy=0.102051
 2017-11-03 22:26:56,520 INFO - root - Epoch [0] Batch [22]#011Speed: 3941.904690 samples/sec#011accuracy=0.101562
 2017-11-03 22:26:57,120 INFO - root - Epoch [0] Batch [23]#011Speed: 4038.038989 samples/sec#011accuracy=0.103068
 2017-11-03 22:26:57,689 INFO - root - Epoch [0] Batch [24]#011Speed: 3600.600493 samples/sec#011accuracy=0.103516
 2017-11-03 22:26:57,690 INFO - root - [Epoch 0] training: accuracy=0.103516
 2017-11-03 22:26:57,690 INFO - root - [Epoch 0] time cost: 43.874187
 2017-11-03 22:26:57,024 INFO - root - Epoch [0] Batch [23]#011Speed: 4067.406094 samples/sec#011accuracy=0.101644
 2017-11-03 22:26:57,557 INFO - root - Epoch [0] Batch [24]#011Speed: 3849.827179 samples/sec#011accuracy=0.102070
 2017-11-03 22:26:57,557 INFO - root - [Epoch 0] training: accuracy=0.102070
 2017-11-03 22:26:57,557 INFO - root - [Epoch 0] time cost: 47.130600
 2017-11-03 22:26:58,755 INFO - root - [Epoch 0] validation: accuracy=0.091797
 2017-11-03 22:26:58,688 INFO - root - [Epoch 0] validation: accuracy=0.091309
 
...

 2017-11-03 22:33:00,711 INFO - root - Epoch [27] Batch [1]#011Speed: 3891.532157 samples/sec#011accuracy=0.103271
 2017-11-03 22:33:00,497 INFO - root - Epoch [27] Batch [1]#011Speed: 4159.136148 samples/sec#011accuracy=0.103271
 2017-11-03 22:33:01,009 INFO - root - Epoch [27] Batch [2]#011Speed: 4004.935849 samples/sec#011accuracy=0.105306
 2017-11-03 22:33:01,233 INFO - root - Epoch [27] Batch [2]#011Speed: 3928.166859 samples/sec#011accuracy=0.105306
 2017-11-03 22:33:01,768 INFO - root - Epoch [27] Batch [3]#011Speed: 3830.513682 samples/sec#011accuracy=0.102539
 2017-11-03 22:33:01,527 INFO - root - Epoch [27] Batch [3]#011Speed: 3955.400149 samples/sec#011accuracy=0.102539
 2017-11-03 22:33:02,044 INFO - root - Epoch [27] Batch [4]#011Speed: 3962.825092 samples/sec#011accuracy=0.102734
 2017-11-03 22:33:02,287 INFO - root - Epoch [27] Batch [4]#011Speed: 3946.368758 samples/sec#011accuracy=0.102734
 2017-11-03 22:33:02,837 INFO - root - Epoch [27] Batch [5]#011Speed: 3725.492273 samples/sec#011accuracy=0.102702
 2017-11-03 22:33:03,372 INFO - root - Epoch [27] Batch [6]#011Speed: 3826.198655 samples/sec#011accuracy=0.101772
 2017-11-03 22:33:03,888 INFO - root - Epoch [27] Batch [7]#011Speed: 3969.885299 samples/sec#011accuracy=0.101257
 2017-11-03 22:33:04,415 INFO - root - Epoch [27] Batch [8]#011Speed: 3889.454694 samples/sec#011accuracy=0.101508
 2017-11-03 22:33:04,938 INFO - root - Epoch [27] Batch [9]#011Speed: 3921.651558 samples/sec#011accuracy=0.101904
 2017-11-03 22:33:02,571 INFO - root - Epoch [27] Batch [5]#011Speed: 3885.278298 samples/sec#011accuracy=0.102702
 2017-11-03 22:33:03,079 INFO - root - Epoch [27] Batch [6]#011Speed: 4032.980828 samples/sec#011accuracy=0.101772
 2017-11-03 22:33:03,610 INFO - root - Epoch [27] Batch [7]#011Speed: 3858.742270 samples/sec#011accuracy=0.101257
 2017-11-03 22:33:04,141 INFO - root - Epoch [27] Batch [8]#011Speed: 3858.835877 samples/sec#011accuracy=0.101508
 2017-11-03 22:33:04,675 INFO - root - Epoch [27] Batch [9]#011Speed: 3838.001415 samples/sec#011accuracy=0.101904
 2017-11-03 22:33:05,466 INFO - root - Epoch [27] Batch [10]#011Speed: 3880.898691 samples/sec#011accuracy=0.100231
 2017-11-03 22:33:06,007 INFO - root - Epoch [27] Batch [11]#011Speed: 3787.270438 samples/sec#011accuracy=0.100464
 2017-11-03 22:33:05,210 INFO - root - Epoch [27] Batch [10]#011Speed: 3827.787719 samples/sec#011accuracy=0.100231
 2017-11-03 22:33:05,741 INFO - root - Epoch [27] Batch [11]#011Speed: 3860.712431 samples/sec#011accuracy=0.100464
 2017-11-03 22:33:06,552 INFO - root - Epoch [27] Batch [12]#011Speed: 3758.812203 samples/sec#011accuracy=0.100323
 2017-11-03 22:33:07,070 INFO - root - Epoch [27] Batch [13]#011Speed: 3954.653540 samples/sec#011accuracy=0.101144
 2017-11-03 22:33:06,254 INFO - root - Epoch [27] Batch [12]#011Speed: 3995.691966 samples/sec#011accuracy=0.100323
 2017-11-03 22:33:06,801 INFO - root - Epoch [27] Batch [13]#011Speed: 3743.853362 samples/sec#011accuracy=0.101144
 2017-11-03 22:33:07,587 INFO - root - Epoch [27] Batch [14]#011Speed: 3961.828982 samples/sec#011accuracy=0.101270
 2017-11-03 22:33:08,107 INFO - root - Epoch [27] Batch [15]#011Speed: 3936.039824 samples/sec#011accuracy=0.101013
 2017-11-03 22:33:08,613 INFO - root - Epoch [27] Batch [16]#011Speed: 4052.253519 samples/sec#011accuracy=0.100615
 2017-11-03 22:33:09,135 INFO - root - Epoch [27] Batch [17]#011Speed: 3921.884323 samples/sec#011accuracy=0.100667
 2017-11-03 22:33:09,646 INFO - root - Epoch [27] Batch [18]#011Speed: 4013.883091 samples/sec#011accuracy=0.101203
 2017-11-03 22:33:07,319 INFO - root - Epoch [27] Batch [14]#011Speed: 3956.859580 samples/sec#011accuracy=0.101270
 2017-11-03 22:33:07,853 INFO - root - Epoch [27] Batch [15]#011Speed: 3837.535038 samples/sec#011accuracy=0.101013
 2017-11-03 22:33:08,369 INFO - root - Epoch [27] Batch [16]#011Speed: 3965.957418 samples/sec#011accuracy=0.100615
 2017-11-03 22:33:08,898 INFO - root - Epoch [27] Batch [17]#011Speed: 3872.495467 samples/sec#011accuracy=0.100667
 2017-11-03 22:33:09,407 INFO - root - Epoch [27] Batch [18]#011Speed: 4027.119548 samples/sec#011accuracy=0.101203
 2017-11-03 22:33:09,934 INFO - root - Epoch [27] Batch [19]#011Speed: 3888.421195 samples/sec#011accuracy=0.100879
 2017-11-03 22:33:10,165 INFO - root - Epoch [27] Batch [19]#011Speed: 3946.909116 samples/sec#011accuracy=0.100879
 2017-11-03 22:33:10,666 INFO - root - Epoch [27] Batch [20]#011Speed: 4085.966047 samples/sec#011accuracy=0.100865
 2017-11-03 22:33:10,474 INFO - root - Epoch [27] Batch [20]#011Speed: 3795.257170 samples/sec#011accuracy=0.100865
 2017-11-03 22:33:10,964 INFO - root - Epoch [27] Batch [21]#011Speed: 4185.373296 samples/sec#011accuracy=0.100741
 2017-11-03 22:33:11,186 INFO - root - Epoch [27] Batch [21]#011Speed: 3944.308417 samples/sec#011accuracy=0.100741
 2017-11-03 22:33:11,739 INFO - root - Epoch [27] Batch [22]#011Speed: 3702.325023 samples/sec#011accuracy=0.100650
 2017-11-03 22:33:11,482 INFO - root - Epoch [27] Batch [22]#011Speed: 3950.205234 samples/sec#011accuracy=0.100650
 2017-11-03 22:33:11,998 INFO - root - Epoch [27] Batch [23]#011Speed: 3972.811973 samples/sec#011accuracy=0.100728
 2017-11-03 22:33:11,998 INFO - root - [Epoch 27] training: accuracy=0.100728
 2017-11-03 22:33:11,999 INFO - root - [Epoch 27] time cost: 12.480929
 2017-11-03 22:33:12,261 INFO - root - Epoch [27] Batch [23]#011Speed: 3928.948425 samples/sec#011accuracy=0.100728
 2017-11-03 22:33:12,261 INFO - root - [Epoch 27] training: accuracy=0.100728
 2017-11-03 22:33:12,261 INFO - root - [Epoch 27] time cost: 12.609994
 2017-11-03 22:33:13,179 INFO - root - [Epoch 27] validation: accuracy=0.099316
 2017-11-03 22:33:13,260 INFO - root - [Epoch 27] validation: accuracy=0.099316
 2017-11-03 22:33:14,169 INFO - root - Epoch [28] Batch [1]#011Speed: 4089.488882 samples/sec#011accuracy=0.110107
 2017-11-03 22:33:14,315 INFO - root - Epoch [28] Batch [1]#011Speed: 3956.608066 samples/sec#011accuracy=0.110352
 2017-11-03 22:33:14,841 INFO - root - Epoch [28] Batch [2]#011Speed: 3896.230920 samples/sec#011accuracy=0.107910
 2017-11-03 22:33:14,699 INFO - root - Epoch [28] Batch [2]#011Speed: 3868.099327 samples/sec#011accuracy=0.107747
 2017-11-03 22:33:15,209 INFO - root - Epoch [28] Batch [3]#011Speed: 4020.479134 samples/sec#011accuracy=0.107544
 2017-11-03 22:33:15,754 INFO - root - Epoch [28] Batch [4]#011Speed: 3760.192691 samples/sec#011accuracy=0.105371
 2017-11-03 22:33:16,261 INFO - root - Epoch [28] Batch [5]#011Speed: 4036.096127 samples/sec#011accuracy=0.104492
 2017-11-03 22:33:16,784 INFO - root - Epoch [28] Batch [6]#011Speed: 3917.107444 samples/sec#011accuracy=0.104562
 2017-11-03 22:33:15,358 INFO - root - Epoch [28] Batch [3]#011Speed: 3962.134159 samples/sec#011accuracy=0.107178
 2017-11-03 22:33:15,848 INFO - root - Epoch [28] Batch [4]#011Speed: 4185.179573 samples/sec#011accuracy=0.105078
 2017-11-03 22:33:16,405 INFO - root - Epoch [28] Batch [5]#011Speed: 3677.936449 samples/sec#011accuracy=0.104248
 2017-11-03 22:33:16,914 INFO - root - Epoch [28] Batch [6]#011Speed: 4023.884250 samples/sec#011accuracy=0.104422
 2017-11-03 22:33:17,465 INFO - root - Epoch [28] Batch [7]#011Speed: 3719.645939 samples/sec#011accuracy=0.106323
 2017-11-03 22:33:17,971 INFO - root - Epoch [28] Batch [8]#011Speed: 4045.057572 samples/sec#011accuracy=0.106717
 2017-11-03 22:33:17,308 INFO - root - Epoch [28] Batch [7]#011Speed: 3908.575312 samples/sec#011accuracy=0.106567
 2017-11-03 22:33:17,836 INFO - root - Epoch [28] Batch [8]#011Speed: 3882.089598 samples/sec#011accuracy=0.107368
 2017-11-03 22:33:18,487 INFO - root - Epoch [28] Batch [9]#011Speed: 3976.620898 samples/sec#011accuracy=0.107813
 2017-11-03 22:33:18,998 INFO - root - Epoch [28] Batch [10]#011Speed: 4009.223895 samples/sec#011accuracy=0.106756
 2017-11-03 22:33:18,350 INFO - root - Epoch [28] Batch [9]#011Speed: 3986.886613 samples/sec#011accuracy=0.108301
 2017-11-03 22:33:18,881 INFO - root - Epoch [28] Batch [10]#011Speed: 3858.471877 samples/sec#011accuracy=0.107289
 2017-11-03 22:33:19,512 INFO - root - Epoch [28] Batch [11]#011Speed: 3980.702756 samples/sec#011accuracy=0.106689
 2017-11-03 22:33:20,037 INFO - root - Epoch [28] Batch [12]#011Speed: 3906.122560 samples/sec#011accuracy=0.107197
 2017-11-03 22:33:19,367 INFO - root - Epoch [28] Batch [11]#011Speed: 4212.614888 samples/sec#011accuracy=0.106934
 2017-11-03 22:33:19,900 INFO - root - Epoch [28] Batch [12]#011Speed: 3844.103884 samples/sec#011accuracy=0.107459
 2017-11-03 22:33:20,444 INFO - root - Epoch [28] Batch [13]#011Speed: 3769.861415 samples/sec#011accuracy=0.107247
 2017-11-03 22:33:20,943 INFO - root - Epoch [28] Batch [14]#011Speed: 4105.474992 samples/sec#011accuracy=0.108236
 2017-11-03 22:33:21,463 INFO - root - Epoch [28] Batch [15]#011Speed: 3942.007801 samples/sec#011accuracy=0.108795
 2017-11-03 22:33:21,979 INFO - root - Epoch [28] Batch [16]#011Speed: 3968.539082 samples/sec#011accuracy=0.109404
 2017-11-03 22:33:20,557 INFO - root - Epoch [28] Batch [13]#011Speed: 3939.529185 samples/sec#011accuracy=0.106934
 2017-11-03 22:33:21,075 INFO - root - Epoch [28] Batch [14]#011Speed: 3954.422331 samples/sec#011accuracy=0.107878
 2017-11-03 22:33:21,574 INFO - root - Epoch [28] Batch [15]#011Speed: 4109.338215 samples/sec#011accuracy=0.108307
 2017-11-03 22:33:22,082 INFO - root - Epoch [28] Batch [16]#011Speed: 4026.343734 samples/sec#011accuracy=0.108887
 2017-11-03 22:33:22,626 INFO - root - Epoch [28] Batch [17]#011Speed: 3768.439101 samples/sec#011accuracy=0.109429
 2017-11-03 22:33:23,154 INFO - root - Epoch [28] Batch [18]#011Speed: 3883.539316 samples/sec#011accuracy=0.109606
 2017-11-03 22:33:22,517 INFO - root - Epoch [28] Batch [17]#011Speed: 3806.761548 samples/sec#011accuracy=0.109945
 2017-11-03 22:33:23,041 INFO - root - Epoch [28] Batch [18]#011Speed: 3913.010514 samples/sec#011accuracy=0.110223
 2017-11-03 22:33:23,678 INFO - root - Epoch [28] Batch [19]#011Speed: 3908.054289 samples/sec#011accuracy=0.110669
 2017-11-03 22:33:23,555 INFO - root - Epoch [28] Batch [19]#011Speed: 3987.671360 samples/sec#011accuracy=0.111108
 2017-11-03 22:33:24,079 INFO - root - Epoch [28] Batch [20]#011Speed: 3907.851608 samples/sec#011accuracy=0.111630
 2017-11-03 22:33:24,170 INFO - root - Epoch [28] Batch [20]#011Speed: 4162.263917 samples/sec#011accuracy=0.111096
 2017-11-03 22:33:24,720 INFO - root - Epoch [28] Batch [21]#011Speed: 3726.141922 samples/sec#011accuracy=0.111395
 2017-11-03 22:33:24,666 INFO - root - Epoch [28] Batch [21]#011Speed: 3493.122722 samples/sec#011accuracy=0.111950
 2017-11-03 22:33:25,148 INFO - root - Epoch [28] Batch [22]#011Speed: 4248.125342 samples/sec#011accuracy=0.112220
 2017-11-03 22:33:25,639 INFO - root - Epoch [28] Batch [23]#011Speed: 4177.947174 samples/sec#011accuracy=0.113322
 2017-11-03 22:33:26,172 INFO - root - Epoch [28] Batch [24]#011Speed: 3839.627760 samples/sec#011accuracy=0.113281
 2017-11-03 22:33:26,173 INFO - root - [Epoch 28] training: accuracy=0.113281
 2017-11-03 22:33:26,173 INFO - root - [Epoch 28] time cost: 12.993251
 2017-11-03 22:33:25,251 INFO - root - Epoch [28] Batch [22]#011Speed: 3857.295414 samples/sec#011accuracy=0.111753
 2017-11-03 22:33:25,763 INFO - root - Epoch [28] Batch [23]#011Speed: 4002.282386 samples/sec#011accuracy=0.112854
 2017-11-03 22:33:26,277 INFO - root - Epoch [28] Batch [24]#011Speed: 3988.563828 samples/sec#011accuracy=0.112773
 2017-11-03 22:33:26,277 INFO - root - [Epoch 28] training: accuracy=0.112773
 2017-11-03 22:33:26,277 INFO - root - [Epoch 28] time cost: 13.016735
 2017-11-03 22:33:27,348 INFO - root - [Epoch 28] validation: accuracy=0.128809
 2017-11-03 22:33:27,301 INFO - root - [Epoch 28] validation: accuracy=0.129199
 2017-11-03 22:33:28,883 INFO - root - Epoch [29] Batch [1]#011Speed: 3880.190457 samples/sec#011accuracy=0.120117
 2017-11-03 22:33:28,601 INFO - root - Epoch [29] Batch [1]#011Speed: 2550.318582 samples/sec#011accuracy=0.118896
 2017-11-03 22:33:29,120 INFO - root - Epoch [29] Batch [2]#011Speed: 3949.809264 samples/sec#011accuracy=0.122396
 2017-11-03 22:33:29,417 INFO - root - Epoch [29] Batch [2]#011Speed: 3836.218822 samples/sec#011accuracy=0.122884
 2017-11-03 22:33:29,936 INFO - root - Epoch [29] Batch [3]#011Speed: 3946.125827 samples/sec#011accuracy=0.121948
 2017-11-03 22:33:29,666 INFO - root - Epoch [29] Batch [3]#011Speed: 3750.874778 samples/sec#011accuracy=0.121338
 2017-11-03 22:33:30,178 INFO - root - Epoch [29] Batch [4]#011Speed: 4003.565760 samples/sec#011accuracy=0.121289
 2017-11-03 22:33:30,458 INFO - root - Epoch [29] Batch [4]#011Speed: 3925.183599 samples/sec#011accuracy=0.121777
 2017-11-03 22:33:30,990 INFO - root - Epoch [29] Batch [5]#011Speed: 3851.043978 samples/sec#011accuracy=0.121989
 2017-11-03 22:33:30,711 INFO - root - Epoch [29] Batch [5]#011Speed: 3842.329369 samples/sec#011accuracy=0.122070
 2017-11-03 22:33:31,514 INFO - root - Epoch [29] Batch [6]#011Speed: 3908.920366 samples/sec#011accuracy=0.125000
 2017-11-03 22:33:32,049 INFO - root - Epoch [29] Batch [7]#011Speed: 3832.034532 samples/sec#011accuracy=0.126221
 2017-11-03 22:33:31,260 INFO - root - Epoch [29] Batch [6]#011Speed: 3730.352355 samples/sec#011accuracy=0.122977
 2017-11-03 22:33:31,795 INFO - root - Epoch [29] Batch [7]#011Speed: 3832.120009 samples/sec#011accuracy=0.124512
 2017-11-03 22:33:32,572 INFO - root - Epoch [29] Batch [8]#011Speed: 3922.154723 samples/sec#011accuracy=0.127441
 2017-11-03 22:33:33,097 INFO - root - Epoch [29] Batch [9]#011Speed: 3896.959166 samples/sec#011accuracy=0.128223
 2017-11-03 22:33:33,630 INFO - root - Epoch [29] Batch [10]#011Speed: 3842.906938 samples/sec#011accuracy=0.130194
 2017-11-03 22:33:34,161 INFO - root - Epoch [29] Batch [11]#011Speed: 3863.268310 samples/sec#011accuracy=0.130859
 2017-11-03 22:33:34,678 INFO - root - Epoch [29] Batch [12]#011Speed: 3959.316260 samples/sec#011accuracy=0.131611
 2017-11-03 22:33:32,320 INFO - root - Epoch [29] Batch [8]#011Speed: 3900.886126 samples/sec#011accuracy=0.125163
 2017-11-03 22:33:32,842 INFO - root - Epoch [29] Batch [9]#011Speed: 3928.405788 samples/sec#011accuracy=0.126221
 2017-11-03 22:33:33,364 INFO - root - Epoch [29] Batch [10]#011Speed: 3923.363921 samples/sec#011accuracy=0.128196
 2017-11-03 22:33:33,900 INFO - root - Epoch [29] Batch [11]#011Speed: 3823.242282 samples/sec#011accuracy=0.129150
 2017-11-03 22:33:34,428 INFO - root - Epoch [29] Batch [12]#011Speed: 3879.552566 samples/sec#011accuracy=0.130108
 2017-11-03 22:33:34,969 INFO - root - Epoch [29] Batch [13]#011Speed: 3784.323877 samples/sec#011accuracy=0.129778
 2017-11-03 22:33:35,217 INFO - root - Epoch [29] Batch [13]#011Speed: 3805.572567 samples/sec#011accuracy=0.131208
 2017-11-03 22:33:35,723 INFO - root - Epoch [29] Batch [14]#011Speed: 4047.638377 samples/sec#011accuracy=0.130208
 2017-11-03 22:33:35,502 INFO - root - Epoch [29] Batch [14]#011Speed: 3847.721604 samples/sec#011accuracy=0.128711
 2017-11-03 22:33:36,008 INFO - root - Epoch [29] Batch [15]#011Speed: 4048.678109 samples/sec#011accuracy=0.128235
 2017-11-03 22:33:36,229 INFO - root - Epoch [29] Batch [15]#011Speed: 4050.873794 samples/sec#011accuracy=0.129761
 2017-11-03 22:33:36,740 INFO - root - Epoch [29] Batch [16]#011Speed: 4002.657240 samples/sec#011accuracy=0.130342
 2017-11-03 22:33:36,541 INFO - root - Epoch [29] Batch [16]#011Speed: 3841.508008 samples/sec#011accuracy=0.128964
 2017-11-03 22:33:37,042 INFO - root - Epoch [29] Batch [17]#011Speed: 4090.347655 samples/sec#011accuracy=0.129639
 2017-11-03 22:33:37,260 INFO - root - Epoch [29] Batch [17]#011Speed: 3940.250211 samples/sec#011accuracy=0.131266
 2017-11-03 22:33:37,802 INFO - root - Epoch [29] Batch [18]#011Speed: 3786.492474 samples/sec#011accuracy=0.132299
 2017-11-03 22:33:38,340 INFO - root - Epoch [29] Batch [19]#011Speed: 3807.576555 samples/sec#011accuracy=0.132666
 2017-11-03 22:33:38,867 INFO - root - Epoch [29] Batch [20]#011Speed: 3888.197665 samples/sec#011accuracy=0.133394
 2017-11-03 22:33:39,393 INFO - root - Epoch [29] Batch [21]#011Speed: 3890.259690 samples/sec#011accuracy=0.134011
 2017-11-03 22:33:39,914 INFO - root - Epoch [29] Batch [22]#011Speed: 3934.420899 samples/sec#011accuracy=0.134129
 2017-11-03 22:33:37,543 INFO - root - Epoch [29] Batch [18]#011Speed: 4091.851861 samples/sec#011accuracy=0.130859
 2017-11-03 22:33:38,061 INFO - root - Epoch [29] Batch [19]#011Speed: 3959.026115 samples/sec#011accuracy=0.131128
 2017-11-03 22:33:38,587 INFO - root - Epoch [29] Batch [20]#011Speed: 3895.186751 samples/sec#011accuracy=0.132092
 2017-11-03 22:33:39,106 INFO - root - Epoch [29] Batch [21]#011Speed: 3942.290030 samples/sec#011accuracy=0.132812
 2017-11-03 22:33:39,644 INFO - root - Epoch [29] Batch [22]#011Speed: 3811.205292 samples/sec#011accuracy=0.132940
 2017-11-03 22:33:40,171 INFO - root - Epoch [29] Batch [23]#011Speed: 3890.569799 samples/sec#011accuracy=0.133199
 2017-11-03 22:33:40,171 INFO - root - [Epoch 29] training: accuracy=0.133199
 2017-11-03 22:33:40,171 INFO - root - [Epoch 29] time cost: 12.869955
 2017-11-03 22:33:40,431 INFO - root - Epoch [29] Batch [23]#011Speed: 3960.756668 samples/sec#011accuracy=0.134420
 2017-11-03 22:33:40,432 INFO - root - [Epoch 29] training: accuracy=0.134420
 2017-11-03 22:33:40,432 INFO - root - [Epoch 29] time cost: 12.613915
 2017-11-03 22:33:41,454 INFO - root - [Epoch 29] validation: accuracy=0.138477
 2017-11-03 22:33:41,319 INFO - root - [Epoch 29] validation: accuracy=0.138867
 
...

 2017-11-03 22:38:09,718 INFO - root - Epoch [49] Batch [1]#011Speed: 3840.608006 samples/sec#011accuracy=0.672363
 2017-11-03 22:38:10,244 INFO - root - Epoch [49] Batch [2]#011Speed: 3895.921675 samples/sec#011accuracy=0.671549
 2017-11-03 22:38:10,792 INFO - root - Epoch [49] Batch [3]#011Speed: 3737.329678 samples/sec#011accuracy=0.669067
 2017-11-03 22:38:11,312 INFO - root - Epoch [49] Batch [4]#011Speed: 3937.371297 samples/sec#011accuracy=0.670703
 2017-11-03 22:38:09,425 INFO - root - Epoch [49] Batch [1]#011Speed: 2251.197440 samples/sec#011accuracy=0.676514
 2017-11-03 22:38:09,965 INFO - root - Epoch [49] Batch [2]#011Speed: 3795.644561 samples/sec#011accuracy=0.672363
 2017-11-03 22:38:10,492 INFO - root - Epoch [49] Batch [3]#011Speed: 3886.582678 samples/sec#011accuracy=0.670044
 2017-11-03 22:38:11,053 INFO - root - Epoch [49] Batch [4]#011Speed: 3654.677019 samples/sec#011accuracy=0.671387
 2017-11-03 22:38:11,833 INFO - root - Epoch [49] Batch [5]#011Speed: 3932.547638 samples/sec#011accuracy=0.671549
 2017-11-03 22:38:11,554 INFO - root - Epoch [49] Batch [5]#011Speed: 4086.691126 samples/sec#011accuracy=0.672445
 2017-11-03 22:38:12,088 INFO - root - Epoch [49] Batch [6]#011Speed: 3838.720062 samples/sec#011accuracy=0.672433
 2017-11-03 22:38:12,358 INFO - root - Epoch [49] Batch [6]#011Speed: 3907.480080 samples/sec#011accuracy=0.671247
 2017-11-03 22:38:12,899 INFO - root - Epoch [49] Batch [7]#011Speed: 3782.037866 samples/sec#011accuracy=0.670593
 2017-11-03 22:38:12,629 INFO - root - Epoch [49] Batch [7]#011Speed: 3782.722380 samples/sec#011accuracy=0.671753
 2017-11-03 22:38:13,169 INFO - root - Epoch [49] Batch [8]#011Speed: 3800.898059 samples/sec#011accuracy=0.674642
 2017-11-03 22:38:13,423 INFO - root - Epoch [49] Batch [8]#011Speed: 3909.455854 samples/sec#011accuracy=0.673937
 2017-11-03 22:38:13,949 INFO - root - Epoch [49] Batch [9]#011Speed: 3900.163496 samples/sec#011accuracy=0.673438
 2017-11-03 22:38:13,711 INFO - root - Epoch [49] Batch [9]#011Speed: 3774.099983 samples/sec#011accuracy=0.673877
 2017-11-03 22:38:14,230 INFO - root - Epoch [49] Batch [10]#011Speed: 3954.775528 samples/sec#011accuracy=0.674139
 2017-11-03 22:38:14,464 INFO - root - Epoch [49] Batch [10]#011Speed: 3977.873128 samples/sec#011accuracy=0.674893
 2017-11-03 22:38:14,984 INFO - root - Epoch [49] Batch [11]#011Speed: 3942.335263 samples/sec#011accuracy=0.675171
 2017-11-03 22:38:14,754 INFO - root - Epoch [49] Batch [11]#011Speed: 3906.607534 samples/sec#011accuracy=0.674805
 2017-11-03 22:38:15,275 INFO - root - Epoch [49] Batch [12]#011Speed: 3936.092128 samples/sec#011accuracy=0.674692
 2017-11-03 22:38:15,516 INFO - root - Epoch [49] Batch [12]#011Speed: 3845.120841 samples/sec#011accuracy=0.675669
 2017-11-03 22:38:16,032 INFO - root - Epoch [49] Batch [13]#011Speed: 3972.326956 samples/sec#011accuracy=0.677002
 2017-11-03 22:38:15,785 INFO - root - Epoch [49] Batch [13]#011Speed: 4015.262128 samples/sec#011accuracy=0.675956
 2017-11-03 22:38:16,317 INFO - root - Epoch [49] Batch [14]#011Speed: 3848.640462 samples/sec#011accuracy=0.675033
 2017-11-03 22:38:16,844 INFO - root - Epoch [49] Batch [15]#011Speed: 3889.283874 samples/sec#011accuracy=0.673462
 2017-11-03 22:38:17,368 INFO - root - Epoch [49] Batch [16]#011Speed: 3912.294076 samples/sec#011accuracy=0.673656
 2017-11-03 22:38:17,888 INFO - root - Epoch [49] Batch [17]#011Speed: 3939.270836 samples/sec#011accuracy=0.673367
 2017-11-03 22:38:18,394 INFO - root - Epoch [49] Batch [18]#011Speed: 4047.295096 samples/sec#011accuracy=0.674805
 2017-11-03 22:38:18,934 INFO - root - Epoch [49] Batch [19]#011Speed: 3791.329042 samples/sec#011accuracy=0.675122
 2017-11-03 22:38:16,522 INFO - root - Epoch [49] Batch [14]#011Speed: 4181.750581 samples/sec#011accuracy=0.675944
 2017-11-03 22:38:17,047 INFO - root - Epoch [49] Batch [15]#011Speed: 3902.878289 samples/sec#011accuracy=0.674377
 2017-11-03 22:38:17,559 INFO - root - Epoch [49] Batch [16]#011Speed: 4000.774355 samples/sec#011accuracy=0.674719
 2017-11-03 22:38:18,113 INFO - root - Epoch [49] Batch [17]#011Speed: 3701.922943 samples/sec#011accuracy=0.674561
 2017-11-03 22:38:18,699 INFO - root - Epoch [49] Batch [18]#011Speed: 3494.474129 samples/sec#011accuracy=0.675961
 2017-11-03 22:38:19,192 INFO - root - Epoch [49] Batch [19]#011Speed: 4151.650253 samples/sec#011accuracy=0.676367
 2017-11-03 22:38:19,732 INFO - root - Epoch [49] Batch [20]#011Speed: 3799.507693 samples/sec#011accuracy=0.677176
 2017-11-03 22:38:20,248 INFO - root - Epoch [49] Batch [21]#011Speed: 3965.188515 samples/sec#011accuracy=0.678267
 2017-11-03 22:38:19,483 INFO - root - Epoch [49] Batch [20]#011Speed: 3738.372265 samples/sec#011accuracy=0.676293
 2017-11-03 22:38:20,002 INFO - root - Epoch [49] Batch [21]#011Speed: 3949.026639 samples/sec#011accuracy=0.677246
 2017-11-03 22:38:20,761 INFO - root - Epoch [49] Batch [22]#011Speed: 3996.744230 samples/sec#011accuracy=0.678605
 2017-11-03 22:38:21,288 INFO - root - Epoch [49] Batch [23]#011Speed: 3885.838969 samples/sec#011accuracy=0.678955
 2017-11-03 22:38:21,288 INFO - root - [Epoch 49] training: accuracy=0.678955
 2017-11-03 22:38:21,289 INFO - root - [Epoch 49] time cost: 12.643478
 2017-11-03 22:38:20,508 INFO - root - Epoch [49] Batch [22]#011Speed: 4046.031182 samples/sec#011accuracy=0.677671
 2017-11-03 22:38:21,033 INFO - root - Epoch [49] Batch [23]#011Speed: 3900.418511 samples/sec#011accuracy=0.678162
 2017-11-03 22:38:21,034 INFO - root - [Epoch 49] training: accuracy=0.678162
 2017-11-03 22:38:21,034 INFO - root - [Epoch 49] time cost: 13.004197
 2017-11-03 22:38:22,160 INFO - root - [Epoch 49] validation: accuracy=0.563379

And here is another log, using dist_sync again with the exact same config as the previous example. Accuracy rises through the early epochs, but slowly and not very steadily, then collapses and bounces up and down for the rest of the job.


 2017-11-04 09:19:31,445 INFO - mxnet_container.train - Starting distributed training task
 [09:19:31] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/train.rec, use 31 threads for decoding..
 2017-11-04 09:19:31,850 INFO - mxnet_container.train - Starting distributed training task
 [09:19:31] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/train.rec, use 31 threads for decoding..
 [09:20:17] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/test.rec, use 31 threads for decoding..
 [09:20:16] src/io/iter_image_recordio_2.cc:169: ImageRecordIOParser2: /opt/ml/input/data/training/test.rec, use 31 threads for decoding..
 [09:20:35] src/operator/././cudnn_algoreg-inl.h:106: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
 [09:20:35] src/operator/././cudnn_algoreg-inl.h:106: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
 2017-11-04 09:21:01,013 INFO - root - [Epoch 0] training: accuracy=0.098574
 2017-11-04 09:21:01,013 INFO - root - [Epoch 0] time cost: 43.719973
 2017-11-04 09:21:01,296 INFO - root - [Epoch 0] training: accuracy=0.097988
 2017-11-04 09:21:01,296 INFO - root - [Epoch 0] time cost: 44.116472
 2017-11-04 09:21:02,137 INFO - root - [Epoch 0] validation: accuracy=0.091406
 2017-11-04 09:21:02,331 INFO - root - [Epoch 0] validation: accuracy=0.090332
 2017-11-04 09:21:14,716 INFO - root - [Epoch 1] training: accuracy=0.104736
 2017-11-04 09:21:14,717 INFO - root - [Epoch 1] time cost: 12.086653
 2017-11-04 09:21:14,975 INFO - root - [Epoch 1] training: accuracy=0.104533
 2017-11-04 09:21:14,975 INFO - root - [Epoch 1] time cost: 12.642990
 2017-11-04 09:21:15,856 INFO - root - [Epoch 1] validation: accuracy=0.105469
 2017-11-04 09:21:15,970 INFO - root - [Epoch 1] validation: accuracy=0.105566
 2017-11-04 09:21:28,916 INFO - root - [Epoch 2] training: accuracy=0.110625
 2017-11-04 09:21:28,916 INFO - root - [Epoch 2] time cost: 12.602215
 2017-11-04 09:21:29,183 INFO - root - [Epoch 2] training: accuracy=0.110273
 2017-11-04 09:21:29,183 INFO - root - [Epoch 2] time cost: 13.212795
 2017-11-04 09:21:30,050 INFO - root - [Epoch 2] validation: accuracy=0.105371
 2017-11-04 09:21:30,192 INFO - root - [Epoch 2] validation: accuracy=0.105176
 2017-11-04 09:21:42,421 INFO - root - [Epoch 3] training: accuracy=0.114237
 2017-11-04 09:21:42,421 INFO - root - [Epoch 3] time cost: 12.370330
 2017-11-04 09:21:42,681 INFO - root - [Epoch 3] training: accuracy=0.114156
 2017-11-04 09:21:42,681 INFO - root - [Epoch 3] time cost: 12.488236
 2017-11-04 09:21:43,546 INFO - root - [Epoch 3] validation: accuracy=0.116406
 2017-11-04 09:21:43,667 INFO - root - [Epoch 3] validation: accuracy=0.116406
 2017-11-04 09:21:56,602 INFO - root - [Epoch 4] training: accuracy=0.124805
 2017-11-04 09:21:56,603 INFO - root - [Epoch 4] time cost: 12.611940
 2017-11-04 09:21:56,845 INFO - root - [Epoch 4] training: accuracy=0.124961
 2017-11-04 09:21:56,845 INFO - root - [Epoch 4] time cost: 13.177747
 2017-11-04 09:21:57,834 INFO - root - [Epoch 4] validation: accuracy=0.104785
 2017-11-04 09:21:57,731 INFO - root - [Epoch 4] validation: accuracy=0.105566
 2017-11-04 09:22:10,009 INFO - root - [Epoch 5] training: accuracy=0.130127
 2017-11-04 09:22:10,010 INFO - root - [Epoch 5] time cost: 12.278370
 2017-11-04 09:22:10,263 INFO - root - [Epoch 5] training: accuracy=0.130452
 2017-11-04 09:22:10,263 INFO - root - [Epoch 5] time cost: 12.428541
 2017-11-04 09:22:11,118 INFO - root - [Epoch 5] validation: accuracy=0.135742
 2017-11-04 09:22:11,242 INFO - root - [Epoch 5] validation: accuracy=0.135352
 2017-11-04 09:22:23,704 INFO - root - [Epoch 6] training: accuracy=0.142761
 2017-11-04 09:22:23,705 INFO - root - [Epoch 6] time cost: 12.126450
 2017-11-04 09:22:23,982 INFO - root - [Epoch 6] training: accuracy=0.142924
 2017-11-04 09:22:23,982 INFO - root - [Epoch 6] time cost: 12.739633
 2017-11-04 09:22:24,820 INFO - root - [Epoch 6] validation: accuracy=0.104980
 2017-11-04 09:22:24,955 INFO - root - [Epoch 6] validation: accuracy=0.104883
 2017-11-04 09:22:37,612 INFO - root - [Epoch 7] training: accuracy=0.142891
 2017-11-04 09:22:37,612 INFO - root - [Epoch 7] time cost: 12.791654
 2017-11-04 09:22:37,849 INFO - root - [Epoch 7] training: accuracy=0.143535
 2017-11-04 09:22:37,849 INFO - root - [Epoch 7] time cost: 12.893820
 2017-11-04 09:22:38,538 INFO - root - [Epoch 7] validation: accuracy=0.158569
 2017-11-04 09:22:38,642 INFO - root - [Epoch 7] validation: accuracy=0.159058
 2017-11-04 09:22:51,132 INFO - root - [Epoch 8] training: accuracy=0.159078
 2017-11-04 09:22:51,133 INFO - root - [Epoch 8] time cost: 12.111735
 2017-11-04 09:22:51,390 INFO - root - [Epoch 8] training: accuracy=0.159770
 2017-11-04 09:22:51,390 INFO - root - [Epoch 8] time cost: 12.747310
 2017-11-04 09:22:52,283 INFO - root - [Epoch 8] validation: accuracy=0.170313
 2017-11-04 09:22:52,382 INFO - root - [Epoch 8] validation: accuracy=0.170605
 2017-11-04 09:23:05,664 INFO - root - [Epoch 9] training: accuracy=0.170234
 2017-11-04 09:23:05,664 INFO - root - [Epoch 9] time cost: 13.281223
 2017-11-04 09:23:06,650 INFO - root - [Epoch 9] validation: accuracy=0.153320
 2017-11-04 09:23:05,403 INFO - root - [Epoch 9] training: accuracy=0.170059
 2017-11-04 09:23:05,403 INFO - root - [Epoch 9] time cost: 12.627474
 2017-11-04 09:23:06,528 INFO - root - [Epoch 9] validation: accuracy=0.153027
 2017-11-04 09:23:18,769 INFO - root - [Epoch 10] training: accuracy=0.179036
 2017-11-04 09:23:18,769 INFO - root - [Epoch 10] time cost: 12.240948
 2017-11-04 09:23:19,041 INFO - root - [Epoch 10] training: accuracy=0.179016
 2017-11-04 09:23:19,041 INFO - root - [Epoch 10] time cost: 12.390525
 2017-11-04 09:23:19,899 INFO - root - [Epoch 10] validation: accuracy=0.186914
 2017-11-04 09:23:20,031 INFO - root - [Epoch 10] validation: accuracy=0.186914
 2017-11-04 09:23:32,326 INFO - root - [Epoch 11] training: accuracy=0.197062
 2017-11-04 09:23:32,326 INFO - root - [Epoch 11] time cost: 11.943790
 2017-11-04 09:23:32,572 INFO - root - [Epoch 11] training: accuracy=0.196391
 2017-11-04 09:23:32,572 INFO - root - [Epoch 11] time cost: 12.540984
 2017-11-04 09:23:33,437 INFO - root - [Epoch 11] validation: accuracy=0.177832
 2017-11-04 09:23:33,566 INFO - root - [Epoch 11] validation: accuracy=0.177539
 2017-11-04 09:23:46,132 INFO - root - [Epoch 12] training: accuracy=0.192344
 2017-11-04 09:23:46,132 INFO - root - [Epoch 12] time cost: 12.695587
 2017-11-04 09:23:46,411 INFO - root - [Epoch 12] training: accuracy=0.192246
 2017-11-04 09:23:46,411 INFO - root - [Epoch 12] time cost: 12.845486
 2017-11-04 09:23:47,312 INFO - root - [Epoch 12] validation: accuracy=0.184180
 2017-11-04 09:23:47,412 INFO - root - [Epoch 12] validation: accuracy=0.184473
 2017-11-04 09:23:59,553 INFO - root - [Epoch 13] training: accuracy=0.194967
 2017-11-04 09:23:59,553 INFO - root - [Epoch 13] time cost: 12.240608
 2017-11-04 09:23:59,786 INFO - root - [Epoch 13] training: accuracy=0.195089
 2017-11-04 09:23:59,787 INFO - root - [Epoch 13] time cost: 12.374072
 2017-11-04 09:24:00,665 INFO - root - [Epoch 13] validation: accuracy=0.211719
 2017-11-04 09:24:00,797 INFO - root - [Epoch 13] validation: accuracy=0.211523
 2017-11-04 09:24:13,738 INFO - root - [Epoch 14] training: accuracy=0.200664
 2017-11-04 09:24:13,738 INFO - root - [Epoch 14] time cost: 12.615486
 2017-11-04 09:24:14,023 INFO - root - [Epoch 14] training: accuracy=0.200723
 2017-11-04 09:24:14,023 INFO - root - [Epoch 14] time cost: 13.225932
 2017-11-04 09:24:14,912 INFO - root - [Epoch 14] validation: accuracy=0.211426
 2017-11-04 09:24:15,019 INFO - root - [Epoch 14] validation: accuracy=0.211523
 2017-11-04 09:24:27,153 INFO - root - [Epoch 15] training: accuracy=0.213908
 2017-11-04 09:24:27,153 INFO - root - [Epoch 15] time cost: 12.240951
 2017-11-04 09:24:27,349 INFO - root - [Epoch 15] training: accuracy=0.214274
 2017-11-04 09:24:27,349 INFO - root - [Epoch 15] time cost: 12.329125
 2017-11-04 09:24:28,227 INFO - root - [Epoch 15] validation: accuracy=0.196191
 2017-11-04 09:24:28,336 INFO - root - [Epoch 15] validation: accuracy=0.196289
 2017-11-04 09:24:40,983 INFO - root - [Epoch 16] training: accuracy=0.219297
 2017-11-04 09:24:40,983 INFO - root - [Epoch 16] time cost: 12.755961
 2017-11-04 09:24:41,253 INFO - root - [Epoch 16] training: accuracy=0.220059
 2017-11-04 09:24:41,253 INFO - root - [Epoch 16] time cost: 12.916951
 2017-11-04 09:24:41,952 INFO - root - [Epoch 16] validation: accuracy=0.223022
 2017-11-04 09:24:42,085 INFO - root - [Epoch 16] validation: accuracy=0.223511
 2017-11-04 09:24:54,514 INFO - root - [Epoch 17] training: accuracy=0.207581
 2017-11-04 09:24:54,514 INFO - root - [Epoch 17] time cost: 12.063521
 2017-11-04 09:24:54,789 INFO - root - [Epoch 17] training: accuracy=0.207723
 2017-11-04 09:24:54,789 INFO - root - [Epoch 17] time cost: 12.703974
 2017-11-04 09:24:55,671 INFO - root - [Epoch 17] validation: accuracy=0.099805
 2017-11-04 09:24:55,790 INFO - root - [Epoch 17] validation: accuracy=0.099805
 2017-11-04 09:25:08,027 INFO - root - [Epoch 18] training: accuracy=0.101542
 2017-11-04 09:25:08,027 INFO - root - [Epoch 18] time cost: 12.355660
 2017-11-04 09:25:08,254 INFO - root - [Epoch 18] training: accuracy=0.101420
 2017-11-04 09:25:08,254 INFO - root - [Epoch 18] time cost: 12.463921
 2017-11-04 09:25:09,120 INFO - root - [Epoch 18] validation: accuracy=0.132812
 2017-11-04 09:25:09,244 INFO - root - [Epoch 18] validation: accuracy=0.132422
 2017-11-04 09:25:21,767 INFO - root - [Epoch 19] training: accuracy=0.165957
 2017-11-04 09:25:21,767 INFO - root - [Epoch 19] time cost: 12.647036
 2017-11-04 09:25:22,021 INFO - root - [Epoch 19] training: accuracy=0.166895
 2017-11-04 09:25:22,021 INFO - root - [Epoch 19] time cost: 12.776301
 2017-11-04 09:25:22,887 INFO - root - [Epoch 19] validation: accuracy=0.178711
 2017-11-04 09:25:23,014 INFO - root - [Epoch 19] validation: accuracy=0.178711
 2017-11-04 09:25:35,266 INFO - root - [Epoch 20] training: accuracy=0.174316
 2017-11-04 09:25:35,266 INFO - root - [Epoch 20] time cost: 12.379125
 2017-11-04 09:25:36,353 INFO - root - [Epoch 20] validation: accuracy=0.099609
 2017-11-04 09:25:35,476 INFO - root - [Epoch 20] training: accuracy=0.174377
 2017-11-04 09:25:35,476 INFO - root - [Epoch 20] time cost: 12.461397
 2017-11-04 09:25:36,447 INFO - root - [Epoch 20] validation: accuracy=0.099609
 2017-11-04 09:25:49,051 INFO - root - [Epoch 21] training: accuracy=0.125703
 2017-11-04 09:25:49,051 INFO - root - [Epoch 21] time cost: 12.697860
 2017-11-04 09:25:49,324 INFO - root - [Epoch 21] training: accuracy=0.125938
 2017-11-04 09:25:49,324 INFO - root - [Epoch 21] time cost: 12.877100
 2017-11-04 09:25:50,191 INFO - root - [Epoch 21] validation: accuracy=0.103027
 2017-11-04 09:25:50,326 INFO - root - [Epoch 21] validation: accuracy=0.103027
 2017-11-04 09:26:02,858 INFO - root - [Epoch 22] training: accuracy=0.141866
 2017-11-04 09:26:02,858 INFO - root - [Epoch 22] time cost: 12.531573
 2017-11-04 09:26:03,849 INFO - root - [Epoch 22] validation: accuracy=0.164355
 2017-11-04 09:26:02,566 INFO - root - [Epoch 22] training: accuracy=0.141703
 2017-11-04 09:26:02,566 INFO - root - [Epoch 22] time cost: 12.374612
 2017-11-04 09:26:03,704 INFO - root - [Epoch 22] validation: accuracy=0.164551
 2017-11-04 09:26:15,989 INFO - root - [Epoch 23] training: accuracy=0.161458
 2017-11-04 09:26:15,989 INFO - root - [Epoch 23] time cost: 12.284893
 2017-11-04 09:26:16,260 INFO - root - [Epoch 23] training: accuracy=0.161580
 2017-11-04 09:26:16,260 INFO - root - [Epoch 23] time cost: 12.410921
 2017-11-04 09:26:17,149 INFO - root - [Epoch 23] validation: accuracy=0.127930
 2017-11-04 09:26:17,250 INFO - root - [Epoch 23] validation: accuracy=0.127930
 2017-11-04 09:26:30,197 INFO - root - [Epoch 24] training: accuracy=0.150918
 2017-11-04 09:26:30,197 INFO - root - [Epoch 24] time cost: 12.947518
 2017-11-04 09:26:30,986 INFO - root - [Epoch 24] validation: accuracy=0.186768
 2017-11-04 09:26:29,924 INFO - root - [Epoch 24] training: accuracy=0.150742
 2017-11-04 09:26:29,924 INFO - root - [Epoch 24] time cost: 12.774948
 2017-11-04 09:26:30,893 INFO - root - [Epoch 24] validation: accuracy=0.186768
 2017-11-04 09:26:43,268 INFO - root - [Epoch 25] training: accuracy=0.173706
 2017-11-04 09:26:43,268 INFO - root - [Epoch 25] time cost: 12.374998
 2017-11-04 09:26:43,386 INFO - root - [Epoch 25] training: accuracy=0.173889
 2017-11-04 09:26:43,386 INFO - root - [Epoch 25] time cost: 12.399672
 2017-11-04 09:26:44,430 INFO - root - [Epoch 25] validation: accuracy=0.177734
 2017-11-04 09:26:44,411 INFO - root - [Epoch 25] validation: accuracy=0.178125
 2017-11-04 09:26:57,201 INFO - root - [Epoch 26] training: accuracy=0.214102
 2017-11-04 09:26:57,201 INFO - root - [Epoch 26] time cost: 12.770151
 2017-11-04 09:26:57,416 INFO - root - [Epoch 26] training: accuracy=0.214609
 2017-11-04 09:26:57,416 INFO - root - [Epoch 26] time cost: 13.004954
 2017-11-04 09:26:58,310 INFO - root - [Epoch 26] validation: accuracy=0.189648
 2017-11-04 09:26:58,407 INFO - root - [Epoch 26] validation: accuracy=0.188477
 2017-11-04 09:27:10,548 INFO - root - [Epoch 27] training: accuracy=0.207926
 2017-11-04 09:27:10,548 INFO - root - [Epoch 27] time cost: 12.237620
 2017-11-04 09:27:10,724 INFO - root - [Epoch 27] training: accuracy=0.207642
 2017-11-04 09:27:10,724 INFO - root - [Epoch 27] time cost: 12.317256
 2017-11-04 09:27:11,691 INFO - root - [Epoch 27] validation: accuracy=0.176953
 2017-11-04 09:27:11,794 INFO - root - [Epoch 27] validation: accuracy=0.176660
 2017-11-04 09:27:24,447 INFO - root - [Epoch 28] training: accuracy=0.187891
 2017-11-04 09:27:24,447 INFO - root - [Epoch 28] time cost: 12.756263
 2017-11-04 09:27:24,702 INFO - root - [Epoch 28] training: accuracy=0.187422
 2017-11-04 09:27:24,702 INFO - root - [Epoch 28] time cost: 12.907530
 2017-11-04 09:27:25,578 INFO - root - [Epoch 28] validation: accuracy=0.173535
 2017-11-04 09:27:25,700 INFO - root - [Epoch 28] validation: accuracy=0.173730
 2017-11-04 09:27:38,035 INFO - root - [Epoch 29] training: accuracy=0.160156
 2017-11-04 09:27:38,035 INFO - root - [Epoch 29] time cost: 12.457565
 2017-11-04 09:27:38,300 INFO - root - [Epoch 29] training: accuracy=0.159688
 2017-11-04 09:27:38,300 INFO - root - [Epoch 29] time cost: 12.599412
 2017-11-04 09:27:39,178 INFO - root - [Epoch 29] validation: accuracy=0.105859
 2017-11-04 09:27:39,281 INFO - root - [Epoch 29] validation: accuracy=0.105762
 2017-11-04 09:27:51,418 INFO - root - [Epoch 30] training: accuracy=0.163391
 2017-11-04 09:27:51,419 INFO - root - [Epoch 30] time cost: 12.240153
 2017-11-04 09:27:51,648 INFO - root - [Epoch 30] training: accuracy=0.163554
 2017-11-04 09:27:51,649 INFO - root - [Epoch 30] time cost: 12.367119
 2017-11-04 09:27:52,531 INFO - root - [Epoch 30] validation: accuracy=0.181250
 2017-11-04 09:27:52,646 INFO - root - [Epoch 30] validation: accuracy=0.181348
 2017-11-04 09:28:05,408 INFO - root - [Epoch 31] training: accuracy=0.189805
 2017-11-04 09:28:05,408 INFO - root - [Epoch 31] time cost: 12.876806
 2017-11-04 09:28:05,695 INFO - root - [Epoch 31] training: accuracy=0.189043
 2017-11-04 09:28:05,695 INFO - root - [Epoch 31] time cost: 13.048735
 2017-11-04 09:28:06,599 INFO - root - [Epoch 31] validation: accuracy=0.101270
 2017-11-04 09:28:06,692 INFO - root - [Epoch 31] validation: accuracy=0.101465
 2017-11-04 09:28:18,842 INFO - root - [Epoch 32] training: accuracy=0.118795
 2017-11-04 09:28:18,842 INFO - root - [Epoch 32] time cost: 12.242412
 2017-11-04 09:28:19,118 INFO - root - [Epoch 32] training: accuracy=0.118835
 2017-11-04 09:28:19,118 INFO - root - [Epoch 32] time cost: 12.425529
 2017-11-04 09:28:19,966 INFO - root - [Epoch 32] validation: accuracy=0.124316
 2017-11-04 09:28:20,107 INFO - root - [Epoch 32] validation: accuracy=0.124316
 2017-11-04 09:28:32,794 INFO - root - [Epoch 33] training: accuracy=0.151797
 2017-11-04 09:28:32,794 INFO - root - [Epoch 33] time cost: 12.827599
 2017-11-04 09:28:33,057 INFO - root - [Epoch 33] training: accuracy=0.152500
 2017-11-04 09:28:33,057 INFO - root - [Epoch 33] time cost: 12.949730
 2017-11-04 09:28:33,737 INFO - root - [Epoch 33] validation: accuracy=0.164429
 2017-11-04 09:28:33,858 INFO - root - [Epoch 33] validation: accuracy=0.164551
 2017-11-04 09:28:46,166 INFO - root - [Epoch 34] training: accuracy=0.171570
 2017-11-04 09:28:46,166 INFO - root - [Epoch 34] time cost: 12.428155
 2017-11-04 09:28:46,314 INFO - root - [Epoch 34] training: accuracy=0.170939
 2017-11-04 09:28:46,314 INFO - root - [Epoch 34] time cost: 12.456010
 2017-11-04 09:28:47,284 INFO - root - [Epoch 34] validation: accuracy=0.091113
 2017-11-04 09:28:47,331 INFO - root - [Epoch 34] validation: accuracy=0.091211
 2017-11-04 09:28:59,454 INFO - root - [Epoch 35] training: accuracy=0.200175
 2017-11-04 09:28:59,454 INFO - root - [Epoch 35] time cost: 12.169878
 2017-11-04 09:28:59,660 INFO - root - [Epoch 35] training: accuracy=0.199219
 2017-11-04 09:28:59,661 INFO - root - [Epoch 35] time cost: 12.329283
 2017-11-04 09:29:00,544 INFO - root - [Epoch 35] validation: accuracy=0.166016
 2017-11-04 09:29:00,644 INFO - root - [Epoch 35] validation: accuracy=0.165918
 2017-11-04 09:29:13,445 INFO - root - [Epoch 36] training: accuracy=0.207461
 2017-11-04 09:29:13,446 INFO - root - [Epoch 36] time cost: 12.901035
 2017-11-04 09:29:13,628 INFO - root - [Epoch 36] training: accuracy=0.207129
 2017-11-04 09:29:13,628 INFO - root - [Epoch 36] time cost: 12.984184
 2017-11-04 09:29:14,531 INFO - root - [Epoch 36] validation: accuracy=0.219141
 2017-11-04 09:29:14,617 INFO - root - [Epoch 36] validation: accuracy=0.218652
 2017-11-04 09:29:26,817 INFO - root - [Epoch 37] training: accuracy=0.220561
 2017-11-04 09:29:26,817 INFO - root - [Epoch 37] time cost: 12.285670
 2017-11-04 09:29:27,083 INFO - root - [Epoch 37] training: accuracy=0.220988
 2017-11-04 09:29:27,083 INFO - root - [Epoch 37] time cost: 12.465264
 2017-11-04 09:29:28,053 INFO - root - [Epoch 37] validation: accuracy=0.211719
 2017-11-04 09:29:27,992 INFO - root - [Epoch 37] validation: accuracy=0.211914
 2017-11-04 09:29:41,097 INFO - root - [Epoch 38] training: accuracy=0.233086
 2017-11-04 09:29:41,097 INFO - root - [Epoch 38] time cost: 13.043344
 2017-11-04 09:29:41,048 INFO - root - [Epoch 38] training: accuracy=0.232734
 2017-11-04 09:29:41,048 INFO - root - [Epoch 38] time cost: 13.055960
 2017-11-04 09:29:42,184 INFO - root - [Epoch 38] validation: accuracy=0.208789
 2017-11-04 09:29:42,150 INFO - root - [Epoch 38] validation: accuracy=0.208496
 2017-11-04 09:29:54,349 INFO - root - [Epoch 39] training: accuracy=0.251506
 2017-11-04 09:29:54,349 INFO - root - [Epoch 39] time cost: 12.164769
 2017-11-04 09:29:54,587 INFO - root - [Epoch 39] training: accuracy=0.250997
 2017-11-04 09:29:54,587 INFO - root - [Epoch 39] time cost: 12.436299
 2017-11-04 09:29:55,494 INFO - root - [Epoch 39] validation: accuracy=0.221582
 2017-11-04 09:29:55,560 INFO - root - [Epoch 39] validation: accuracy=0.221094
 2017-11-04 09:30:07,752 INFO - root - [Epoch 40] training: accuracy=0.264648
 2017-11-04 09:30:07,753 INFO - root - [Epoch 40] time cost: 12.258241
 2017-11-04 09:30:08,011 INFO - root - [Epoch 40] training: accuracy=0.264364
 2017-11-04 09:30:08,011 INFO - root - [Epoch 40] time cost: 12.451312
 2017-11-04 09:30:08,926 INFO - root - [Epoch 40] validation: accuracy=0.243848
 2017-11-04 09:30:08,999 INFO - root - [Epoch 40] validation: accuracy=0.243457
 2017-11-04 09:30:22,040 INFO - root - [Epoch 41] training: accuracy=0.276797
 2017-11-04 09:30:22,040 INFO - root - [Epoch 41] time cost: 12.609574
 2017-11-04 09:30:23,007 INFO - root - [Epoch 41] validation: accuracy=0.267090
 2017-11-04 09:30:22,285 INFO - root - [Epoch 41] training: accuracy=0.276582
 2017-11-04 09:30:22,285 INFO - root - [Epoch 41] time cost: 13.285149
 2017-11-04 09:30:23,081 INFO - root - [Epoch 41] validation: accuracy=0.266846
 2017-11-04 09:30:35,615 INFO - root - [Epoch 42] training: accuracy=0.293233
 2017-11-04 09:30:35,615 INFO - root - [Epoch 42] time cost: 12.138875
 2017-11-04 09:30:35,850 INFO - root - [Epoch 42] training: accuracy=0.292664
 2017-11-04 09:30:35,850 INFO - root - [Epoch 42] time cost: 12.769114
 2017-11-04 09:30:36,715 INFO - root - [Epoch 42] validation: accuracy=0.276660
 2017-11-04 09:30:36,861 INFO - root - [Epoch 42] validation: accuracy=0.276953
 2017-11-04 09:30:49,848 INFO - root - [Epoch 43] training: accuracy=0.304004
 2017-11-04 09:30:49,848 INFO - root - [Epoch 43] time cost: 12.644325
 2017-11-04 09:30:50,106 INFO - root - [Epoch 43] training: accuracy=0.305586
 2017-11-04 09:30:50,106 INFO - root - [Epoch 43] time cost: 13.245051
 2017-11-04 09:30:50,984 INFO - root - [Epoch 43] validation: accuracy=0.279199
 2017-11-04 09:30:51,086 INFO - root - [Epoch 43] validation: accuracy=0.279395
 2017-11-04 09:31:03,438 INFO - root - [Epoch 44] training: accuracy=0.312724
 2017-11-04 09:31:03,438 INFO - root - [Epoch 44] time cost: 12.003252
 2017-11-04 09:31:03,627 INFO - root - [Epoch 44] training: accuracy=0.312154
 2017-11-04 09:31:03,627 INFO - root - [Epoch 44] time cost: 12.540570
 2017-11-04 09:31:04,502 INFO - root - [Epoch 44] validation: accuracy=0.254102
 2017-11-04 09:31:04,610 INFO - root - [Epoch 44] validation: accuracy=0.253809
 2017-11-04 09:31:17,423 INFO - root - [Epoch 45] training: accuracy=0.338047
 2017-11-04 09:31:17,423 INFO - root - [Epoch 45] time cost: 12.921498
 2017-11-04 09:31:18,579 INFO - root - [Epoch 45] validation: accuracy=0.342383
 2017-11-04 09:31:17,697 INFO - root - [Epoch 45] training: accuracy=0.337578
 2017-11-04 09:31:17,697 INFO - root - [Epoch 45] time cost: 13.086395
 2017-11-04 09:31:18,679 INFO - root - [Epoch 45] validation: accuracy=0.341992
 2017-11-04 09:31:31,281 INFO - root - [Epoch 46] training: accuracy=0.300313
 2017-11-04 09:31:31,281 INFO - root - [Epoch 46] time cost: 12.208615
 2017-11-04 09:31:31,563 INFO - root - [Epoch 46] training: accuracy=0.299886
 2017-11-04 09:31:31,563 INFO - root - [Epoch 46] time cost: 12.883930
 2017-11-04 09:31:32,430 INFO - root - [Epoch 46] validation: accuracy=0.106934
 2017-11-04 09:31:32,562 INFO - root - [Epoch 46] validation: accuracy=0.106738
 2017-11-04 09:31:44,819 INFO - root - [Epoch 47] training: accuracy=0.116252
 2017-11-04 09:31:44,819 INFO - root - [Epoch 47] time cost: 12.388544
 2017-11-04 09:31:45,095 INFO - root - [Epoch 47] training: accuracy=0.116577
 2017-11-04 09:31:45,095 INFO - root - [Epoch 47] time cost: 12.532603
 2017-11-04 09:31:45,963 INFO - root - [Epoch 47] validation: accuracy=0.120801
 2017-11-04 09:31:46,069 INFO - root - [Epoch 47] validation: accuracy=0.120898
 2017-11-04 09:31:58,771 INFO - root - [Epoch 48] training: accuracy=0.176543
 2017-11-04 09:31:58,771 INFO - root - [Epoch 48] time cost: 12.807537
 2017-11-04 09:31:59,021 INFO - root - [Epoch 48] training: accuracy=0.175898
 2017-11-04 09:31:59,021 INFO - root - [Epoch 48] time cost: 12.951865
 2017-11-04 09:31:59,902 INFO - root - [Epoch 48] validation: accuracy=0.141211
 2017-11-04 09:32:00,015 INFO - root - [Epoch 48] validation: accuracy=0.140332
 2017-11-04 09:32:12,181 INFO - root - [Epoch 49] training: accuracy=0.219889
 2017-11-04 09:32:12,181 INFO - root - [Epoch 49] time cost: 12.279304
 2017-11-04 09:32:12,433 INFO - root - [Epoch 49] training: accuracy=0.220398
 2017-11-04 09:32:12,433 INFO - root - [Epoch 49] time cost: 12.417594
 2017-11-04 09:32:13,295 INFO - root - [Epoch 49] validation: accuracy=0.226074
 2017-11-04 09:32:13,436 INFO - root - [Epoch 49] validation: accuracy=0.225977

The training log for dist_sync is quite bouncy. Did you try a smaller learning rate to check whether training is more stable?
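
To put a number on how bouncy the run is, the per-epoch validation lines can be pulled out of the log and checked for sharp drops. This is only a minimal sketch: it assumes the `[Epoch N] validation: accuracy=X` line format shown in the logs above, and the helper names and the 0.05 drop threshold are made up for illustration. The sample lines are copied from the dist_sync log in this thread.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   2017-11-04 09:24:55,671 INFO - root - [Epoch 17] validation: accuracy=0.099805
VAL_RE = re.compile(r"\[Epoch (\d+)\] validation: accuracy=([0-9.]+)")


def validation_by_epoch(lines):
    """Return {epoch: mean validation accuracy over the workers that logged it}."""
    seen = defaultdict(list)
    for line in lines:
        match = VAL_RE.search(line)
        if match:
            seen[int(match.group(1))].append(float(match.group(2)))
    return {epoch: sum(vals) / len(vals) for epoch, vals in sorted(seen.items())}


def sharp_drops(acc, threshold=0.05):
    """List epochs whose accuracy fell by more than `threshold` (an arbitrary
    illustrative value) relative to the previous logged epoch."""
    epochs = sorted(acc)
    return [cur for prev, cur in zip(epochs, epochs[1:])
            if acc[prev] - acc[cur] > threshold]


# Sample lines taken from the dist_sync log above (two workers per epoch).
sample = [
    "2017-11-04 09:24:41,952 INFO - root - [Epoch 16] validation: accuracy=0.223022",
    "2017-11-04 09:24:42,085 INFO - root - [Epoch 16] validation: accuracy=0.223511",
    "2017-11-04 09:24:55,671 INFO - root - [Epoch 17] validation: accuracy=0.099805",
    "2017-11-04 09:24:55,790 INFO - root - [Epoch 17] validation: accuracy=0.099805",
    "2017-11-04 09:25:09,120 INFO - root - [Epoch 18] validation: accuracy=0.132812",
]
acc = validation_by_epoch(sample)
print(sharp_drops(acc))  # epochs where validation accuracy collapsed
```

Running the same functions over the full log from each learning-rate setting would give a rough side-by-side count of collapses.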

I didn’t find a unit test for dist_device_sync mode in MXNet. In theory it should give the same result as dist_sync. Let me verify whether dist_device_sync passes a simple test. I’ll let you know how it goes.

I verified at my end with this unit test on dist_device_sync kvstore, and the basic functionality seems to be working:

Does reducing the learning rate help, though?

BTW - what model are you training?

We (I’m working with @owenataws) were training resnet_v2_34 on cifar10 data. Our learning rate was too high. We will have new test results soon and will post an update then.
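
For anyone retrying a similar setup, one simple way to look for a workable rate is to start from the scaled value reported earlier in the thread (0.1 * gpus) and halve it a few times. The sketch below is only an illustration of that arithmetic: the helper name and the number of halvings are assumptions, and it does not say which rate ended up working for this job.

```python
def candidate_learning_rates(per_gpu_lr, num_gpus, halvings=4):
    """Return the scaled rate followed by successively halved rates.

    `halvings` is an arbitrary illustrative choice, not a recommendation.
    """
    scaled = per_gpu_lr * num_gpus  # e.g. 0.1 * gpus, as reported in the thread
    return [scaled / (2 ** k) for k in range(halvings + 1)]


# 16 GPUs at 0.1 per GPU, matching the configuration posted above.
rates = candidate_learning_rates(per_gpu_lr=0.1, num_gpus=16)
for lr in rates:
    print("try learning_rate=%.3f" % lr)
```

Each candidate would be passed to the training job as its learning rate, keeping batch size and kvstore fixed so the runs stay comparable.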