Hi, I have implemented a similar 1-bit gradient compression algorithm in MXNet. However, when I trained ResNet-110 on the CIFAR-10 dataset to compare my implementation against the 2-bit compression and no-compression baselines, I found that the speedup from gradient quantization does not seem significant. The training command and logs are shown below. I deployed the training job across four nodes (each equipped with four K80 GPUs), with one parameter server and three workers. Is there anything incorrect in my training setup?
P.S.: I use the example code in the directory example/image-classification.
mxnet version: 1.4.0
cuda: 8.0
training command:

```
python ../../tools/launch.py --launcher ssh -H hosts -s 1 -n 3 python train_cifar10.py --gc-type 2bit --gc-threshold 1 --kv-store dist_sync --num-epochs 200 --batch-size 128 --lr-step-epochs 100,150 --wd 0.0001 --lr 0.1 --lr-factor 0.1 --network resnet --gpus 0,1,2,3
```
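For context, my 1-bit implementation follows the usual sign quantization with error feedback. Below is a minimal NumPy sketch of that general scheme, not my actual MXNet kernel; the function names and the choice of the mean absolute value as the scale are illustrative assumptions:

```python
import numpy as np

def onebit_compress(grad, residual):
    """1-bit compression with error feedback (sketch).

    Each element is encoded as +scale or -scale, where scale is the
    mean magnitude of the error-corrected gradient. The quantization
    error is accumulated in `residual` and added back on the next step.
    """
    corrected = grad + residual            # apply error feedback
    scale = np.mean(np.abs(corrected))     # one shared magnitude per tensor
    quantized = np.where(corrected >= 0, scale, -scale)
    residual[:] = corrected - quantized    # carry the error forward
    return corrected >= 0, scale           # 1 bit per element + one scalar

def onebit_decompress(bits, scale):
    """Reconstruct the quantized gradient from the sign bits."""
    return np.where(bits, scale, -scale)

# Example: one compression round on a small gradient
grad = np.array([0.5, -1.2, 0.3])
residual = np.zeros_like(grad)
bits, scale = onebit_compress(grad, residual)
restored = onebit_decompress(bits, scale)
```

In a real implementation the boolean array would be packed into actual bits before transmission, which is where the 32x bandwidth reduction (versus float32) comes from.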
Training Result
100th epoch:
| | No Quantization | 2bit Quantization | 1bit Quantization |
|---|---|---|---|
| time cost (seconds) | 19.27 | 19.777 | 18.545 |
| validation accuracy | 0.89122 | 0.887921 | 0.885871 |
150th epoch:
| | No Quantization | 2bit Quantization | 1bit Quantization |
|---|---|---|---|
| time cost (seconds) | 18.73 | 22.357 | 20.339 |
| validation accuracy | 0.92758 | 0.929688 | 0.929109 |
200th epoch:
| | No Quantization | 2bit Quantization | 1bit Quantization |
|---|---|---|---|
| time cost (seconds) | 19.048 | 18.846 | 19.649 |
| validation accuracy | 0.929988 | 0.935397 | 0.937500 |