The speedup from gradient compression does not seem significant for ResNet training

Hi, I have implemented a similar 1-bit gradient compression algorithm in MXNet. However, when I train ResNet-110 on CIFAR-10 to compare my implementation with the built-in 2-bit compression and with no compression, I find that the speedup from gradient quantization for ResNet training does not seem significant. The training command and logs are shown below. I deployed the training job across four nodes (each equipped with four K80 GPUs): one parameter server and three workers. Is there anything incorrect in my training setup?
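
To clarify what I mean by "1bit": the algorithm I implemented is essentially sign-based quantization with an error-feedback residual. The NumPy sketch below is only an illustration of that idea (it is not my actual MXNet operator; `quantize_1bit` and the variable names are made up for this example):

```python
import numpy as np

def quantize_1bit(grad, residual, threshold=1.0):
    """Sign-based 1-bit quantization with error feedback (illustration only).

    Each element of (grad + residual) is sent as +threshold or -threshold,
    i.e. one bit per element plus a shared scale; the quantization error
    stays on the worker in `residual` and is added back next iteration.
    """
    corrected = grad + residual
    compressed = np.where(corrected >= 0, threshold, -threshold)
    residual[:] = corrected - compressed  # keep the error for the next step
    return compressed

# toy usage
grad = np.array([0.3, -0.7, 0.05, -0.01])
residual = np.zeros_like(grad)
print(quantize_1bit(grad, residual))  # [ 1. -1.  1. -1.]
print(residual)                       # [-0.7   0.3  -0.95  0.99]
```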

PS: I use the example code in the example/image-classification directory.
MXNet version: 1.4.0
CUDA: 8.0

training command:

python ../../tools/launch.py --launcher ssh -H hosts -s 1 -n 3 python train_cifar10.py --gc-type 2bit --gc-threshold 1 --kv-store dist_sync --num-epochs 200 --batch-size 128 --lr-step-epochs 100,150 --wd 0.0001 --lr 0.1 --lr-factor 0.1 --network resnet --gpus 0,1,2,3
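
For context, my understanding (recalled from memory, not copied from the script) is that the example wires --gc-type / --gc-threshold into the KVStore via set_gradient_compression, roughly like this:

```python
import mxnet as mx

# Simplified sketch of how I understand the gradient-compression flags reach
# the KVStore in the example (the real wiring is in common/fit.py; the values
# below mirror my command line).
kv = mx.kvstore.create('dist_sync')   # requires the env vars set by launch.py
gc_type = '2bit'                      # --gc-type ('1bit' selects my custom implementation)
gc_threshold = 1.0                    # --gc-threshold
if gc_type != 'none':
    kv.set_gradient_compression({'type': gc_type, 'threshold': gc_threshold})
```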

Training Results

100th epoch:

|                     | No Quantization | 2bit Quantization | 1bit Quantization |
|---------------------|-----------------|-------------------|-------------------|
| time cost (seconds) | 19.27           | 19.777            | 18.545            |
| validation accuracy | 0.89122         | 0.887921          | 0.885871          |

150th epoch:

|                     | No Quantization | 2bit Quantization | 1bit Quantization |
|---------------------|-----------------|-------------------|-------------------|
| time cost (seconds) | 18.73           | 22.357            | 20.339            |
| validation accuracy | 0.92758         | 0.929688          | 0.929109          |

200th epoch:

|                     | No Quantization | 2bit Quantization | 1bit Quantization |
|---------------------|-----------------|-------------------|-------------------|
| time cost (seconds) | 19.048          | 18.846            | 19.649            |
| validation accuracy | 0.929988        | 0.935397          | 0.937500          |
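
To quantify "not significant": taking the per-epoch times above, the speedup (no-compression time divided by compressed time) ranges roughly from 0.84x to 1.04x, i.e. the compressed runs are sometimes even slower than the baseline:

```python
# Per-epoch time (seconds) at the 100th/150th/200th epoch, copied from the tables above.
baseline = [19.27, 18.73, 19.048]
two_bit  = [19.777, 22.357, 18.846]
one_bit  = [18.545, 20.339, 19.649]

for name, compressed in [("2bit", two_bit), ("1bit", one_bit)]:
    ratios = [b / c for b, c in zip(baseline, compressed)]
    print(name, ["%.2f" % r for r in ratios])
# 2bit ['0.97', '0.84', '1.01']
# 1bit ['1.04', '0.92', '0.97']
```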