GPU utilization is low when training a YOLOv3 network with GluonCV

When I train YOLOv3 with GluonCV, my 4 GPUs spend a long time waiting for data, even though I have already set num_workers to its maximum of 28.
How can I fix this? My loader is set up roughly as in the sketch below.
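
For reference, a minimal sketch of the loader setup (not my full script; the COCO dataset, 416x416 input size, and batch size of 64 are assumptions, but num_workers=28 matches what I described):

```python
from mxnet import gluon
from gluoncv import data as gdata, model_zoo
from gluoncv.data.transforms.presets.yolo import YOLO3DefaultTrainTransform
from gluoncv.data.batchify import Tuple, Stack, Pad

# Network is needed because YOLOv3 training targets are generated inside the transform.
net = model_zoo.get_model('yolo3_darknet53_coco', pretrained_base=False)
net.initialize()

train_dataset = gdata.COCODetection(splits=['instances_train2017'])

# Stack the six generated target arrays, pad the variable-length raw labels.
batchify_fn = Tuple(*([Stack() for _ in range(6)] + [Pad(axis=0, pad_val=-1)]))

train_loader = gluon.data.DataLoader(
    train_dataset.transform(YOLO3DefaultTrainTransform(416, 416, net)),
    batch_size=64, shuffle=True, batchify_fn=batchify_fn,
    last_batch='rollover', num_workers=28)  # 28 workers, GPUs still starve
```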

Bigger batches? Smaller num_workers? num_workers should not simply be as large as possible; there is a sweet spot to find.

Also, multi-GPU training is not always faster than single-GPU training, especially with modern GPUs like V100s. Inter-device communication carries a big penalty, and it only becomes negligible when the tasks involve a LOT more compute time than update/communication time. That happens with huge models (e.g. BERT), huge input records (e.g. computer vision on HD pictures), or very large batches.
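
One way to find that sweet spot is to time the data pipeline in isolation for a few candidate num_workers values and keep the fastest. A rough sketch below; the in-memory dummy dataset is only a stand-in for your real dataset and YOLO3DefaultTrainTransform, which is where the CPU time the workers parallelize actually goes:

```python
# Rough sketch: time one pass over the loader for several num_workers values.
# With the real dataset and augmentation transform the differences will be
# far more pronounced than with this dummy in-memory array.
import time
from mxnet import gluon, nd

dummy = gluon.data.ArrayDataset(nd.random.uniform(shape=(512, 3, 224, 224)),
                                nd.zeros((512, 1)))

for workers in (2, 4, 8, 16, 28):
    loader = gluon.data.DataLoader(dummy, batch_size=64, shuffle=True,
                                   num_workers=workers, last_batch='discard')
    start = time.time()
    for _ in loader:  # data loading only, no training step
        pass
    print('num_workers=%2d: %.2f s per pass' % (workers, time.time() - start))
```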