Data loading from .rec with multiprocessing?

I am working on a machine with 80 virtual cores (as reported by cpu_count()) and 6 NVIDIA 1080 GPUs.
I use mx.io.ImageRecordIter to load a .rec and .idx pair containing about 3 million images. During training, CPU usage goes up to 3000%. I found that with smaller batches and fewer preprocess_threads, usage drops back to around 600%, but that also slows down training. Is there a way to load my training data from the .rec and .idx files with multiprocessing so that I can split up the workload?
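
For reference, a minimal sketch of the kind of setup described above, not the actual training script. The file names, data_shape, batch size, thread count, and both helper names are hypothetical placeholders. num_parts and part_index are existing ImageRecordIter parameters that restrict an iterator to one shard of the record file, so each worker process could read its own part; whether that actually lowers total CPU usage on this machine is untested.

```python
def record_iter_kwargs(path_imgrec, path_imgidx, batch_size,
                       preprocess_threads, part_index=0, num_parts=1):
    """Build keyword arguments for mx.io.ImageRecordIter (hypothetical helper)."""
    return {
        "path_imgrec": path_imgrec,
        "path_imgidx": path_imgidx,
        "data_shape": (3, 224, 224),               # assumed image shape
        "batch_size": batch_size,
        "preprocess_threads": preprocess_threads,  # decode/augment threads per iterator
        "shuffle": True,
        "num_parts": num_parts,                    # total number of shards
        "part_index": part_index,                  # shard read by this iterator
    }


def make_record_iter(**kwargs):
    """Create the iterator; MXNet is imported lazily so the helper above runs without it."""
    import mxnet as mx
    return mx.io.ImageRecordIter(**kwargs)


# One shard per GPU worker process (6 GPUs); paths and sizes are placeholders.
shards = [
    record_iter_kwargs("train.rec", "train.idx", batch_size=64,
                       preprocess_threads=8, part_index=i, num_parts=6)
    for i in range(6)
]
```

Each worker process would then call make_record_iter(**shards[i]) for its own index i, so the preprocessing threads are spread across processes instead of all living in one.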