I am implementing knowledge distillation-based DNN model training, as illustrated in the figure below, to run the teacher and student models (blue and green blocks) in parallel with the same data batch.
My plan is to put a lightweight pre-trained teacher model on the CPU, where it only runs the forward pass with frozen parameters. The student model is a large model to be trained on GPU(s).
I expect that offloading the light task (the teacher's forward pass) to the CPU will let it overlap with the heavy training task on the GPU, making the pipeline faster than running the two models sequentially, as many knowledge distillation projects do (see below).
This task is not for model compression.
I've checked some popular repos like NervanaSystems/distiller and peterliht/knowledge-distillation-pytorch. They execute the forward passes of the student and teacher models in sequence (line by line), not in parallel on different devices (GPU and CPU).
I am trying to speed up training by running the two models at the same time on multiple devices, i.e., loading the small, inference-only teacher on the CPU without interrupting the GPU training of the heavy student model.
What is the proper way to run two models (with the Module() API of MXNet 1.x) in parallel? Should I use the Python multiprocessing library? Any recommendation on how to create a process that loads the small teacher model and runs forward() on the same data batch?
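Concretely, the overlap pattern I have in mind looks roughly like the sketch below. It is only a minimal illustration of the process/queue structure: `teacher_worker`, `train`, and the queue protocol are my own names, and the teacher/student forward passes and the distillation loss are replaced with placeholder arithmetic (in the real pipeline they would be MXNet Module calls bound to `mx.cpu()` and `mx.gpu()` respectively).

```python
import multiprocessing as mp

def teacher_worker(in_q, out_q):
    # In the real setup this process would load the frozen teacher once,
    # before the loop (e.g. an MXNet Module bound to mx.cpu()).
    while True:
        item = in_q.get()
        if item is None:            # sentinel: shut down cleanly
            break
        batch_id, batch = item
        # Placeholder for teacher.forward(batch): just double the inputs.
        logits = [2.0 * x for x in batch]
        out_q.put((batch_id, logits))

def train(num_batches=3):
    in_q, out_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=teacher_worker, args=(in_q, out_q))
    worker.start()

    losses = []
    for i in range(num_batches):
        batch = [float(i), float(i) + 1.0]
        in_q.put((i, batch))        # teacher starts on its own process...
        # ...while the student's forward/backward would run here on the GPU.
        student_out = [x + 0.5 for x in batch]   # placeholder student step
        batch_id, teacher_logits = out_q.get()   # sync before the distillation loss
        assert batch_id == i
        # Placeholder distillation loss: mean squared difference of outputs.
        loss = sum((t - s) ** 2 for t, s in zip(teacher_logits, student_out)) / len(batch)
        losses.append(loss)

    in_q.put(None)                  # stop the worker
    worker.join()
    return losses

if __name__ == "__main__":
    print(train())
```

The idea is that `in_q.put(...)` returns immediately, so the teacher's forward pass proceeds in the worker process while the main process runs the student's training step, and the two synchronize only at `out_q.get()` when the distillation loss needs the teacher's outputs. I am unsure whether this is the idiomatic way to do it with MXNet's Module API, or whether passing batches through a `multiprocessing.Queue` adds too much serialization overhead.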