Using multiple Trainers?


I want to optimize two different objective functions alternately w.r.t. two different but overlapping parameter sets, and the method I currently adopt is to create two Trainers (SGD). However, it seems that with two Trainers the GPU memory can easily run out. Could you explain what exactly happens when I use two Trainers? Does using an additional Trainer really mean much more memory consumption? If so, what is the best way to achieve the aforementioned goal? Thanks!
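To make the setup concrete, here is a minimal framework-agnostic sketch of the alternating scheme in plain NumPy (the objectives, parameter names, and learning rate are all illustrative, not taken from any actual code in this thread): `w` is shared by both objectives, while `u` and `v` each belong to only one of them, and each SGD step touches only the parameters of the objective being optimized.

```python
import numpy as np

# Illustrative objectives with one overlapping parameter w:
#   f1(w, u) = (w - 1)**2 + (u - 2)**2
#   f2(w, v) = (w + 1)**2 + (v - 3)**2
def grad_f1(w, u):
    return 2 * (w - 1), 2 * (u - 2)

def grad_f2(w, v):
    return 2 * (w + 1), 2 * (v - 3)

w, u, v = 0.0, 0.0, 0.0
lr = 0.1
for step in range(200):
    # Alternate: one SGD step on f1 w.r.t. (w, u) ...
    gw, gu = grad_f1(w, u)
    w -= lr * gw
    u -= lr * gu
    # ... then one SGD step on f2 w.r.t. (w, v).
    gw, gv = grad_f2(w, v)
    w -= lr * gw
    v -= lr * gv

# u and v converge to their own optima (2 and 3); the shared w
# oscillates near the point where the two objectives' pulls balance.
print(w, u, v)
```

The point of the sketch is that alternating updates themselves add no parameter copies; whether a second Trainer allocates extra per-parameter state (e.g. momentum buffers) depends on the optimizer it wraps.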

I just found out that I had made a mistake that led to much larger batches than I expected, which might explain why the memory ran out.
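A quick back-of-envelope check of why an accidentally large batch blows up memory: activation storage scales linearly with batch size. The per-example figure below is a made-up placeholder, purely for illustration.

```python
# Activation memory grows linearly with batch size, so a batch that is
# accidentally, say, 8x larger needs roughly 8x the activation memory.
def activation_bytes(batch_size, activations_per_example, bytes_per_value=4):
    # float32 activations by default (4 bytes per value)
    return batch_size * activations_per_example * bytes_per_value

MIB = 1024 ** 2
per_example = 5_000_000  # hypothetical: 5M float32 activations per example
for b in (32, 256, 1024):
    print(b, activation_bytes(b, per_example) / MIB, "MiB")
```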

@jason_yu Adding to your comment about batch sizes, some common batch sizes are powers of 2 from 32 to 512.

The mini-batch size (B in Eq. (1)) is typically chosen between 1 and a few hundreds, e.g. B = 32 is a good default value, with values above 10 taking advantage of the speed-up of matrix-matrix products over matrix-vector products. The impact of B is mostly computational, i.e., larger B yield faster computation (with appropriate implementations) but requires visiting more examples in order to reach the same error, since there are less updates per epoch.
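The matrix-matrix vs. matrix-vector point in the quote can be sketched in a few lines of NumPy (sizes here are arbitrary): processing a mini-batch as one matrix-matrix product gives the same result as B separate matrix-vector products, but hands BLAS a single large operation that it can execute with much better cache reuse.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 32, 256, 128
X = rng.standard_normal((B, d_in))   # one mini-batch of B examples
W = rng.standard_normal((d_in, d_out))

batched = X @ W                            # single matrix-matrix product
one_by_one = np.stack([x @ W for x in X])  # B matrix-vector products

print(np.allclose(batched, one_by_one))    # True: identical outputs
```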


Also good to look at: