Using multiple Trainers?

Hi!

I want to optimize two different objective functions alternately with respect to two different but overlapping parameter sets, and the approach I currently take is to create two Trainers (SGD). However, with two Trainers the GPU memory seems to run out easily. Could you explain what exactly happens when I use two Trainers? Does an additional Trainer really mean much more memory consumption? If so, what is the best way to achieve the goal described above? Thanks!
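To make the setup concrete, here is a minimal sketch of the kind of alternating scheme I mean, assuming MXNet Gluon; the network, the parameter-selection regexes, and the two losses below are just placeholders, not my real model:

```python
import mxnet as mx
from mxnet import gluon, autograd

ctx = mx.gpu()

# Placeholder network; the real model is larger.
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Dense(128, activation='relu'),
        gluon.nn.Dense(64, activation='relu'),
        gluon.nn.Dense(10))
net.initialize(mx.init.Xavier(), ctx=ctx)

# Two overlapping parameter sets, selected by (placeholder) name regexes.
params_a = net.collect_params('.*dense0.*|.*dense1.*')  # objective 1 updates these
params_b = net.collect_params('.*dense1.*|.*dense2.*')  # objective 2 updates these; dense1 overlaps

trainer_a = gluon.Trainer(params_a, 'sgd', {'learning_rate': 0.01})
trainer_b = gluon.Trainer(params_b, 'sgd', {'learning_rate': 0.01})

loss_a = gluon.loss.SoftmaxCrossEntropyLoss()  # placeholder for objective 1
loss_b = gluon.loss.L2Loss()                   # placeholder for objective 2

def step(trainer, loss_fn, data, label, batch_size):
    """One update of a single objective with its own Trainer."""
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(batch_size)

# In the training loop I alternate between the two objectives, roughly:
#   step(trainer_a, loss_a, data, label_a, batch_size)
#   step(trainer_b, loss_b, data, label_b, batch_size)
```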

I just found out that I had made a mistake that led to much larger batches than I expected, which might explain why the memory was running out.

@jason_yu Adding to your comment about batch sizes, some common batch sizes are powers of 2 from 32 to 512.

The mini-batch size (B in Eq. (1)) is typically chosen between 1 and a few hundreds, e.g. B = 32 is a good default value, with values above 10 taking advantage of the speed-up of matrix-matrix products over matrix-vector products. The impact of B is mostly computational, i.e., larger B yield faster computation (with appropriate implementations) but requires visiting more examples in order to reach the same error, since there are less updates per epoch.

from https://arxiv.org/pdf/1206.5533.pdf
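As a small related sketch: in Gluon the batch size is usually fixed on the DataLoader and then passed to trainer.step() so the gradient is normalized by it, and checking the shape of each batch is a quick way to catch an unexpectedly large effective batch size (the toy dataset below is just a placeholder):

```python
import mxnet as mx
from mxnet import gluon

batch_size = 32  # the "good default value" from the quote above

# Toy placeholder data; substitute your own Dataset here.
X = mx.nd.random.uniform(shape=(1000, 20))
y = mx.nd.random.uniform(shape=(1000, 1))
loader = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y),
                               batch_size=batch_size, shuffle=True)

for data, label in loader:
    # data.shape[0] is the effective batch size; the last batch may be smaller.
    # Printing or asserting it is a cheap way to spot a batching mistake.
    assert data.shape[0] <= batch_size
    # ... forward/backward ...
    # trainer.step(data.shape[0])  # normalize the gradient by the actual batch size
```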

Also good to look at: https://arxiv.org/pdf/1609.04836.pdf