Hello everybody,
we’re training a network for a recommender system on `<user, item, score>` triplets. The core code of the `fit` method is as follows:
```python
for e in range(epochs):
    start = time.time()
    cumulative_loss = 0
    for i, batch in enumerate(train_iterator):
        # Forward + backward.
        with autograd.record():
            output = self.model(batch.data[0])
            loss = loss_fn(output, batch.label[0])
        # Calculate gradients.
        loss.backward()
        # Update parameters of the network.
        trainer_fn.step(batch_size)
        # Calculate training metrics. Sum losses of every batch.
        cumulative_loss += nd.mean(loss).asscalar()
    train_iterator.reset()
```
where `train_iterator` is a custom iterator class that inherits from `mx.io.DataIter` and returns the data (the `<user, item, score>` triples) already placed in the appropriate context, as:
```python
# Slice the labels out before overwriting `data`, so the original array is still available.
labels = [mx.nd.array(data[:, -1], self.ctx)]
data = [mx.nd.array(data[:, :-1], self.ctx, dtype=np.int)]
return mx.io.DataBatch(data, labels)
```
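For reference, a minimal sketch of what such an iterator might look like; the class name, the batch slicing, and the `provide_data`/`provide_label` shapes are assumptions for illustration, not the actual implementation:

```python
import numpy as np
import mxnet as mx


class TripletIter(mx.io.DataIter):
    """Hypothetical iterator over <user, item, score> triplets stored in one NumPy array."""

    def __init__(self, triplets, batch_size, ctx):
        super(TripletIter, self).__init__(batch_size)
        self.triplets = triplets          # shape (N, 3): user, item, score
        self.batch_size = batch_size
        self.ctx = ctx
        self.cursor = 0

    @property
    def provide_data(self):
        return [mx.io.DataDesc('data', (self.batch_size, 2), dtype=np.int64)]

    @property
    def provide_label(self):
        return [mx.io.DataDesc('label', (self.batch_size,))]

    def reset(self):
        self.cursor = 0

    def next(self):
        if self.cursor >= len(self.triplets):
            raise StopIteration
        data = self.triplets[self.cursor:self.cursor + self.batch_size]
        self.cursor += self.batch_size
        # Copy the batch straight to the target context (e.g. mx.gpu(0)).
        labels = [mx.nd.array(data[:, -1], self.ctx)]
        data = [mx.nd.array(data[:, :-1], self.ctx, dtype=np.int)]
        return mx.io.DataBatch(data, labels)
```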
`self.model.initialize(ctx=mx.gpu(0))` was also called before running the `fit` method, and `loss_fn = gluon.loss.L1Loss()`.
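For completeness, the surrounding setup looks roughly like this; the embedding-based model architecture, the optimizer, and the learning rate are assumptions for illustration and not taken from the actual code:

```python
import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)

# Hypothetical matrix-factorization-style network; the real model may differ.
model = gluon.nn.HybridSequential()
with model.name_scope():
    model.add(gluon.nn.Embedding(input_dim=100000, output_dim=64))
    model.add(gluon.nn.Flatten())
    model.add(gluon.nn.Dense(64, activation='relu'))
    model.add(gluon.nn.Dense(1))

model.initialize(ctx=ctx)
loss_fn = gluon.loss.L1Loss()

# 'adam' and the learning rate are placeholders, not from the original post.
trainer_fn = gluon.Trainer(model.collect_params(), 'adam', {'learning_rate': 0.001})
```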
The trouble is that `nvidia-smi` reports that the process is correctly allocated on the GPU. However, running `fit` on the GPU is not much faster than running it on the CPU. In addition, increasing `batch_size` from 50,000 to 500,000 increases the time per batch by a factor of 10, which I was not expecting given GPU parallelization.
Specifically, for a 50k batch:

- `output = self.model(batch.data[0])` takes 0.03 seconds on GPU and 0.08 on CPU.
- `loss.backward()` takes 0.11 seconds on GPU and 0.39 on CPU.

Both were measured with `nd.waitall()` to avoid asynchronous calls leading to incorrect measurements.
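The timings were taken with a pattern roughly like the sketch below, which is assumed to sit inside the training loop shown above; the timing variables are illustrative, not the actual benchmarking code:

```python
import time
from mxnet import autograd, nd

nd.waitall()                      # flush any pending asynchronous work first
t0 = time.time()
with autograd.record():
    output = self.model(batch.data[0])
    loss = loss_fn(output, batch.label[0])
nd.waitall()                      # force the forward pass to finish before reading the clock
forward_time = time.time() - t0

t0 = time.time()
loss.backward()
nd.waitall()                      # force the backward pass to finish
backward_time = time.time() - t0
```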
In addition, very similar code running on plain MXNet took less than 0.03 seconds for the corresponding part, which means a full epoch goes from slightly above one minute with MXNet up to 15 minutes with Gluon.
Any ideas on what might be happening here?
Thanks in advance!