Correct way to train Sequential() model on GPU

Hi, I’ve been trying to train a model using the GPU on a server, but I’m getting the error:

Check failed: e == CUBLAS_STATUS_SUCCESS (13 vs. 0) : cuBLAS: CUBLAS_STATUS_EXECUTION_FAILED

I found a few similar topics on the forum, and they all seem to point to problems with the installed CUDA versions.
I’m using the following (the second line is the command I used to install cudatoolkit):

mxnet-cu90mkl==1.5.1.post0
conda install -c anaconda cudatoolkit==9.0

but on the same machine, in the same environment, I have already trained some GluonCV object detection models on the GPU, so I guess the problem is in my code.
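(As a quick sanity check of the environment itself, I guess one could run a minimal cuBLAS operation on the same device outside the training script, something like the sketch below; if that already failed, the problem would be in the CUDA setup rather than in the code.)

    import mxnet as mx

    ctx = mx.gpu(2)  # same device index used in the training script
    a = mx.nd.random.uniform(shape=(4, 4), ctx=ctx)
    b = mx.nd.random.uniform(shape=(4, 4), ctx=ctx)
    c = mx.nd.dot(a, b)   # dense matrix multiply, goes through cuBLAS
    print(c.asnumpy())    # forces the asynchronous computation to complete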
It’s an older MXNet version (compared to the one used, for example, in the d2l book), so I’m not using mxnet.np, but:

from mxnet import ndarray as nd

Anyway, this is the code:

import mxnet as mx
import numpy as np
import pandas as pd
from mxnet import autograd, gluon, init
from mxnet import ndarray as nd
from mxnet.gluon import nn
from mxnet.gluon import loss as gloss


def main():

    ## read, preprocess and split data
    df_data = pd.read_csv('some_file.csv')
    df_data = pre_process(df_data)
    X_train, y_train, X_test, y_test = split_data(df_data)


    ## example hyperparameter values (placeholders)
    lr, batch_size, nr_epochs = 0.01, 64, 100
    train(X_train, X_test, y_train, y_test, lr, batch_size, nr_epochs)


def train(X_train, X_test, y_train, y_test, lr, batch_size, nr_epochs):
    ctx = mx.gpu(2)
    y_train = mx.nd.array(y_train.to_numpy().reshape(-1,1), dtype=np.float32, ctx=ctx)
    y_test = mx.nd.array(y_test.to_numpy().reshape(-1,1), dtype=np.float32, ctx=ctx)
    X_train = mx.nd.array(X_train.to_numpy(), dtype=np.float32, ctx=ctx)
    X_test = mx.nd.array(X_test.to_numpy(), dtype=np.float32, ctx=ctx)

    ##--------------------
    ##   building model
    ##--------------------
    batch = batch_size
    epochs = nr_epochs
    dataset = gluon.data.dataset.ArrayDataset(X_train, y_train)
    data_loader = gluon.data.DataLoader(dataset, batch_size=batch, shuffle=True)

    model = nn.Sequential()
    model.add(nn.Dense(64, activation='relu'))
    model.add(nn.Dense(1))
    model.initialize(init.Normal(sigma=0.01), ctx)
    model.collect_params().reset_ctx(ctx)
    loss = gloss.L2Loss()
    trainer = gluon.Trainer(model.collect_params(), 'sgd', {'learning_rate': lr})

    ##--------------------
    ##   training
    ##--------------------
    for epoch in range(1, epochs + 1):
        for X_batch, Y_batch in data_loader:
            X_batch = X_batch.as_in_context(ctx)
            Y_batch = Y_batch.as_in_context(ctx)

            with autograd.record():
                l = loss(model(X_batch), Y_batch)
            l.backward()
            trainer.step(batch)
            print(nd.sum(l).asscalar())    # <-- ERROR HERE!!!

        ## I also tried, at the end of each epoch, the following variations (one at a time)!!!
        y_pred_train = model(X_train)
        l_train = loss(y_pred_train, y_train)
        ## variation 1: just pull the raw losses back to the CPU
        l_train = l_train.asnumpy()
        ## variation 2: compute the RMSE and convert it with asscalar()
        l_train = (nd.sqrt(nd.sum(l_train * 2) / l_train.shape[0])).asscalar()
        ## variation 3: the same, but converting with asnumpy()
        l_train = (nd.sqrt(nd.sum(l_train * 2) / l_train.shape[0])).asnumpy()


As you can see, I tried a bunch of variations using asscalar() and asnumpy(), but the result is always the same: I get the same error whenever I try to print the loss.
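(What I’m ultimately trying to get is the training RMSE from the L2Loss values; a minimal sketch of that computation, keeping everything in nd until the final conversion, would be:)

    ## L2Loss is 0.5 * (pred - label)^2, so 2 * mean(loss) is the MSE
    y_pred_train = model(X_train)
    l_train = loss(y_pred_train, y_train)
    rmse_train = nd.sqrt(nd.mean(l_train * 2)).asscalar()  # copies the scalar back to the CPU
    print(rmse_train)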

I’ll extract a sample of the CSV file… how can I share it here, by the way?

Remove this part of the code (the ctx parameter in initialize takes care of it):

    model.collect_params().reset_ctx(ctx)

Also, move your data to the GPU when you want to train:

    for X_batch, Y_batch in data_loader:
        X_batch = X_batch.as_in_context(ctx)
        Y_batch = Y_batch.as_in_context(ctx)

I’m still having the same issue:

File "/home/carlo/.conda/envs/d2l/lib/python3.8/site-packages/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [14:20:17] src/operator/contrib/./../linalg_impl.h:213: Check failed: e == CUBLAS_STATUS_SUCCESS (13 vs. 0) : cuBLAS: CUBLAS_STATUS_EXECUTION_FAILED

The problem seems to occur whenever I try to access the values of the loss. Do I need to do anything specific about that?

Also, is it necessary to use:

    X_batch = X_batch.as_in_context(ctx)
    Y_batch = Y_batch.as_in_context(ctx)

considering that, at the beginning, I’m specifying the context for X_train, y_train…?

    y_train = mx.nd.array(y_train.to_numpy().reshape(-1,1), dtype=np.float32, ctx=ctx)
    y_test = mx.nd.array(y_test.to_numpy().reshape(-1,1), dtype=np.float32, ctx=ctx)
    X_train = mx.nd.array(X_train.to_numpy(), dtype=np.float32, ctx=ctx)
    X_test = mx.nd.array(X_test.to_numpy(), dtype=np.float32, ctx=ctx)
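(One thing I could check is the context the batches actually come out of the DataLoader with, e.g. with a quick sketch like:)

    for X_batch, Y_batch in data_loader:
        print(X_batch.context, Y_batch.context)  # device each batch lives on
        break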

What are the versions of your MXNet, CUDA, cuDNN, etc.?
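For example, most of that can be printed from Python (a sketch; feature_list() only shows the compile-time features of the binary, such as CUDA/CUDNN support, not the driver versions):

    import mxnet as mx
    from mxnet.runtime import feature_list

    print(mx.__version__)          # MXNet version
    print(mx.context.num_gpus())   # number of GPUs MXNet can see
    print(feature_list())          # compile-time features (CUDA, CUDNN, MKLDNN, ...)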

It would also be nice to share a bit of your ‘some_file.csv’ so that your issue can be reproduced. MXNet runs asynchronously in the backend, and your loss lives in the ctx you set (on the GPU in your case). You might need to do print(l.asnumpy()) to synchronize at the point where you want to check the loss and bring the data back to CPU memory.
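In your loop that would look roughly like this (a sketch with your variable names; the asnumpy() call is the synchronization point):

    with autograd.record():
        l = loss(model(X_batch), Y_batch)
    l.backward()
    trainer.step(batch)
    print(l.asnumpy().sum())  # waits for the GPU computation and copies the loss to CPU memory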

I added the info to the original post and also tried @TristonC's suggestion, but it’s still the same.

Have you tried MXNet 1.7? I would also suggest downloading an MXNet Docker container and then running your script in the container.