Creating an NDArray on GPU does not finish

I’m using the following code to create an array on the GPU, and the program never finishes. I get no error; the program just keeps running.

import mxnet as mx


The code runs as expected both on my other GPU and on the CPU. I’ve checked the GPU usage with nvidia-smi.exe, and the memory usage increases continuously from a starting value of approximately 90 MiB. nvidia-smi gives the following output:

Wed Jul  7 09:57:30 2021
| NVIDIA-SMI 471.11       Driver Version: 471.11       CUDA Version: 11.4     |
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro RTX 4000     TCC  | 00000000:65:00.0 Off |                  N/A |
| 30%   51C    P8    19W / 125W |    306MiB /  8063MiB |      0%      Default |
|                               |                      |                  N/A |
|   1  Quadro P400        WDDM  | 00000000:B3:00.0  On |                  N/A |
| 34%   38C    P8    N/A /  N/A |    216MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A      2960      C   ...iles\Python39\pythonw.exe      305MiB |

I’m using the following:

  • Windows Server 2019 Essentials

  • Quadro RTX 4000

  • NVIDIA Graphics driver 471.11

  • CUDA 10.2.86

  • Python 3.9.5

  • pip 21.1.1

  • mxnet-cu102 2.0.0b20201108

  • cuDNN

I’ve tested both TCC and WDDM modes on the GPU. I’ve also tried driver version 441.66, for which nvidia-smi displays CUDA version 10.2, but the issue persists. I’ve also tested mxnet-cu102 1.7.0. Both mxnet versions were downloaded and installed using pip. I’ve tried reinstalling all the programs. I run the code using IDLE.

@Elias The system has the latest CUDA. Could you try installing MXNet 1.8 with CUDA 11?

Is there a pip package somewhere for mxnet-cu11x for Windows? I was only able to find linux versions from PyPI and

Meanwhile, I tested CUDA 9.2 with both mxnet-cu92 1.6.0 and 1.7.0. The results were the same.

As far as I know, the CUDA version shown by nvidia-smi is incorrect and is a result of updating the drivers. I haven’t installed CUDA 11 (and the folder for that version does not exist). I’ve run the code with driver version 441.66, which shows the correct CUDA version, and the code still does not work.

I tried mxnet-cu92 1.5.0, and both GPUs work now. However, I’d like to use CUDA 10.2, as I have other software that uses it (and it bugs me that I don’t know what I’m doing wrong).

I think the problem comes from something that changed between mxnet 1.5.0 and mxnet 1.6.0. I checked the release notes of 1.5.1 and 1.6.0, and the only change that seems relevant to me is Has something else changed in between that I should check?

I tried using mxnet 1.6.0 and downgrading to cuDNN v. (had v8.2.2.26 before) but it didn’t help.

I also found a similar discussion: “Using mxnet with CUDA freezes python”.


Would you mind giving MXNet 1.8 a try if it is possible?

I cannot find a 1.8 pip package for Windows, and I don’t really know how to compile from source. I tried following the compiling instructions, but I didn’t succeed, and I don’t know how to find out what I’m doing wrong or what is not working.

I asked the same question here and found out that the array is created after roughly 15 min.
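For anyone who wants to check whether they are hitting the same delay, here is a minimal timing sketch. The original snippet from the first post was not preserved, so `make_gpu_array` is only an assumed stand-in (a small array of ones on GPU 0), and `time_call` is a hypothetical helper name, not part of MXNet.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def make_gpu_array():
    """Assumed reproduction: create a small array on GPU 0 and wait for it."""
    # Imported here so the timing helper can be used without MXNet installed.
    import mxnet as mx

    arr = mx.nd.ones((2, 3), ctx=mx.gpu(0))
    # MXNet executes asynchronously; block until the data actually exists.
    arr.wait_to_read()
    return arr
```

Running `_, seconds = time_call(make_gpu_array)` and printing `seconds` should show whether the first GPU call eventually returns and how long it takes. Be prepared to leave it running for a while.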

The problem can be fixed by setting the environment variable “CUDA_CACHE_MAXSIZE”. 1 GiB was a good value, at least for me. After this, the 15 min wait occurs only the first time a GPU is used, and subsequent runs are much faster, even if Python is closed in between.
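A minimal sketch of setting the variable from Python rather than system-wide. NVIDIA documents CUDA_CACHE_MAXSIZE as a size in bytes, so 1 GiB is 1073741824. The helper name `set_cuda_cache_size` is hypothetical, and the sketch assumes the variable must be in place before CUDA is initialized, which is why it is set before mxnet is imported.

```python
import os

ONE_GIB = 1024 ** 3  # 1 GiB expressed in bytes


def set_cuda_cache_size(size_bytes=ONE_GIB, env=os.environ):
    """Set CUDA_CACHE_MAXSIZE unless it is already set; return the value in effect."""
    # setdefault keeps any value the user has already configured system-wide.
    env.setdefault("CUDA_CACHE_MAXSIZE", str(size_bytes))
    return env["CUDA_CACHE_MAXSIZE"]


# Assumption: this has to happen before the first CUDA call in the process,
# so call it before importing mxnet.
set_cuda_cache_size()
# import mxnet as mx  # import only after the variable is set
```

On Windows, the variable can instead be set persistently, for example with `setx CUDA_CACHE_MAXSIZE 1073741824` in a command prompt, after which a newly started Python session should pick it up.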