Extreme memory usage

My MXNet code is either using an extreme amount of memory or failing to allocate memory for simple calculations.
This happens both on the CPU (with 16 GB of RAM available to the program) and on the GPU (6 GB of VRAM).

Currently my model ends in a dense layer that outputs an NDArray of 16384 units.

If I run this on my CPU (using my RAM), that output gets reshaped into a (16, 32, 32) array, on which the loss (L2Loss) is then calculated.
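Here is a minimal sketch of that setup. My real network is larger; the input size, the dummy data, and the dummy target below are placeholders, and only the final Dense(16384) plus the reshape and L2Loss matter:

import mxnet as mx
from mxnet import gluon, nd

ctx = mx.cpu()

# Hypothetical stand-in for the real network; only the final Dense layer matters here.
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Dense(16384))
net.initialize(ctx=ctx)

loss_fn = gluon.loss.L2Loss()

data = nd.random.uniform(shape=(1, 128), ctx=ctx)       # dummy input, placeholder size
label = nd.random.uniform(shape=(16, 32, 32), ctx=ctx)  # dummy target

out = net(data).reshape((16, 32, 32))   # 16384 units -> (16, 32, 32)
loss = loss_fn(out, label)              # a 16-element NDArray, like the one shown below
print(loss)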

An example of the output of the L2Loss function would be an NDArray like the following:

[0.13459471 0.16294248 0.16834821 0.1700921  0.1658175  0.13800527
 0.16994812 0.16490564 0.16852918 0.11953183 0.11743582 0.12893616
 0.15624203 0.15499672 0.14183497 0.12786008]

This seems fine so far, and MXNet is using a normal amount of memory. However, if I attempt to calculate the mean of this array (using NDArray's mean() implementation), it errors and outputs

mxnet.base.MXNetError: [23:21:36] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\./cpu_device_storage.h:72: Failed to allocate CPU Memory

despite neither my RAM nor my VRAM being anywhere near fully used up.
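The failing step is nothing more than the reduction below (loss being the NDArray from the sketch above); since MXNet executes operations asynchronously, the error may only surface when the result is synchronized, e.g. when printing it:

mean_loss = loss.mean()        # the call that triggers the allocation error
print(mean_loss.asscalar())    # asscalar() synchronizes, which is where the error appears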

This same problem also occurs when I use my GPU, except that there I can't even calculate the loss before it runs out of memory. Whether or not I call mx.nd.waitall() and mx.gpu(0).empty_cache(), it runs out of memory all the same when calculating the L2Loss.
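Roughly, the GPU attempt looks like this (reusing the sketch above; the waitall()/empty_cache() calls don't seem to make any difference):

ctx = mx.gpu(0)
net.collect_params().reset_ctx(ctx)              # move the sketch's parameters to the GPU

out = net(data.as_in_context(ctx)).reshape((16, 32, 32))

mx.nd.waitall()                                  # block until all queued operations finish
ctx.empty_cache()                                # release cached memory back to the device

loss = loss_fn(out, label.as_in_context(ctx))    # already runs out of memory here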

Is there anything obvious I'm running into that is causing these issues? I would think that if anything were draining my memory it would be the dense layer itself with its output, yet even if I ask for 65536 output units it handles that effortlessly. For some reason it runs out of memory when computing the mean of a 16-element array.
I’ll be happy to include more code if it’s necessary.

Update: I have changed my network so that it ends in a convolutional layer rather than a dense layer.
I don't know what was causing the issue, but it no longer crashes when computing the mean of an NDArray.
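A rough sketch of the kind of change (the kernel size, input shape, and dummy target are placeholders, not my exact network):

import mxnet as mx
from mxnet import gluon, nd

# Hypothetical replacement head: a Conv2D that produces 16 channels of 32x32 directly,
# instead of Dense(16384) followed by a reshape.
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(channels=16, kernel_size=3, padding=1))
net.initialize(ctx=mx.cpu())

data = nd.random.uniform(shape=(1, 1, 32, 32))   # dummy NCHW input
out = net(data)                                  # shape (1, 16, 32, 32)
loss = gluon.loss.L2Loss()(out, nd.zeros_like(out))
print(loss.mean().asscalar())                    # computes without the allocation error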