How to efficiently build a mask from indexes?

Hi there.
I need to build a mask symbol out of a list of indexes.

For example, given a set of indexes idxs:

idxs = [[1,2],[2,3],[0,1]]

then I want a mask like the following:

mask = [[0,1,1,0],[0,0,1,1],[1,1,0,0]]

where idxs has shape [batch_size, num_indexes]
and mask has shape [batch_size, max_values]

At the moment I am using the following:

mask = mx.sym.sum(
    mx.sym.one_hot(idxs, depth=max_values), axis=1)
Unfortunately, this takes a lot of GPU memory: the intermediate one-hot tensor scales as batch_size * max_values * num_indexes. I was wondering if anyone here has ideas on how to do this in a more efficient way.
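For reference, here is a NumPy sketch (with the concrete shapes from the example above, not mx.sym) of building the same mask by direct index assignment, which avoids materialising the large one-hot intermediate:

```python
import numpy as np

# Example data from the post: batch_size=3, num_indexes=2, max_values=4
idxs = np.array([[1, 2], [2, 3], [0, 1]])
batch_size, num_indexes = idxs.shape
max_values = 4

# Build the mask by scattering 1s directly: this allocates only the
# batch_size x max_values output, never the
# batch_size x num_indexes x max_values one-hot intermediate.
mask = np.zeros((batch_size, max_values))
mask[np.arange(batch_size)[:, None], idxs] = 1

print(mask)
# [[0. 1. 1. 0.]
#  [0. 0. 1. 1.]
#  [1. 1. 0. 0.]]
```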

Hi @peller,

Sounds like sparse matrices would help out here. You can define the expected output much more directly using your indexes, and you’ll save tons of memory by avoiding the one-hot encoding. Check out the example below using mx.nd; the API should be the same for mx.sym.

import mxnet as mx

# indicates which indices and data entries belong to which rows
indptr = mx.nd.array([0, 2, 4, 6])
## row 0 is 0:2 from indices and data
## row 1 is 2:4 from indices and data
## row 2 is 4:6 from indices and data

# same as your `idxs` but flattened
indices = mx.nd.array([1, 2, 2, 3, 0, 1])

# all 1s in your example
data = mx.nd.array([1, 1, 1, 1, 1, 1])

a = mx.nd.sparse.csr_matrix((data, indices, indptr), shape=(3, 4))
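To make the indptr/indices/data semantics concrete, the plain-Python sketch below (no MXNet, purely illustrative) expands the same CSR triplet back into the dense mask:

```python
# Same CSR triplet as the MXNet example above
indptr = [0, 2, 4, 6]
indices = [1, 2, 2, 3, 0, 1]
data = [1, 1, 1, 1, 1, 1]
shape = (3, 4)

dense = [[0] * shape[1] for _ in range(shape[0])]
for row in range(shape[0]):
    # The columns and values for this row live in the half-open slice
    # indptr[row]:indptr[row + 1] of indices/data.
    for k in range(indptr[row], indptr[row + 1]):
        dense[row][indices[k]] = data[k]

print(dense)  # [[0, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
```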

Hi @thomelane, thanks for the reply. Unfortunately sparse arrays are not an option for me, as we are running on GPUs and I believe they are not supported there.

@eric-haibin-lin could you clarify if sparse arrays can be used on GPU?

csr_matrix is supported on GPU, but the scope is limited. You can do common operations such as sparse.where and contrib.SparseEmbedding, among others.

@peller what do you want to do with the mask? would sparse.where work for you?

Thank you for the replies!
I need to mask a large softmax layer. I am doing policy gradient, but not all the options are always available, so I am masking out the options that are not available. In practice, I implemented a numerically stable softmax that returns non-zero probabilities only for the indices contained in the idxs symbol, as in the example.
I believe sparse arrays are the right way to do it, but given that there is limited support I am not super-keen in following this path.
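For illustration, here is a NumPy sketch of a numerically stable masked softmax along the lines described above (the function name and exact masking scheme are mine, not the poster's actual code):

```python
import numpy as np

def masked_softmax(logits, idxs):
    """Softmax that assigns non-zero probability only to the columns
    listed in idxs (one row of allowed indices per sample)."""
    batch_size, max_values = logits.shape
    mask = np.zeros_like(logits, dtype=bool)
    mask[np.arange(batch_size)[:, None], idxs] = True

    # Set disallowed entries to -inf, then subtract the per-row max
    # over the allowed entries for numerical stability.
    masked_logits = np.where(mask, logits, -np.inf)
    shifted = masked_logits - masked_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) == 0, so masked entries vanish
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0, 4.0],
                   [0.5, 0.5, 0.5, 0.5],
                   [2.0, 1.0, 0.0, -1.0]])
idxs = np.array([[1, 2], [2, 3], [0, 1]])
probs = masked_softmax(logits, idxs)
# Each row sums to 1, with zero probability outside the allowed indices.
```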

Hi @peller, I’m interested in your masked softmax. I implemented sampled_softmax with sparse ndarray (on GPU). Is this similar to what you’re doing? Why do you need a mask in your case?

Hi, I think it is something much simpler. I have a bunch of options and contexts, and my model should output the best option given some context. I know at training time that some of the options are not available for some of the contexts, so I mask the output of my softmax to return non-zero probabilities only for the available options.
Does this make sense?

I see. Would it be helpful if MXNet supported elementwise multiplication of csr * dense = csr on GPU?
In this way you only need a sparse mask, and get a sparse output.
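To illustrate the proposed semantics on CPU, here is a SciPy sketch (SciPy stands in for the requested MXNet GPU feature; the data is made up) where an elementwise csr * dense product keeps the csr sparsity pattern:

```python
import numpy as np
import scipy.sparse as sp

# Sparse mask with the pattern from the example above
mask = sp.csr_matrix(np.array([[0, 1, 1, 0],
                               [0, 0, 1, 1],
                               [1, 1, 0, 0]], dtype=float))

# Hypothetical dense softmax output (uniform, just for illustration)
dense_softmax_out = np.full((3, 4), 0.25)

# Elementwise product: the result stays sparse, with non-zeros only
# where the mask has non-zeros.
result = mask.multiply(dense_softmax_out)
print(result.toarray())
```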

It would be very beneficial. I believe that it would basically solve my problem.

Cool. I’ve put down the feature request in