Predicting softmax with variable number of labels

I have an interesting (maybe?) use case. I don’t think it’s been covered in previous threads, but apologies if this is a duplicate.

To make the problem concrete, let me first describe a model. Suppose I have the following softmax model with feature function F and weights W; given x, the probability of label y is:

h(x, y) = exp( F(x,y) * W ) / ( sum_{y'} exp( F(x,y') * W ) )

Note that F returns features of both the input x and the output y. A simple example is when y is a one-hot encoding of the label and F(x, y) returns the Kronecker product of x and y. In this case, the model is equivalent to a regular logit model. But you can imagine a case where F returns more complex features, such as if y has structure/attributes.
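To make that concrete, here is a tiny NumPy sketch (dimensions and names are made up) of the Kronecker-product case; the score F(x, y) * W is just the usual logit score for label y:

import numpy as np

num_features, num_labels = 4, 3
x = np.random.randn(num_features)
W = np.random.randn(num_labels * num_features)   # flattened weight vector

def F(x, y_onehot):
    # Kronecker product: x's features placed in the block selected by y
    return np.kron(y_onehot, x)

# F(x, y) @ W equals x @ W_y, i.e. the standard multinomial logit score for label y
scores = np.array([F(x, np.eye(num_labels)[y]) @ W for y in range(num_labels)])
probs = np.exp(scores - scores.max())
probs /= probs.sum()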

The key complication is this: for any given input, the eligible labels may change. For example, maybe certain labels are incompatible with certain inputs. With Y(x) denoting the eligible labels for x, the output is defined as:

h(x, y) = exp( F(x,y) * W ) / ( sum_{y' in Y(x)} exp( F(x,y') * W ) )

where y is assumed to be in Y(x). Note that the partition function (the sum in the denominator) has changed: it now runs only over the eligible labels.
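For reference, the restricted model is easy to write down naively, one example at a time; a minimal plain-Python sketch (names are made up):

import numpy as np

def restricted_softmax(scores, eligible):
    # scores: preactivations F(x, y) * W for every label; eligible: the indices in Y(x)
    s = scores[np.asarray(eligible)]
    p = np.exp(s - s.max())              # partition function runs over Y(x) only
    return dict(zip(eligible, p / p.sum()))

# e.g. labels 0, 2 and 3 are eligible for this input
probs = restricted_softmax(np.array([1.0, 5.0, 1.0, 0.5]), [0, 2, 3])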

My question to the forum is: what is the best, most efficient way to implement this in MXNet?

We can assume that the maximum number of eligible labels is bounded, which means that we can just treat this as the original model with appropriate padding/masking. Accordingly, the features can be precomputed and stored in a matrix:

Fxy = nd.zeros((batch_size, max_num_labels, num_features))

Maybe sparse data structures would help if the eligible labels are sparse. Anyway, what is the right way to mask the unused features? Is there a built-in MXNet feature for this? Or is zero padding the best way?
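For concreteness, this is how I currently fill the padded tensor and a matching 0/1 mask from ragged per-example features (a sketch; shapes and names are illustrative):

import numpy as np
from mxnet import nd

batch_size, max_num_labels, num_features = 2, 4, 3

# Ragged per-example features, one (num_eligible_labels[i], num_features) block each
ragged = [np.random.randn(3, num_features), np.random.randn(2, num_features)]

Fxy = nd.zeros((batch_size, max_num_labels, num_features))   # zero-padded features
mask = nd.zeros((batch_size, max_num_labels))                # 1 = eligible, 0 = padding
for i, f in enumerate(ragged):
    Fxy[i, :f.shape[0], :] = nd.array(f)
    mask[i, :f.shape[0]] = 1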

Given Fxy, predicting the preactivations is simple:

net = Dense(1, use_bias=False)   # one weight vector W; outputs the scalar score F(x, y) * W
preact = net(nd.concat(*Fxy, dim=0)).reshape((batch_size, max_num_labels))

Masking is needed when computing the softmax. Mathematically, we can set the preactivation to -inf wherever the label is unavailable. This seems to work in MXNet; e.g.,

>>> import numpy as np
>>> from mxnet import nd
>>> nd.softmax(nd.array([1, 1, -np.inf]))

[0.5 0.5 0. ]
<NDArray 3 @cpu(0)>

Is there a built-in mask for the softmax function? If not, it would be nice to have.
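The closest thing I have found is nd.SequenceMask, which can fill the padded positions with a very negative value, although it assumes the eligible labels are packed at the front and, by default, that the label axis comes first; a sketch under those assumptions:

from mxnet import nd

# Scores with the label axis first: (max_num_labels, batch_size)
preact = nd.array([[2.0, 1.0],
                   [1.0, 3.0],
                   [0.5, 0.0]])
lengths = nd.array([3, 2])   # number of eligible labels per example (packed at the front)

# Fill the padded positions with a very negative value, then softmax over the label axis
masked = nd.SequenceMask(preact, sequence_length=lengths, use_sequence_length=True, value=-1e18)
probs = nd.softmax(masked, axis=0)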

The problem with padding/masking is that it wastes memory and computation on the ineligible labels. It would make more sense to store the features as a ragged list of 2-D arrays:

Fx = [nd.zeros((num_eligible_labels[i], num_features)) for i in range(batch_size)]

As far as I know, this can’t be converted to an NDArray, so it can’t be loaded onto the GPU, and one can’t do batch prediction. But is anything like this possible?
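The best I can come up with is to concatenate all eligible-label features into one flat matrix, score it in a single pass, and then split the scores back per example; a rough sketch (names are made up, and the per-example softmax loop is the non-batched part):

from mxnet import nd
from mxnet.gluon import nn

num_features = 5
net = nn.Dense(1, use_bias=False)   # scores each feature row, i.e. F(x, y) * W
net.initialize()

# Ragged features: one (num_eligible_labels[i], num_features) block per example
Fx = [nd.random.normal(shape=(3, num_features)),
      nd.random.normal(shape=(2, num_features))]

flat = nd.concat(*Fx, dim=0)             # (total_eligible_labels, num_features)
scores = net(flat).reshape((-1,))        # one scalar score per (x, y) pair

offsets, probs = [0], []
for f in Fx:
    offsets.append(offsets[-1] + f.shape[0])
for i in range(len(Fx)):
    probs.append(nd.softmax(scores[offsets[i]:offsets[i + 1]]))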

== Ben

For those interested, my current solution is to use the following prediction function.

def predict_softmax(net, features, mask):
    preact = net(nd.concat(*features, dim=0)).reshape_like(mask)   # (batch_size, max_num_labels)
    return nd.softmax(preact * mask + (mask - 1) * 1e10)           # padded entries get -1e10 and vanish

The value 1e10 stands in for infinity; using np.inf directly doesn't work because (0 * inf) evaluates to NaN.
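For reference, a quick usage sketch with toy shapes, assuming net maps each feature row to a single score (e.g. Dense(1)):

from mxnet import nd
from mxnet.gluon import nn

batch_size, max_num_labels, num_features = 2, 4, 3
net = nn.Dense(1, use_bias=False)
net.initialize()

Fxy = nd.random.normal(shape=(batch_size, max_num_labels, num_features))
mask = nd.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]])   # 1 = eligible, 0 = padded

probs = predict_softmax(net, Fxy, mask)   # (batch_size, max_num_labels); padded entries ~ 0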

== Ben

I am not sure if there are more “elegant” ways, but in gluonnlp we apply the mask to the softmax this way: https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/model/attention_cell.py#L47-L50
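The pattern there is roughly the following (a paraphrase, not the exact source): replace the masked scores with a large negative constant via nd.where, then take the softmax.

from mxnet import nd

preact = nd.array([[2.0, 1.0, 0.5],
                   [1.0, 3.0, 0.0]])
mask = nd.array([[1, 1, 0],
                 [1, 0, 1]])

# Keep the score where mask == 1, substitute a very negative number elsewhere
masked = nd.where(mask, preact, -1e18 * nd.ones_like(preact))
probs = nd.softmax(masked)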

It looks to me like you're on the right track.

As an alternative, I just tried out the following approach using log_softmax for better numerical stability. Although it adds a small epsilon to the unmasked scores (via log(1 + epsilon) ≈ epsilon), this should be negligible.

import mxnet as mx
import numpy as np

epsilon = np.finfo(float).eps
data = mx.nd.random.normal(shape=(2, 3, 4))
mask = mx.nd.random.randint(low=0, high=2, shape=(2, 3, 4)).astype('float32')
# log(mask + eps) adds ~0 where mask == 1 and a large negative number where mask == 0
log_softmax = mx.nd.log_softmax(data + (mask + epsilon).log())

I like your solution better. Using nd.where() seems to be no faster than addition, but it’s easier to understand and slightly more robust to user error (like using a mask value other than 1).
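In case it's useful to anyone else, here is a quick way to sanity-check that the addition trick and the nd.where version give the same probabilities (toy data, names made up):

from mxnet import nd

preact = nd.random.normal(shape=(2, 4))
mask = nd.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]])

p_add = nd.softmax(preact * mask + (mask - 1) * 1e10)                       # addition trick
p_where = nd.softmax(nd.where(mask, preact, -1e18 * nd.ones_like(preact)))  # nd.where version
print(nd.abs(p_add - p_where).max())   # should print ~0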