It seems the text here is suggesting that cross-entropy can be used in multi-label classification tasks, which by itself makes sense to me.

However, the surrounding context is all about binary or multiclass classification, so I'm not sure I'm interpreting the highlighted part correctly.

Can someone clarify this for me?

In the binary case, y is either 0 or 1, and for each class we get the following loss:

Then we can generalize to multiple classes, where the total loss is the sum over classes of each class's probability times the loss for that class (the formula above). This is also an expectation, since the probabilities add up to 1. (Think of the y_j in your snapshot as the probability that y_j = 1.)
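Written out, this is roughly what I mean. This is a sketch assuming the usual cross-entropy notation, with ŷ_j the predicted probability of class j and q the number of classes; the per-class loss −log ŷ_j is my reading and may not match the exact form in your snapshot.

```latex
% Per-class loss, incurred if class j is the true class
\ell_j = -\log \hat{y}_j

% Total loss: per-class losses weighted by the label probabilities y_j
l(\mathbf{y}, \hat{\mathbf{y}}) = \sum_{j=1}^{q} y_j \, \ell_j
                                = -\sum_{j=1}^{q} y_j \log \hat{y}_j

% Since \sum_j y_j = 1, this is an expectation over classes drawn from y
l(\mathbf{y}, \hat{\mathbf{y}}) = \mathbb{E}_{j \sim \mathbf{y}}\left[ -\log \hat{y}_j \right]
```

With a one-hot y, only the term for the true class survives, which recovers the familiar −log ŷ for that class.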

That part makes sense to me.

But what confuses me is that in 3.4.2.3, instead of binary entries, we can also use a generic probability vector like (0.1, 0.2, 0.7) as the label. Such a vector does not make sense to me when we are dealing with either binary or multiclass classification. The closest situation I can think of where it would apply is multi-label classification.
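To make the question concrete, here is a minimal sketch of what I understand the loss to compute in both cases. The helper `cross_entropy` and the prediction `y_hat` are made up for illustration (they are not from the book), and I'm assuming the loss is -sum_j y_j * log(y_hat_j).

```python
import math


def cross_entropy(y, y_hat):
    """Cross-entropy -sum_j y_j * log(y_hat_j) for a single example.

    Terms with y_j == 0 are skipped, since they contribute nothing.
    """
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)


# Hypothetical predicted probabilities over three classes.
y_hat = [0.2, 0.3, 0.5]

# One-hot label: the example belongs to the third class.
one_hot = [0.0, 0.0, 1.0]

# Generic probability vector used as the label.
soft = [0.1, 0.2, 0.7]

loss_one_hot = cross_entropy(one_hot, y_hat)  # reduces to -log(0.5)
loss_soft = cross_entropy(soft, y_hat)        # weighted sum of all three -log terms

print(loss_one_hot, loss_soft)
```

Both calls run fine numerically, so my question is about what a label like (0.1, 0.2, 0.7) is supposed to mean, not whether the loss can be computed.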