Naive Bayes Classification

In the definition of the bayespost(data) function, the x in logpost += (logpx * x + logpxneg * (1-x)).sum(0) should be data.

There are certain terms, like softmax, that I think a beginner will not know. Will this concept be covered later, or is there a resource where we can learn about it?

I got confused about the meaning of the notations P(x), P(y), P(x|y), etc.

To all: this chapter has been rewritten to be more beginner-friendly. (You may need to force a refresh in case a cached version is shown.)

I don’t really get the Bayes prediction code in Section 2.5.4:

def bayes_pred(x):
    x = x.expand_dims(axis=0)  # (28, 28) -> (1, 28, 28)
    p_xy = P_xy * x + (1-P_xy)*(1-x)
    p_xy = p_xy.reshape((10,-1)).prod(axis=1) # p(x|y)
    return p_xy * P_y

What is line 3 doing, and how are we getting the value of p(x|y)?

The original shape of p_xy is (10, 28, 28), so line 3 reshapes p_xy into (10, 784) and then multiplies all 784 probabilities together for each class.
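To make the shapes concrete, here is a minimal sketch of the same steps in plain NumPy rather than MXNet's nd. P_xy and x below are random, made-up stand-ins (not the trained values from the book), so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the book's arrays (random, not estimated from data).
P_xy = rng.uniform(0.2, 0.8, size=(10, 28, 28))            # p(x_i = 1 | y) per class and pixel
x = rng.integers(0, 2, size=(28, 28)).astype(np.float64)   # one binary image

x = np.expand_dims(x, axis=0)              # (28, 28) -> (1, 28, 28)
p_xy = P_xy * x + (1 - P_xy) * (1 - x)     # broadcasts to (10, 28, 28)
flat = p_xy.reshape((10, -1))              # (10, 784): one row of pixel probabilities per class
p_x_given_y = flat.prod(axis=1)            # (10,): product over the 784 pixels, i.e. p(x | y)

print(p_xy.shape, flat.shape, p_x_given_y.shape)
```

Each entry of p_x_given_y is a product of 784 numbers below 1, so the values are extremely small, which is why the log version is usually preferred.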

If we can estimate \prod_i p(x_i=1 | y) for every i and y, and save its value in P_{xy}[i,y], here P_{xy} is a d\times n matrix with n being the number of classes and y\in\{1,\ldots,n\}.

It seems that \prod_i p(x_i=1 | y) should be p(x_i=1 | y) instead.


we could compute \hat{y} = \operatorname*{argmax}_y \prod_{i=1}^d P_{xy}[x_i, y]P_y[y], (2.5.5)

This equation seems incorrect. It should probably be something like:

\hat{y} = \operatorname*{argmax}_y \prod_{i=1}^d (x_iP_{xy}[i, y] + (1 - x_i)(1 - P_{xy}[i, y]))P_y[y]

I do not understand this line:

p_xy = P_xy * x + (1-P_xy)*(1-x)

It is not explained in the surrounding text.

Since x_i can only be 1 or 0, we should have

p(x_i | y) = p(x_i = 1 | y) \delta(x_i - 1) + p(x_i = 0 | y) \delta(x_i - 0) = p(x_i = 1 | y) x_i + p(x_i = 0 | y) (1-x_i)
p(x_i | y) = p(x_i = 1 | y) x_i + (1 - p(x_i = 1 | y)) (1-x_i)

If P_{xy}[i, y] represents p(x_i = 1| y), we have

\hat{y} = \operatorname*{argmax}_y \> \prod_{i=1}^d (P_{xy}[i, y]x_i + (1-P_{xy}[i,y])(1-x_i))P_y[y],

For the log case,

\hat{y} = \operatorname*{argmax}_y \> \sum_{i=1}^d \left( x_i \log P_{xy}[i, y] + (1-x_i) \log (1 - P_{xy}[i, y]) \right) + \log P_y[y].
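As a sanity check that the product form and the log form pick the same class, here is a small NumPy sketch. The parameters are random, made-up values rather than estimates from data, and P_xy is stored as a d x n matrix to match the indexing P_{xy}[i, y] used above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, d = 10, 784

# Hypothetical parameters: random values standing in for estimated probabilities.
P_xy = rng.uniform(0.2, 0.8, size=(d, n_classes))    # P_xy[i, y] = p(x_i = 1 | y)
P_y = np.full(n_classes, 1.0 / n_classes)            # uniform class prior
x = rng.integers(0, 2, size=d).astype(np.float64)    # binary feature vector

xi = x[:, None]                                      # (d, 1) so it broadcasts over classes

# Product form: prod_i (P x_i + (1 - P)(1 - x_i)) * P_y
prod_score = (P_xy * xi + (1 - P_xy) * (1 - xi)).prod(axis=0) * P_y

# Log form: sum_i (x_i log P + (1 - x_i) log(1 - P)) + log P_y
log_score = (xi * np.log(P_xy) + (1 - xi) * np.log(1 - P_xy)).sum(axis=0) + np.log(P_y)

print(prod_score.argmax(), log_score.argmax())
```

The log is monotonic, so the argmax is unchanged; the log scores just stay in a comfortable numeric range while the raw products shrink toward zero as d grows.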


Hi Yayun, thank you for your reply. Basically it is just a mathematical transformation, right?

Hi mru4913, you are welcome. I think it is. Just keep in mind that our goal is to find the value of p(x_i | y). I was confused at first, but p(x_i = 1 | y) reminded me that x_i could also be 0, and then I realized that we also need p(x_i = 0 | y).

What is the answer to the 3rd exercise question?

What is delta in this case?

n_x[y] = nd.array(X.asnumpy()[Y==y].sum(axis=0))
In this line, why does one have to convert X to NumPy and then index it? Why not index it directly, like X[Y==y]?
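I can't say for certain why the book goes through asnumpy(). One possible reason, which is an assumption I have not verified, is that the nd array type did not accept a boolean mask as an index. What the line computes is easy to see in plain NumPy, though. The toy X and Y below are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 6 "images" of 2x2 binary pixels with labels in {0, 1, 2}.
X = np.array([[[1, 0], [0, 1]],
              [[1, 1], [0, 0]],
              [[0, 0], [1, 1]],
              [[1, 0], [1, 0]],
              [[0, 1], [0, 1]],
              [[1, 1], [1, 1]]], dtype=np.float64)
Y = np.array([0, 1, 0, 2, 1, 0])

n_x = np.zeros((3, 2, 2))
for y in range(3):
    # Y == y is a boolean mask over examples; X[mask] keeps only the images of class y.
    # Summing over axis 0 counts, per pixel, how many of those images have the pixel on.
    n_x[y] = X[Y == y].sum(axis=0)

print(n_x)
```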

I think it is a bug. To be consistent with the code snippet later which is used to demo the trick of avoiding underflow and overflow, the code here should be

p_xy = P_xy ** x * (1-P_xy)**(1-x)

Yes, I think so.
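For what it's worth, when x is binary the expression P_xy * x + (1-P_xy)*(1-x) and the Bernoulli power form P_xy ** x * (1-P_xy) ** (1-x) give the same numbers, and the log of the power form is exactly the x*log(P) + (1-x)*log(1-P) expression used in the log-space snippet. A quick NumPy check, with made-up values for P and x:

```python
import numpy as np

P = np.array([0.1, 0.5, 0.9, 0.3])   # hypothetical values of p(x_i = 1 | y)
x = np.array([1.0, 0.0, 0.0, 1.0])   # binary pixels

linear = P * x + (1 - P) * (1 - x)   # picks P where x == 1 and 1 - P where x == 0
power = P ** x * (1 - P) ** (1 - x)  # Bernoulli form: same selection via exponents 0 and 1

print(linear)
print(power)
print(np.allclose(np.log(power), x * np.log(P) + (1 - x) * np.log(1 - P)))
```

So the two ways of writing it are interchangeable for 0/1 inputs; they only differ if x takes values other than 0 and 1.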

\prod_i p(x_i=1 | y)

should be

p(x_i=1 | y)