- The training set has 12,000 samples whereas the test set has 4,000 samples. Intuitively either the training set should be weighted x1/3 or the test set should be weighted x3. How do I implement this in Gluon? Is using only 4,000 samples from the training set considered a valid method for “re-weighting data”?
- The textbook says we need to get function f. How do I obtain this function once I’m done training the classifier? Is it some attribute of the net instance?
- It says “Use the scores to compute weights on the training set”. What do you mean by the “scores”? My understanding is that once I have f I can compute \exp(f(x_i)), which is then multiplied by `loss(net(X), y)`, so why do these “scores” even matter?
- According to the textbook it is better to use \min(\exp(f(x_i)),c). What is c?

I’m not totally sure I understand your first question. For the second, f is simply the output of your network (so call `net(x)`), and I believe the scores just refer to the outputs f(x_i). For the last question, c is a constant that caps the weights, so that the weights (and hence the weighted loss) do not become unbounded, which could happen when f(x_i) outputs extremely large values. That can occur as training progresses and the network gets better at separating the two sets.
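To make the role of the scores and of c concrete, here is a minimal sketch in plain Python. The score values and the choice c = 10 are made up for illustration; in practice the scores would come from `net(x)` evaluated on the training samples.

```python
import math

def importance_weights(scores, c):
    """Return min(exp(f(x_i)), c) for each score f(x_i); assumes c > 0."""
    log_c = math.log(c)
    weights = []
    for s in scores:
        # Compare in log space so a very large score cannot overflow math.exp.
        if s >= log_c:
            weights.append(c)
        else:
            weights.append(math.exp(s))
    return weights

scores = [-2.0, 0.0, 1.0, 50.0]  # hypothetical classifier outputs f(x_i)
weights = importance_weights(scores, c=10.0)
print(weights)
```

The last score would give a weight of exp(50) without the cap; with it, that sample's weight is simply c.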

I think the first question is asking about the hint in part 2, where it says we need to weight the data before training the binary classifier.

The first question was about how to weight the data so that a sample from the test set matters much more than a sample from the training set, since the training set is three times the size of the test set. My gut tells me that when I compute `loss(net(X), y)` I need to multiply this by 3 if `y` is 1 (i.e. it’s from the test set). Is this the right approach?
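One way to picture the idea of multiplying the per-sample loss by a set-dependent factor is the sketch below. It is plain Python rather than Gluon, it assumes labels in {-1, +1} with +1 for test samples, and the helper names `class_weights` and `weighted_mean_loss` are made up for illustration. As far as I know, Gluon's built-in loss blocks also take an optional `sample_weight` argument when called, which may be a convenient place to pass such weights, but check the API docs for your version.

```python
import math

def logistic_loss(score, label):
    """Logistic loss log(1 + exp(-y * f(x))) for a label y in {-1, +1}."""
    z = -label * score
    # Numerically stable form of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def class_weights(n_train, n_test):
    # Training samples (label -1) keep weight 1; test samples (label +1)
    # are scaled so that both sets carry the same total weight.
    return {-1: 1.0, 1: n_train / n_test}

def weighted_mean_loss(scores, labels, weights):
    total = sum(weights[y] * logistic_loss(f, y) for f, y in zip(scores, labels))
    return total / sum(weights[y] for y in labels)

w = class_weights(12000, 4000)       # test samples get weight 3.0
scores = [2.0, -1.0, -1.0, -1.0]     # hypothetical outputs of net(X)
labels = [1, -1, -1, -1]             # one test sample, three training samples
batch_loss = weighted_mean_loss(scores, labels, w)
print(w[1], batch_loss)
```

In this toy batch the single test sample counts as much as the three training samples combined, which is the balance the factor of 3 is meant to restore.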

Also, for question 4, how do I compute c?

The re-weighting occurs when you train the classifier that distinguishes the training set from the test set. From slide 47 in the lecture, we defined the distribution:

r(x,y) = \frac{1}{2}[p(x)\delta(y,1) + q(x)\delta(y,-1)]= \frac{1}{2}p(x)\delta(y,1) + \frac{1}{2}q(x)\delta(y,-1)

where the \frac{1}{2} comes from the assumption that the training and test sets are the same size. If they aren’t the same size, how would you want to re-weight this data distribution?
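As a sketch of where this leads (the symbols n_p and n_q are introduced here and are not from the slides): if n_p samples are drawn from p and n_q from q, pooling them without any weights gives the mixture on the first line below, and weighting each sample inversely to the size of its own set brings the coefficients back to \frac{1}{2}.

```latex
% Unweighted pooling of n_p samples from p and n_q samples from q:
r_{\text{pooled}}(x,y) = \frac{n_p}{n_p+n_q}\, p(x)\,\delta(y,1) + \frac{n_q}{n_p+n_q}\, q(x)\,\delta(y,-1)

% Per-sample weights w_p \propto 1/n_p and w_q \propto 1/n_q restore the balanced form:
r(x,y) = \frac{1}{2}\, p(x)\,\delta(y,1) + \frac{1}{2}\, q(x)\,\delta(y,-1),
\qquad \frac{w_p}{w_q} = \frac{n_q}{n_p}
```

With 12,000 samples in one set and 4,000 in the other, the weight ratio is 3, which matches the intuition in the original question.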

You can choose c somewhat arbitrarily… but think about what happens when you choose c very large/very small.
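To get a feel for the two extremes, a small experiment like the following may help. The scores and the three values of c are made up, and `summarize` is a hypothetical helper that reports how many weights hit the cap and how far apart the largest and smallest weights end up.

```python
import math

def summarize(scores, c):
    """Return (number of clipped weights, max weight / min weight)."""
    raw = [math.exp(s) for s in scores]   # unclipped weights exp(f(x_i))
    w = [min(r, c) for r in raw]          # clipped weights min(exp(f(x_i)), c)
    n_clipped = sum(r > c for r in raw)
    return n_clipped, max(w) / min(w)

scores = [-3.0, -1.0, 0.0, 2.0, 6.0]      # hypothetical scores f(x_i)
results = {c: summarize(scores, c) for c in (0.01, 10.0, 1e6)}
for c, (n_clipped, ratio) in results.items():
    print(c, n_clipped, ratio)
```

With a very small c every weight is clipped to the same value, so the re-weighting does nothing. With a very large c nothing is clipped, and a few samples with large scores can dominate the loss.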