Can someone shed light on:
This means that the input into a network has 1 million dimensions. Even an aggressive reduction to 1,000 dimensions after the first layer means that we need 10^9 parameters.
I see that the input vector is 10^3 and the output vector is of course two. I don’t know for which sequential architecture we would need 10^9 parameters.
According to the text, the input vector has 1M dimensions (pixels), and the first (fully connected) layer reduces this to 1K dimensions (i.e. the layer has 1,000 neurons). The weight matrix of that layer therefore has 10^6 x 10^3 = 10^9 parameters.
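A quick way to check the arithmetic is to count the parameters of a single fully connected layer directly. This is only a sketch: the helper name `dense_param_count` is made up for illustration, and the sizes are the ones quoted from the text.

```python
def dense_param_count(n_in, n_out, bias=True):
    """Parameters of one fully connected layer: an n_in x n_out weight
    matrix, plus one bias per output unit if bias=True."""
    return n_in * n_out + (n_out if bias else 0)

n_pixels = 1_000_000  # 1M-dimensional input, as in the quoted text
n_hidden = 1_000      # first layer reduces to 1K dimensions

weights_only = dense_param_count(n_pixels, n_hidden, bias=False)
with_bias = dense_param_count(n_pixels, n_hidden)

print(weights_only)  # 1000000000
print(with_bias)     # 1000001000
```

The biases add only 1,000 more parameters, so the 10^9 figure is dominated entirely by the weight matrix of the first layer.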
“But of course, if those biases do not agree with reality, e.g. if images turned out not to be translation invariant,”
This sentence seems to be incomplete.
The equations in 6.1.2 do not include an explicit bias term, whereas the Conv2D implemented in 6.2.2 has one. I think it would be clearer to show the bias term in 6.1.2 as well. For example, we could add b[i,j] to the summation in each equation in 6.1.2. By translation invariance, b[i,j] cannot depend on i or j, so it is a constant.
Please let me know if my understanding is not correct.
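To make the suggestion concrete, here is a minimal pure-Python sketch of a 2D cross-correlation with one shared scalar bias. The function name `corr2d_with_bias` and the small example values are my own assumptions for illustration; this is not the book's implementation.

```python
def corr2d_with_bias(X, K, b=0.0):
    """Cross-correlate the 2D input X with kernel K and add a scalar bias b.

    The same b is added at every output position (i, j), which is what
    translation invariance implies for the bias term.
    """
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    # Start every output entry at the shared bias, then accumulate the sum.
    Y = [[b] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for a in range(h):
                for c in range(w):
                    Y[i][j] += K[a][c] * X[i + a][j + c]
    return Y

X = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0],
     [6.0, 7.0, 8.0]]
K = [[0.0, 1.0],
     [2.0, 3.0]]

Y = corr2d_with_bias(X, K, b=0.5)
print(Y)  # [[19.5, 25.5], [37.5, 43.5]]
```

With b=0.0 the output reduces to the plain cross-correlation of X and K, so the bias only shifts every entry by the same constant.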