http://d2l.ai/chapter_linear-networks/linear-regression.html
What makes linear regression linear is that we assume that the output truly can be expressed as a linear combination of the input features.
This seems different from what I’ve learned: linear regression is linear w.r.t. the parameters, not the inputs.
It is indeed the case that linear regression assumes the output is a linear combination of the input features. This means your model applies no transformations to the input features other than multiplication by a scalar and addition.
Of course, you can change what you mean by ‘input features’ and apply other (non-linear) transformations to the features first, but strictly speaking that step is not part of the linear regression itself. Saying that linear regression is linear w.r.t. the parameters is just a shorthand way of saying that only linear combinations of the inputs, with the parameters as coefficients, are allowed.
I have a question about regression using mxnet. I am trying to fit an expression like this, where w0 and x0 should be scalars:
auto stage1 = w0 * x0 * x0;
auto net = LinearRegressionOutput("linreg", stage1, Symbol::Variable("label"));
My problem is that it’s not converging, so I paste the whole code just in case.
#include <iostream>
#include <map>
#include <vector>
#include "mxnet-cpp/MxNetCpp.h"

using namespace std;
using namespace mxnet::cpp;

int main(int argc, char** argv)
{
    const float learning_rate = 0.01;
    vector<mx_float> input =
    {
        1.0,
        3.0,
        5.3,
        8.0,
        6.0
    };
    vector<mx_float> output =
    {
        3.1415,
        9.4245,
        16.64995,
        25.132,
        18.849
    };
    Context ctx0 = Context::cpu();
    auto x0 = Symbol::Variable("x");
    auto w0 = Symbol::Variable("w0");
    auto stage1 = w0 * x0 * x0;
    auto net = LinearRegressionOutput("linreg", stage1, Symbol::Variable("label"));
    NDArray daInputs = NDArray(input, Shape(1, input.size()), ctx0);
    NDArray daOutputs = NDArray(output, Shape(1, input.size()), ctx0);
    std::map<string, NDArray> args0;
    Optimizer* opt = OptimizerRegistry::Find("adam");
    opt->SetParam("lr", learning_rate); // ->SetParam("wd", weight_decay);
    int epoch = 100000;
    auto arg_names = net.ListArguments();
    while (epoch--)
    {
        for (size_t s = 0; s != input.size(); s++)
        {
            args0["x"] = NDArray({input[s]}, Shape(1, 1), ctx0);
            args0["label"] = NDArray({output[s]}, Shape(1, 1), ctx0);
            auto exec0 = net.SimpleBind(ctx0, args0);
            exec0->Forward(true);
            exec0->Backward();
            for (size_t i = 0; i != arg_names.size(); i++)
            {
                if (arg_names[i] == "w0")
                    opt->Update(i, exec0->arg_arrays[i], exec0->grad_arrays[i]);
                cout << arg_names[i] << "=" << exec0->arg_arrays[i] << endl;
            }
            delete exec0;
        }
    }
    return 0;
}
@mli In the gradient descent formula (3.1.10), aren’t you always changing all the coefficients together by the same value? Shouldn’t this step be performed for each coefficient separately?
Hi @matanper ,
Note that the value of the gradient depends on the value of \mathbf{w}, so in every iteration the values of \mathbf{w} and b are different, and therefore the gradient is different too. I think it would be clearer written this way:

\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}\big(\mathbf{w}^{(t)}, b^{(t)}\big), \qquad b^{(t+1)} = b^{(t)} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{b} l^{(i)}\big(\mathbf{w}^{(t)}, b^{(t)}\big)

This makes the iterative nature of \mathbf{w} and b explicit, so the values of the gradients change with them too (note that \mathbf{w}^{(t+1)} and b^{(t+1)} depend only on values at step t). Hopefully this makes it clearer for you.
Hi everybody,
Does anybody know the answer to the third question?
This is my attempt, but I don’t completely understand the question about the problem of SGD with Laplacian noise:
This is actually a Laplace distribution rather than an exponential distribution, and it implies L1-norm minimization, so there is no closed-form solution.
3.1
3.2 SGD for L1 (it is actually a subgradient; technically the absolute value is not differentiable at zero, so just take sgn(0) = 0):
What can go wrong, besides the same things as with the L2 loss (for example, too large a learning rate)?
@gpolo
3.1 I got the same result from my derivation. Since the first term is constant, L(w, b) reduces to L1-loss minimization.
3.3: Subgradient + update as defined by you. Coordinate descent can also be used here; if I remember correctly, glmnet uses coordinate descent to solve L1-regularized regression (same as a Laplace prior), as it converges faster.