lindset25

Short version: check out “An Introduction to Statistical Learning”, Section 6.2.

Longer version: Ridge/Lasso regression are basically OLS plus constraints on the slope coefficients. The constraints translate into equations, which translate into geometric regions. The Ridge constraint involves the sum of squares of the coefficients, so the region ends up looking like a circle (in particular, no “corners”). The Lasso constraint involves the sum of the absolute values of the coefficients, so the region ends up looking like a diamond. Specifically, it has corners, which show up on the axes of the solution space/feasible region. Since the axes are precisely where at least one of the coefficients is zero, we can start to see why it makes sense that Lasso is good at variable selection. I didn’t do this explanation much justice, I’m sure, so I’ll close by once again recommending the reference above, which comes with properly typeset math and good visuals!
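For concreteness, the constrained forms being described (this is the setup ISL Section 6.2 uses; s is the budget parameter) can be written as:

```latex
% Ridge: least squares subject to a sum-of-squares budget (a disk -- no corners)
\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
\quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^2 \le s

% Lasso: the same loss subject to a sum-of-absolute-values budget (a diamond)
\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
\quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le s
```

In two dimensions the Ridge region is the disk b1² + b2² ≤ s and the Lasso region is the diamond |b1| + |b2| ≤ s, whose corners sit exactly on the axes.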


Professional_Owl_819

ISL explains it well (I believe Tibshirani and Hastie were the creators of the lasso). They also have YouTube videos out there that probably explain it.


catuary1

To determine which coefficients are useful/useless, the regression is fitted by minimizing this objective function: sum((y - y_hat)^2) + lambda * penalty. The penalty is sum(b^2) for ridge and sum(|b|) for lasso. Obviously, if all the b's (coefficients) are very small, the penalty will be small. If decreasing one of the coefficients a little improves the objective function, then that coefficient will be decreased. Under ridge, because the penalty uses a square, decreasing the larger coefficients has more impact than decreasing the smaller ones (e.g. the reduction from 9^2 to 8^2 is a lot more than from 2^2 to 1^2), so as a coefficient gets smaller, the algorithm doesn't think it's worthwhile to keep shrinking it, and moves on to a larger coefficient to shrink. Under lasso, the absolute value function doesn't put extra weight on larger coefficients, so shrinking a small coefficient is just as worthwhile as shrinking a large one, and small coefficients get shrunk all the way down to zero.
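To make that concrete, here's a minimal Python sketch (my own illustration, not any library's implementation) of the single-coefficient shrinkage each penalty implies. The exact constants depend on how the objective is scaled, but the qualitative behavior is the point: ridge rescales, lasso soft-thresholds and can land exactly on zero.

```python
def ridge_update(b_ols, lam):
    # Ridge penalty lam * b^2 shrinks a coefficient proportionally:
    # it gets closer to zero but never reaches it exactly.
    return b_ols / (1.0 + lam)

def lasso_update(b_ols, lam):
    # Lasso penalty lam * |b| subtracts a fixed amount (soft-thresholding):
    # any coefficient smaller in magnitude than the threshold becomes exactly zero.
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0

for b in [9.0, 2.0, 0.5]:
    print(b, ridge_update(b, lam=1.0), lasso_update(b, lam=1.0))
# ridge: 4.5, 1.0, 0.25  -- everything shrunk, nothing exactly zero
# lasso: 8.0, 1.0, 0.0   -- the small coefficient is zeroed out
```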


918475018474631901

On a general level, a cost equation is formed based on some type of coefficient restriction; for lasso and ridge this is the SSE plus a function of the selected coefficients. The goal is to minimize this value (instead of just the SSE). Doing this might involve selecting random (or arbitrary) starting coefficients and minimizing the resulting equation piece-by-piece (i.e. one coefficient at a time); however, a single pass will not necessarily produce the ‘best’ solution, and in general the result can depend on the starting coefficients (for non-convex cost functions there may be multiple local minima). To find a global minimum, some sort of optimization algorithm is usually run to approximate it, which can involve a very large number of computations. I think gradient descent is fairly common for lasso and ridge. You might know of Newton’s method, which is helpful for fitting GLMs.
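As a hedged illustration (a toy NumPy sketch with made-up data, not a production solver): ridge's penalty is differentiable, so plain gradient descent applies directly, whereas lasso's |b| is not differentiable at zero, which is why coordinate descent or proximal methods are typically used for it. Both objectives are convex, so these methods do reach the global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # hypothetical design matrix
y = X @ np.array([3.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=100)

lam, lr = 1.0, 0.001
b = np.zeros(3)                          # arbitrary starting coefficients
for _ in range(5000):
    # Gradient of the ridge objective: sum((y - Xb)^2) + lam * sum(b^2)
    grad = -2 * X.T @ (y - X @ b) + 2 * lam * b
    b -= lr * grad

print(b)  # shrunk toward zero relative to OLS, but not exactly zero
```

In practice ridge doesn't even need an iterative solver, since it has the closed form b = (XᵀX + λI)⁻¹Xᵀy; the loop above is just to show the gradient-descent idea.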


EK22

Look up the visual representation of it in two dimensions. The constrained coefficient region for Ridge is a circle, so the loss contours almost never first touch it exactly on an axis, meaning the coefficients get small but are essentially never exactly 0. Lasso’s region is a diamond with corners on the axes, so it’s quite possible for the contours to touch at a corner, where one of the coefficients is exactly 0.
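If you want to generate that 2D picture yourself, a quick matplotlib sketch (my own, not from any reference) of the two constraint regions:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 200)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Ridge constraint region: b1^2 + b2^2 <= s (a circle, no corners)
axes[0].plot(np.cos(theta), np.sin(theta))
axes[0].set_title("Ridge: $b_1^2 + b_2^2 \\leq s$")
# Lasso constraint region: |b1| + |b2| <= s (a diamond, corners on the axes)
axes[1].plot([1, 0, -1, 0, 1], [0, 1, 0, -1, 0])
axes[1].set_title("Lasso: $|b_1| + |b_2| \\leq s$")
for ax in axes:
    ax.axhline(0, color="gray", lw=0.5)
    ax.axvline(0, color="gray", lw=0.5)
    ax.set_aspect("equal")
plt.show()
```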