Alex's Notes

Regularization (Jurafsky Martin)

As discussed in Jurafsky Martin Chapter 05: Logistic Regression, in the section on Learning in Logistic Regression (Jurafsky Martin): Cross-Entropy Loss and Gradient Descent

The problem with learning weights that perfectly match the training data is that they will model noisy factors that accidentally correlate with the class. If a feature is perfectly predictive of the outcome because it happens in the training set to only occur in one class, it will be assigned a very high weight. This problem is overfitting. We need our model to generalize well from the training data to an unseen test set.

To avoid this we introduce a regularization term, \(R(\theta)\), to the objective function. The regularization term penalizes large weights - a setting of weights that matches the training data perfectly but uses many weights with high values to do so is penalized more than a setting that matches the data less well, but uses smaller weights. The objective function for a batch of m examples becomes:

\[\hat{\theta} = \underset{\theta}{\mathrm{argmax}} \left[ \sum^m_{i=1} \log{P(y^{(i)}|x^{(i)})} \right] - \alpha R(\theta)\]
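As a concrete sketch of this objective, here is a minimal NumPy version for binary logistic regression (the function name and toy setup are my own, not from the chapter):

```python
import numpy as np

def regularized_objective(theta, X, y, alpha, R):
    """Regularized log likelihood for binary logistic regression.

    theta: (n,) weights; X: (m, n) features; y: (m,) labels in {0, 1};
    alpha: regularization strength; R: penalty function of theta.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))            # sigmoid: P(y=1 | x)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return log_lik - alpha * R(theta)                 # subtract the penalty
```

With `alpha = 0` this reduces to the unregularized log likelihood; larger `alpha` penalizes a given set of large weights more heavily.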

There are two common ways to compute \(R(\theta)\):

L2 Regularization

L2 regularization uses the square of the L2 norm of the weight values, i.e. it's a quadratic function. The L2 norm is the Euclidean distance of the vector \(\theta\) from the origin, and is denoted as \(||\theta ||_2\). With \(n\) weights:

\[R(\theta) = ||\theta ||_2^2 = \sum^n_{j=1} \theta_j^2\]

In statistics, L2 regularization is known as ridge regression.
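In code the squared L2 norm is just the sum of squared weights (a trivial NumPy sketch of my own):

```python
import numpy as np

def l2_penalty(theta):
    # Squared L2 norm: sum of squared weights,
    # equivalent to np.linalg.norm(theta) ** 2
    return np.sum(theta ** 2)
```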

L1 Regularization

L1 regularization is a linear function of the weight values, named after the L1 norm, which is the sum of the absolute values of the weights, or the Manhattan distance from the origin, denoted as \(||\theta||_1\). With \(n\) weights:

\[R(\theta) = ||\theta ||_1 = \sum^n_{i=1} |\theta_i|\]

In statistics, L1 regularization is known as lasso regression.
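And the corresponding sketch for the L1 penalty (again, the function name is my own):

```python
import numpy as np

def l1_penalty(theta):
    # L1 norm: sum of absolute weight values
    # (the Manhattan distance from the origin)
    return np.sum(np.abs(theta))
```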

Comparing L1 and L2 Regularization

L2 regularization is easier to optimize because of its simple derivative (the derivative of \(\theta^2\) is just \(2\theta\)). L1 regularization is more complex to optimize (the derivative of \(|\theta |\) is discontinuous at 0).
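The difference shows up directly in the (sub)gradients; a quick sketch, with names of my own:

```python
import numpy as np

def l2_grad(theta):
    # Derivative of theta^2 is 2 * theta: smooth everywhere
    return 2.0 * theta

def l1_subgrad(theta):
    # |theta| has no derivative at 0; np.sign returns a valid
    # subgradient elementwise (and 0 at exactly 0)
    return np.sign(theta)
```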

L2 prefers weight vectors with many small weights; L1 prefers sparse solutions with some larger weights but many more weights set to 0. So L1 regularization leads to models with fewer features.
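One way to see why L1 yields exact zeros while L2 only shrinks (this uses the proximal-operator view of regularized updates, which goes beyond the chapter): an L1 step soft-thresholds each weight, snapping small weights to exactly 0, whereas an L2 step scales all weights multiplicatively, so they approach 0 but never land on it.

```python
import numpy as np

def l1_prox(theta, t):
    # Soft thresholding: any weight with |theta_j| <= t becomes exactly 0
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

def l2_shrink(theta, t):
    # L2 shrinkage: every weight scaled toward 0, never exactly 0
    return theta / (1.0 + 2.0 * t)
```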

Both can be interpreted in Bayesian terms as priors on how the weights should look: L1 regularization corresponds to a Laplace prior, while L2 is equivalent to assuming the weights are distributed according to a Gaussian distribution with a mean of 0. By using a Gaussian prior, we are saying that the weights prefer to have the value 0.