Learning in Logistic Regression (Jurafsky & Martin): Cross-Entropy Loss and Gradient Descent
In logistic regression we want to learn weights w and bias b that will make the system’s estimate \(\hat{y}\) as close as possible to the true y. We need two things for this:
A loss function or cost function: a metric for the distance from \(\hat{y}\) to y. Typically in logistic regression we use cross-entropy loss.
An optimization algorithm for iteratively updating the weights to minimize this loss function. The standard is gradient descent.
The loss function: Cross-Entropy Loss
We need a loss function that expresses, for an observation x, how close the classifier output (\(\hat{y} = \sigma (w \cdot x + b)\)) is to the correct output y (0 or 1). We call this: \(L(\hat{y},y)\).
We define a loss function that prefers the correct class labels of the training examples to be more likely. This is called conditional maximum likelihood estimation: we choose the parameters w,b that maximize the log probability of the true y labels in the training data given the observations x. The resulting loss function is the negative log likelihood loss, generally called the cross-entropy loss.
There are only two discrete outcomes, so this is a Bernoulli distribution. We can express the probability that our classifier produces for one observation as the following (note this simplifies to \(\hat{y}\) if \(y = 1\), or to \(1 - \hat{y}\) if \(y = 0\)):
\[p(y|x) = \hat{y}^y (1 - \hat{y})^{1-y}\]
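As a quick sanity check, the Bernoulli expression above can be evaluated directly. This is a minimal sketch (the value 0.7 for \(\hat{y}\) is just an illustrative choice):

```python
def bernoulli_prob(y_hat, y):
    """p(y|x) = y_hat^y * (1 - y_hat)^(1 - y), for y in {0, 1}."""
    return (y_hat ** y) * ((1 - y_hat) ** (1 - y))

# Simplifies to y_hat when y = 1 ...
print(bernoulli_prob(0.7, 1))  # 0.7
# ... and to (1 - y_hat) when y = 0
print(bernoulli_prob(0.7, 0))  # ~0.3 (up to floating-point rounding)
```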
We take the log of both sides, which is handy later, to give us a log likelihood that should be maximized (this is also the cross-entropy formula from which the function gets its name):
\[\log{p(y|x)} = \log{\left[ \hat{y}^y (1 - \hat{y})^{1-y} \right]}\]
\[= y \log{\hat{y}} + (1-y)\log (1-\hat{y})\]
We want to turn this into a loss function, i.e. one we need to minimize, so we flip the sign to give us the cross-entropy loss:
\[L_{CE}(\hat{y},y) = - \log{p(y|x)} = - \left[ y \log{\hat{y}} + (1-y)\log (1-\hat{y}) \right]\]
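The cross-entropy loss for a single observation is straightforward to compute. A minimal sketch (the probabilities 0.9 and 0.1 are illustrative values, not from the text):

```python
import math

def cross_entropy_loss(y_hat, y):
    """L_CE(y_hat, y) = -[ y*log(y_hat) + (1-y)*log(1-y_hat) ]."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident, correct prediction gives a small loss ...
print(cross_entropy_loss(0.9, 1))  # ~0.105
# ... while a confident, wrong prediction gives a large loss
print(cross_entropy_loss(0.1, 1))  # ~2.303
```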
Now we can add the definition of \(\hat{y}\), namely \(\sigma (w \cdot x + b)\):
\[L_{CE}(\hat{y},y) = - \left[ y \log{\sigma (w \cdot x + b)} + (1-y)\log (1-\sigma (w \cdot x + b)) \right]\]
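Putting the pieces together, the loss can be computed directly from the parameters w and b. A minimal sketch assuming plain Python lists for w and x (the particular values of w, x, and b below are hypothetical, chosen so that \(w \cdot x + b = 0\)):

```python
import math

def sigma(z):
    """Sigmoid function: 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

def loss_from_params(w, x, b, y):
    """L_CE computed from the parameters: y_hat = sigma(w . x + b)."""
    y_hat = sigma(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Here w . x + b = 0.5 - 0.6 + 0.1 = 0, so y_hat = sigma(0) = 0.5
print(loss_from_params([0.5, -0.3], [1.0, 2.0], 0.1, 1))  # -log(0.5) ~ 0.693
```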
A perfect classifier would assign probability 1 to the correct outcome (whether that is 0 or 1), and probability 0 to the incorrect one. The higher the probability the classifier assigns to the correct outcome, the better it is doing. The negative log of this probability ranges from 0 (negative log of 1, no loss) to infinity (negative log of 0, infinite loss). This makes it a convenient loss metric.
Next we’ll see how to find the minimum.
The optimization algorithm: Gradient Descent
Because gradient descent appears throughout machine learning and the discussion is detailed, it is covered in its own note: Gradient Descent
Regularization
The chapter moves on to consider the problem of overfitting and the Regularization steps that work around it.