Regularization in Machine Learning: Part 1
In the previous section, we learned about the basics of linear models and how they are optimized to find the parameters that transform the observed values to best predict our desired outcome. In this article we are going to build on this knowledge to investigate the generalizability of models, the so-called bias-variance trade-off, and how this relates to the concepts of overfitting and underfitting, which are central themes in machine learning. This chapter is dedicated to developing a deep intuition for what regularization is, when it's useful, and why it works. More specifically, we will cover:
- Understanding overfitting and underfitting, and the bias-variance trade-off
- Introducing the train-test split
- How to add an L2 penalty term to our linear regression model to constrain the model to generalize better to unseen data
- Optimizing the ridge regression model by computing the gradient manually
- Optional section: deriving the Normal Equation
- Understanding the regularization constraint and the method of Lagrange multipliers
- Changing the L2 penalty to an L1 penalty to perform Lasso regression
- Applying what we have learned so far to a real-world application of predicting house prices
Linear regression refresher
Remember that in Ordinary Least Squares (OLS) linear regression, our primary objective is to minimize the mean squared error (MSE) between our target variable $y$ and the predictions made using our feature data $X$, scaled by the coefficients $\beta$. More concretely, the column vector $y$ represents the response variable or dependent variable in linear regression models. It is often denoted as:
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \\ \end{bmatrix}$$
where $y_i$ represents the response variable for the $i$-th observation (or data point). Each element in the vector corresponds to the response variable for a specific observation.
The feature matrix, or design matrix, $X$ is defined as
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \\ \end{bmatrix}$$
where $ x_{ij}$ represents the $i$-th observation (or data point) for the $j$-th feature. Each row corresponds to an observation, and each column corresponds to a feature. To add a bias term, we append a column of all ones to the design matrix $X$. This column represents the intercept term or bias:
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \\ \end{bmatrix}$$
Finally, the loss function $\mathcal{L}$ is defined as:
$$\begin{align*}\mathcal{L} &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \textcolor{red}{\underbrace{(x_{i1}\beta_{1} + x_{i2}\beta_{2} + \ldots + x_{ip}\beta_{p})}_{\text{Estimation }\hat{y}}})^2 \\ &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \textcolor{red}{\sum_{j=1}^{p}x_{ij}\beta_{j}})^2 \\ &= \underbrace{\frac{1}{n} ||y - \textcolor{red}{X\beta}||^2_2}_{\text{Vector form}}\end{align*}$$
These coefficients represent the change in the target variable $y$ for a one-unit change in the corresponding feature $X$. More concretely, as an example, given two coefficients $\beta_1 = 5$ and $\beta_2 = -10$, for every one-unit increase in $X_1$, $y$ is expected to increase by 5 units, and for every one-unit increase in $X_2$, $y$ is expected to decrease by 10 units. Units here refers to whatever unit $X$ and $y$ were measured in.
During the optimization process, our objective is to find the coefficients that directly minimize the MSE. As just mentioned above, each coefficient $\beta_i$ determines the magnitude of the effect that its corresponding feature $x_i$ has on the target variable $y$, indicating the importance or contribution of that feature to the prediction. Therefore, the coefficients serve as parameters for a linear transformation of the feature space, allowing us to construct a model that explains $y$ based on the relationships between the features. These coefficients represent the slopes of the hyperplane (or line, in the case of a single feature) that best fits our data in the feature space defined by $X$.
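To make the notation concrete, here is a minimal NumPy sketch (with made-up numbers purely for illustration) that builds a design matrix with a bias column and evaluates the predictions $\hat{y} = X\beta$ and the MSE loss for a given coefficient vector:
import numpy as np
# Three observations and two features (made-up numbers for illustration)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
y = np.array([9.0, 7.5, 13.0])
# Append a column of ones so the first coefficient acts as the intercept
X_b = np.c_[np.ones(len(X)), X]
# Hypothetical coefficients: intercept 1, beta_1 = 3, beta_2 = 1
beta = np.array([1.0, 3.0, 1.0])
y_hat = X_b @ beta                # Predictions X @ beta
mse = np.mean((y - y_hat) ** 2)   # Mean squared error loss
print(y_hat, mse)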
Overfitting and underfitting a model
Imagine we have a true underlying dataset that should be modelled as $y= 3X + 5$, representing a linear relationship between $X$ and $y$. However, in the real world, we wouldn't know this underlying distribution. We aim to develop a model that accurately captures this relationship and not just the observations we've seen. If we overly tailor our model to fit the data we have, it's likely to struggle when presented with new, unseen data. This phenomenon is known as overfitting.
For instance, let's consider fitting our model with polynomials. The general model equation for polynomial regression with $N$ degrees can be expressed as:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_N x^N$$
Given $N$ observations, it's possible to perfectly overfit the data, achieving zero loss, by using a linear model with polynomial terms up to degree $N-1$. Let's apply this approach to overfit our data with just 10 observations:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_9 x^9$$
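As a small illustrative sketch (assuming noisy observations drawn from the $y = 3X + 5$ example above), fitting a degree-9 polynomial to 10 observations lets the model interpolate every point and drive the training loss to essentially zero:
import numpy as np
from numpy.polynomial import Polynomial
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=10))       # 10 observations
y = 3 * X + 5 + rng.normal(scale=2, size=10)  # True model y = 3X + 5 plus noise
p = Polynomial.fit(X, y, deg=9)               # Degree-9 polynomial: one coefficient per point
y_hat = p(X)
print(np.mean((y - y_hat) ** 2))              # Training MSE is (numerically) ~0
# Predictions for new points from the same distribution can be wildly off
print(p(np.array([0.25, 4.75])))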
Amazing! We found coefficients resulting in a loss of zero! However, this perfect fit comes at a cost: our model may perform poorly when presented with new data points drawn from the same underlying distribution. We can demonstrate this by drawing two new observations and plotting them (green points).
On the flip side, an underfit model fails to adequately capture the underlying pattern of the data, resulting in poor performance even on observed data. Consider, for instance, attempting to fit a straight line to the true underlying function $y = \cos(x)$. A linear model will invariably struggle to capture the cyclical nature of the function.
Earlier in this section, we discussed the challenge of modeling an underlying distribution that is almost always unknown in real-world scenarios. So, how do we determine whether our model is overfitting or underfitting? To address this, we employ a common practice known as data splitting. This involves dividing our dataset into two parts: one for training the model and the other for evaluating its performance after training.
We mitigate potential biases in our assessment by shuffling the data and then randomly partitioning it into training and testing sets. This randomization helps ensure that both sets contain a representative sample of observations from the dataset. Typically, the data is divided into 80% for training and 20% for testing. This allocation allows us to train the model on a majority of the data while retaining a separate portion to assess its generalization performance.
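A rough sketch of what such a split can look like from scratch (in practice a library helper such as scikit-learn's `train_test_split` does the same job):
import numpy as np
def train_test_split(X, y, test_size=0.2, seed=0):
    # Shuffle the row indices, then carve off the first chunk as the test set
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]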
Regularization refers to techniques used in machine learning to prevent overfitting and improve the generalization performance of models.
The bias-variance trade-off
The balance between overfitting and underfitting is referred to as the bias-variance trade-off. Let's briefly discuss what the terms bias and variance refer to in the context of machine learning.
Bias refers to the error introduced by approximating a real-world problem with a simplified model. A model with high bias makes strong assumptions about the underlying data and oversimplifies the relationship between features and the target variable. In other words, it fails to capture the true relationship between the variables. High bias often leads to underfitting, where the model performs poorly on both the training data and unseen data.
Variance, on the other hand, measures the variability of a model's predictions for a given data point. A model with high variance is sensitive to small fluctuations in the training data and can capture noise or random fluctuations as if they were part of the underlying pattern. High variance often leads to overfitting, where the model performs well on the training data but fails to generalize to new, unseen data.
To reduce bias, we aim to develop more complex models that can capture patterns and relationships in the data. For example, increasing the degree of a polynomial regression model (as shown above) or adding more features to a linear regression model (shown later) can help reduce bias by allowing the model to fit the training data more closely.
However, as we reduce bias by increasing model complexity, we often inadvertently increase variance. Complex models are more flexible and can fit the training data more closely, but they are also more susceptible to noise and fluctuations in the data. As a result, they may overfit to the training data and fail to generalize well to new, unseen data.
Achieving an optimal balance between bias and variance is essential for developing models that perform well on both observed and unseen data.
Intuition behind regularized regression
Imagine you've developed a linear model without any regularization, where the prediction is calculated as $x_1\beta_1 + x_2\beta_2 + b$. Now, let's introduce regularization by adding a penalty term to the loss function, which penalizes the sum of the squares of the non-intercept coefficients, $\lambda \sum_{j=1}^{2} \beta_j^2$:
$$\textcolor{orange}{\mathcal{L}} = \textcolor{red}{\underbrace{\sum_{i=1}^{n} (y_i - (x_{i1}\beta_1 + x_{i2}\beta_2 + b))^2}_{\text{Fitting term}}} + \textcolor{blue}{\underbrace{\lambda (\beta_1^2 + \beta_2^2)}_{\text{Penalty term}}}$$
where $\lambda$ is a scalar that controls how much of the penalty term is added.
Initially, adding this penalty term results in an increase in the loss. However, as we continue to optimize the loss function with the regularization term, we may find a way to lower the loss by shrinking the coefficients $\beta$ proportionally with their importance. For example, if $\beta_1$ is initially 10 and $\beta_2$ is 1, we can perhaps imagine that reducing $\beta_1$ to 5 and $\beta_2$ to 0.8 will still maintain a reasonably good fit. This leads to a decrease in the contribution to the loss from the regularization term (from $10^2 + 1^2 = 101$ down to $5^2 + 0.8^2 = 25.64$, if $\lambda = 1$).
As we increase the regularization multiplier $\lambda$, the optimization process prioritizes reducing the size of the coefficients, as large coefficients become increasingly costly. Eventually, less important features, such as $\beta_2$, may disappear or become negligible compared to more important features like $\beta_1$. If our linear model has many coefficients, then as $\lambda$ increases they will become negligible in order of importance, up to the point where having any sizable parameter at all is too expensive. In the rare case that all features contribute exactly the same to the predictive power of the model, they will all be shrunk by exactly the same amount.
Let's look at a concrete example using $y = 20 + 10X_1 - 2X_2$, where one dominant feature $X_1$ is 5 times as important to the model as the other complementary feature $X_2$.
This automatic selection of important features is one of the key benefits of using regularized regression models and is generally referred to as automatic feature selection.
Let's more concretely explore what is going on by looking at the simplest model possible with a single coefficient, $y = x\beta$, with $x = 2$ and $y = 10$, which means that the best possible fit for $10 = 2\beta$ occurs when $\beta = 5$. If we fix our current guess at $\beta = 3$ while increasing the regularization multiplier $\lambda$, we will observe a linear increase in the contribution of the penalty term without affecting the fitting term. Similarly, if we plot the fitting and penalty terms over the range $0 \leq \beta \leq 10$ while increasing the regularization multiplier $\lambda$, we observe the same thing: the fitting term is unaffected while the penalty term keeps growing.
This makes complete sense if we look at the regularized model definition as the $\lambda$ term is only present in the penalty term and is a scalar multiplier in front of the sum.
$$\sum_{i=1}^{n} (y_i - x_i\beta)^2 + \textcolor{blue}{\underbrace{\lambda \beta^2}_{\substack{\lambda\text{ is only present} \\ \text{in this term}}}}$$
Despite initially seeming counterintuitive, regularization ensures that the total loss is always larger than the fitting term, thereby preventing overfitting and improving the model's generalization ability.
The scalar $\lambda$ is not learnable during optimization and has to be chosen by the user and is therefore called a hyperparameter. We will discuss $\lambda$ in more detail later in this chapter.
Linear regression with regularization that penalizes the sum of the squares of the coefficients, $\lambda \sum_{j=1}^{p} \beta_j^2$, is called ridge regression. It has the general form:
$$\begin{align*}\textcolor{orange}{\mathcal{L}} &= \textcolor{red}{\underbrace{\sum_{i=1}^{n} (y_i - (\beta_0 + x_{i1}\beta_1 + \ldots + x_{ip} \beta_{p}))^2}_{\text{Fitting term}}} + \textcolor{blue}{\underbrace{\lambda (\beta_1^2 + \ldots + \beta_p^2)}_{\text{Penalty term}}} \\ &= \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \end{align*}$$
for $n$ training examples with $p$ features, intercept term $\beta_0$, and coefficients $\beta_1, \ldots, \beta_p$.
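To connect the formula to code, here is a minimal sketch (data shapes are assumptions) that evaluates the ridge loss for a given coefficient vector, penalizing everything except the intercept:
import numpy as np
def ridge_loss(X_b, y, beta, lam):
    # X_b is the design matrix with a leading column of ones; beta[0] is the
    # unpenalized intercept and beta[1:] are the penalized coefficients
    residuals = y - X_b @ beta
    fitting_term = np.sum(residuals ** 2)
    penalty_term = lam * np.sum(beta[1:] ** 2)
    return fitting_term + penalty_term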
In order to optimize this model, we must compute the partial derivatives of the loss function with respect to the parameters, as we did for linear regression (and any other machine learning model). Luckily, we learned how to do this in the previous chapter, so the following computations are hopefully straightforward.
The gradient: Deriving the Partial Derivatives
To make this exercise easier for ourselves, let's compute the partial derivatives of the loss function $\mathcal{L}$ with respect to the parameters $a$ and $b$ for the simple model $y = ax + b$:
$$\mathcal{L} = \sum_{i=1}^{n} (y_i - (ax_i + b))^2 + \lambda a^2$$
For this model, there are two partial derivatives we must compute: with respect to $a$ ($\frac{\partial \mathcal{L}}{\partial a}$) and with respect to $b$ ($\frac{\partial \mathcal{L}}{\partial b}$).
Partial derivative with respect to $a$
Step 1: Expand the Squared Term
Expand the squared term inside the summation:
$$\mathcal{L} = \sum_{i=1}^{n} \textcolor{red}{(\underbrace{y_i - (ax_i + b)}_{\text{Expand term}})^2} + \lambda a^2$$
$$\textcolor{red}{(y_i - (ax_i + b))^2 = (y_i - ax_i - b)^2}$$
Step 2: Differentiate with Respect to $a$
Apply the chain rule and differentiate the loss function with respect to $a$:
$$\frac{\partial \mathcal{L}}{\partial a} = \frac{\partial}{\partial a} \left( \sum_{i=1}^{n} (y_i - ax_i - b)^2 + \lambda a^2 \right)$$
Step 3: Apply the Sum Rule
If we want to compute the partial derivative of something that can be described as the sum of two functions then we can use the sum rule in differentiation. It states that the derivative of the sum of two functions is equal to the sum of their derivatives:
$$\frac{\partial}{\partial x}(\textcolor{red}{f(x)} + \textcolor{blue}{g(x)}) = \frac{\partial}{\partial x}\textcolor{red}{f(x)} + \frac{\partial}{\partial x}\textcolor{blue}{g(x)}$$
Applying this rule to our use-case, we end up with:
$$\begin{align*}\frac{\partial \mathcal{L}}{\partial a} &= \frac{\partial}{\partial a} \left( \sum_{i=1}^{n}\textcolor{red}{\overbrace{(y_i - ax_i - b)^2}^{f(x)}} + \textcolor{blue}{\overbrace{\lambda a^2}^{g(x)}} \right) \\ &= \sum_{i=1}^{n} \frac{\partial}{\partial a} (y_i - ax_i - b)^2 + \frac{\partial}{\partial a} \left( \lambda a^2 \right)\end{align*}$$
Step 4: Apply the Chain Rule Again
Apply the chain rule to differentiate the squared term:
$$\frac{\partial}{\partial a} (y_i - ax_i - b)^{\textcolor{red}{2}} = \underbrace{\textcolor{red}{2}(y_i - ax_i - b)}_{\text{Derivative of outside}} \textcolor{blue}{\underbrace{\frac{\partial}{\partial a} (y_i - ax_i - b)}_{\text{Derivative of inside}}}$$
Step 5: Compute the Partial Derivative
Now, compute the partial derivative of $(y_i - ax_i - b)$ with respect to $a$:
$$\textcolor{blue}{\frac{\partial}{\partial a} (\xcancel{y_i} - ax_i - \xcancel{b}) = -x_i}$$
Step 6: Substitute into the Partial Derivative Expression
Substitute the result from Step 5 into the expression obtained in Step 4:
$$\frac{\partial}{\partial a} (y_i - ax_i - b)^{\textcolor{red}{2}} = \textcolor{red}{2}(y_i - ax_i - b)\textcolor{blue}{(-x_i)} = \textcolor{blue}{-}\textcolor{red}{2}\textcolor{blue}{x_i}(y_i - ax_i - b)$$
Step 7: Repeat for Regularization Term
$$\frac{\partial}{\partial a} \left( \lambda a^{\textcolor{red}{2}} \right) = \textcolor{red}{2}\lambda a$$
Step 8: Combine Terms
Combine the derivative of the squared term and the regularization term:
$$\frac{\partial \mathcal{L}}{\partial a} = -2 \sum_{i=1}^{n} x_i(y_i - (ax_i + b)) + 2 \lambda a$$
Partial derivative with respect to $b$
The only difference will be in Step 5 where
$$\textcolor{blue}{\frac{\partial}{\partial b} (y_i - ax_i - b) = (\xcancel{y_i} - \xcancel{ax_i} - 1b^{1-1}) = -1}$$
and Step 7 where the partial derivative will be 0 since it does not depend on $b$. Combine the derivative of the squared term and the regularization term:
$$\frac{\partial \mathcal{L}}{\partial b} = -2 \sum_{i=1}^{n} (y_i - (ax_i + b))$$
Final partial derivatives and the gradient
We now have our partial derivatives and we can go ahead and train our first model:
$$\frac{\partial \mathcal{L}}{\partial a} = -2 \sum_{i=1}^{n} x_i(y_i - (ax_i + b)) + 2 \lambda a$$
$$\frac{\partial \mathcal{L}}{\partial b} = -2 \sum_{i=1}^{n} (y_i - (ax_i + b))$$
These partial derivatives provide the gradients of the loss function with respect to the parameters $a$ and $b$, which are used in the optimization process (e.g., gradient descent) to update the parameters during training. These partial derivatives thus form the gradient for the loss function:
$$\nabla \mathcal{L} = \left[ \begin{array}{l}-2 \sum_{i=1}^{n} x_i(y_i - (ax_i + b)) + 2 \lambda a \\ -2 \sum_{i=1}^{n} (y_i - (ax_i + b)) \end{array} \right]$$
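A quick way to sanity-check these formulas is to compare them against numerical finite differences on some made-up data; this is just a verification sketch, not part of the derivation:
import numpy as np
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=20)
y = 6 * x + 10 + rng.normal(size=20)
a, b, lam = 2.0, 1.0, 0.1
def loss(a, b):
    return np.sum((y - (a * x + b)) ** 2) + lam * a**2
# Analytical gradient from the derivation above
grad_a = -2 * np.sum(x * (y - (a * x + b))) + 2 * lam * a
grad_b = -2 * np.sum(y - (a * x + b))
# Numerical gradient via central finite differences
eps = 1e-6
num_a = (loss(a + eps, b) - loss(a - eps, b)) / (2 * eps)
num_b = (loss(a, b + eps) - loss(a, b - eps)) / (2 * eps)
print(grad_a, num_a)  # The two values should agree to several decimals
print(grad_b, num_b)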
Closed-form solution
In the previous chapter, we very briefly discussed the closed-form solution for finding the optimal coefficients for a linear model. As a reminder, this equation is called the Normal Equation and has the form:
$$\mathbf{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
where:
- $\mathbf{\beta}$ is the vector of regression coefficients.
- $\mathbf{X}$ is the design matrix of independent variables.
- $\mathbf{y}$ is the vector of observed dependent variable values.
We haven't derived this equation so far (except accidentally with Newton's method) because it's not really important for understanding. Let's modify the equation as-is to develop our intuition first. At the end of this section, we will derive the Normal Equation for the very curious of you.
It turns out that we can modify the Normal Equation to incorporate the penalized term directly. We can walk through the process of incorporating the penalty to find the closed-form solution for Ridge Regression. Let's begin with explicitly writing out $\mathbf{X}^T \mathbf{X}$:
$$\mathbf{X}^T \mathbf{X} = \begin{pmatrix} \sum_{i=1}^{n} x_{i1}^2 & \sum_{i=1}^{n} x_{i1}x_{i2} & \cdots & \sum_{i=1}^{n} x_{i1}x_{ip} \\ \sum_{i=1}^{n} x_{i2}x_{i1} & \sum_{i=1}^{n} x_{i2}^2 & \cdots & \sum_{i=1}^{n} x_{i2}x_{ip} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^{n} x_{ip}x_{i1} & \sum_{i=1}^{n} x_{ip}x_{i2} & \cdots & \sum_{i=1}^{n} x_{ip}^2 \end{pmatrix}$$
By explicitly writing out the matrix, we see that each diagonal element (e.g. $x_1 x_1 = x_1^2$, $x_2 x_2 = x_2^2$, and so on) involves a single feature multiplied with itself, while the off-diagonal elements (e.g. $x_1 x_2$ and $x_1 x_3$) represent the interactions between different features. Including regularization terms for these off-diagonal elements would impose penalties on the interactions, potentially altering the relationships between features, which is typically not desired. Therefore, penalties are only added to the diagonal terms. By only penalizing the diagonal elements of $\mathbf{X}^T \mathbf{X}$, we enforce regularization by shrinking the coefficients towards zero while preserving the relationships between features. This allows penalized regression to effectively mitigate overfitting without fundamentally altering the structure of the model.
To add a penalty only to the diagonal, we convert the scalar penalty term $\lambda$ into a matrix by multiplying it with the identity matrix ($\mathbf{I}$), which has zeroes everywhere except on the diagonal. This operation results in a diagonal matrix where every diagonal element equals $\lambda$:
$$\textcolor{red}{\lambda} \textcolor{blue}{\mathbf{I}} = \textcolor{red}{\lambda} \overbrace{\begin{pmatrix} \textcolor{blue}{1} & 0 & \cdots & 0 \\ 0 & \textcolor{blue}{1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \textcolor{blue}{1} \end{pmatrix}}^{\textcolor{blue}{\text{Identity matrix}}} = \begin{pmatrix} \textcolor{blue}{1}\textcolor{red}{\lambda} & 0 & \cdots & 0 \\ 0 & \textcolor{blue}{1}\textcolor{red}{\lambda} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \textcolor{blue}{1}\textcolor{red}{\lambda} \end{pmatrix}$$
When multiplying a matrix or vector with a scalar it means that everything in the matrix is multiplied with the scalar. We can then use this matrix representation of $\lambda$ by simply adding it to $\mathbf{X}^T \mathbf{X}$:
$$\mathbf{X}^T \mathbf{X} + \textcolor{red}{\lambda} \mathbf{I} = \begin{pmatrix} \sum_{i=1}^{n} x_{i1}^2 + \textcolor{red}{\lambda} & \sum_{i=1}^{n} x_{i1}x_{i2} & \cdots & \sum_{i=1}^{n} x_{i1}x_{ip} \\ \sum_{i=1}^{n} x_{i2}x_{i1} & \sum_{i=1}^{n} x_{i2}^2 + \textcolor{red}{\lambda} & \cdots & \sum_{i=1}^{n} x_{i2}x_{ip} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^{n} x_{ip}x_{i1} & \sum_{i=1}^{n} x_{ip}x_{i2} & \cdots & \sum_{i=1}^{n} x_{ip}^2 + \textcolor{red}{\lambda} \end{pmatrix}$$
Finally, we plug this modified matrix back into the Normal Equation to obtain the closed-form solution for Ridge Regression:
$$\mathbf{\beta} = (\underbrace{\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I}}_{\text{Added penalty}})^{-1} \mathbf{X}^T \mathbf{y}$$
As with the Normal Equation, this modified equation gives us the optimal coefficients directly without optimization. This is helpful as we can compare how our gradient descent optimization performs compared to the best possible solution.
Note that we don't usually penalize the bias. If the design matrix $\mathbf{X}$ contains a bias feature (a column of all ones), we can modify the identity matrix by setting the first cell to 0 to end up with:
$$\begin{pmatrix} \sum_{i=1}^{n} x_{i1}^2 + \textcolor{blue}{0}\textcolor{red}{\lambda} & \sum_{i=1}^{n} x_{i1}x_{i2} & \cdots & \sum_{i=1}^{n} x_{i1}x_{ip} \\ \sum_{i=1}^{n} x_{i2}x_{i1} & \sum_{i=1}^{n} x_{i2}^2 + \textcolor{blue}{1}\textcolor{red}{\lambda} & \cdots & \sum_{i=1}^{n} x_{i2}x_{ip} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^{n} x_{ip}x_{i1} & \sum_{i=1}^{n} x_{ip}x_{i2} & \cdots & \sum_{i=1}^{n} x_{ip}^2 + \textcolor{blue}{1}\textcolor{red}{\lambda} \end{pmatrix}$$
Optional: Deriving the Normal Equation
This section is entirely optional and is intended only as an opportunity to practice computing derivatives in vector form. The Normal Equation can be derived for the general linear regression model $y = \mathbf{X}\beta$ by finding the values of $\beta$ that minimize the cost function, defined here as the mean of the squared residuals. We can write the cost function in vector form as:
$$J(\beta) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \sum_{j=1}^{p} x_{ij}\beta_j)^2 = \frac{1}{n} \underbrace{(\vec{y} - \mathbf{X}\vec{\beta})^T(\vec{y} - \mathbf{X}\vec{\beta})}_{\text{Vector form}}$$
where $\vec{y}$ is a vector, $\mathbf{X}$ is a matrix, and $\vec{\beta}$ is a vector. The vector arrows on top of $y$ and $\beta$ were included only for clarity and will be dropped from the remaining calculations.
Expand the expression: Expand the expression using matrix algebra and the transpose property $(\mathbf{X}\beta)^T = \beta^T\mathbf{X}^T$, and distribute the terms:
$$\begin{align*}(y - \mathbf{X}\beta)^T(y - \mathbf{X}\beta) &= \overbrace{(y^T - \beta^T\mathbf{X}^T)}^{\text{Expand transpose}}(y - \mathbf{X}\beta) \\ &= \underbrace{y^Ty - y^T\mathbf{X}\beta - \beta^T\mathbf{X}^Ty + \beta^T\mathbf{X}^T\mathbf{X}\beta}_{\text{Expand terms}}\end{align*}$$
Notice that $y^T\mathbf{X}\beta$ is a scalar, so it is equal to its own transpose $\beta^T\mathbf{X}^Ty$, which lets us combine the two middle terms:
$$\begin{align*}y^Ty - \textcolor{red}{\overbrace{y^T\mathbf{X}\beta - \beta^T\mathbf{X}^Ty}^{y^T\mathbf{X}\beta \Leftrightarrow \beta^T\mathbf{X}^Ty}} + \beta^T\mathbf{X}^T\mathbf{X}\beta &= y^Ty - \textcolor{red}{2\beta^T\mathbf{X}^Ty} + \beta^T\mathbf{X}^T\mathbf{X}\beta \end{align*}$$
Take the derivative: Compute the derivative of the function with respect to $\beta$.
$$\frac{\partial J(\beta)}{\partial \beta} = \xcancel{y^Ty} - 2\mathbf{X}^Ty + \underbrace{\frac{\partial}{\partial \beta}\left(\beta^T\mathbf{X}^T\mathbf{X}\beta\right)}_{\text{Product rule in differentiation}}$$
The remaining term can be solved using the product rule in differentiation, which says that you take the derivative of $g(\beta)$ multiplied by $h(\beta)$, and add $g(\beta)$ multiplied by the derivative of $h(\beta)$. We split the term so that each factor contains one $\beta$: one factor is $\beta^T$ and the other is everything that remains:
$$\frac{\partial}{\partial \beta}\left(\beta^T\mathbf{X}^T\mathbf{X}\beta\right) = \frac{\partial}{\partial \beta}\left(g(\beta)h(\beta)\right), \begin{cases}g(\beta) = \beta^T \\ h(\beta) = \mathbf{X}^T\mathbf{X}\beta \end{cases}$$
Solving using the product rule, and using the fact that $\mathbf{X}^T\mathbf{X}$ is symmetric:
$$\frac{\partial}{\partial \beta}\left(\beta^T\mathbf{X}^T\mathbf{X}\beta\right) = \underbrace{\mathbf{X}^T\mathbf{X}\beta + (\mathbf{X}^T\mathbf{X})^T\beta}_{(\mathbf{X}^T\mathbf{X})^T = \mathbf{X}^T\mathbf{X}} = 2\mathbf{X}^T\mathbf{X}\beta$$
Going back to calculating the partial derivative where we left off:
$$\begin{align*}\frac{\partial J(\beta)}{\partial \beta} &= \xcancel{y^Ty} - 2\mathbf{X}^Ty + \underbrace{\frac{\partial}{\partial \beta}\left(\beta^T\mathbf{X}^T\mathbf{X}\beta\right)}_{\text{Product rule in differentiation}} \\ &= -2\mathbf{X}^Ty + 2\mathbf{X}^T\mathbf{X}\beta\end{align*}$$
Setting the derivative to 0 and rearranging to solve for $\beta$:
$$\begin{align*}\textcolor{red}{-2\mathbf{X}^Ty} + 2\mathbf{X}^T\mathbf{X}\beta &= 0 \\ \textcolor{blue}{\xcancel{2}\mathbf{X}^T\mathbf{X}}\beta &= \textcolor{red}{\xcancel{2}\mathbf{X}^Ty} \\ \beta &= (\textcolor{blue}{\mathbf{X}^T\mathbf{X}})^{-1}\textcolor{red}{\mathbf{X}^Ty} \end{align*}$$
We end up with:
$$\boxed{\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Ty}$$
which is the Normal Equation! It gives the values of $\beta$ that minimize the cost function, and therefore provides the best fit to the data.
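As a quick sanity check, we can verify the result numerically against NumPy's least-squares solver on some random data (a sketch; the numbers are arbitrary):
import numpy as np
rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 3))]  # Design matrix with a bias column
y = rng.normal(size=50)
beta_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_normal_eq, beta_lstsq))    # True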
Training a Ridge Regression model from scratch
Now that we've established the key components for training a Ridge Regression model:
- A model definition utilizing the sum of squared residuals (L2 loss)
- The gradient necessary for optimizing the model via gradient descent
- The closed-form solution to derive the optimal parameters $\beta$
Let's evaluate the performance of our machine learning model by comparing it to the optimal solution obtained from the closed-form solution. We'll accomplish this by training a Ridge Regression model with $\lambda = 0.1$ using mini-batch stochastic gradient descent (SGD) on the underlying distribution $y = 6x + 10$. Here's our approach:
- Generate synthetic data from the true distribution and introduce random noise.
- Incorporate the intercept term by adding a vector of 1's.
- Adjust the first cell of the identity matrix to 0.
- Compute the optimal coefficients using the closed-form solution.
- Optimize our Ridge Regression model over 1,000 iterations using mini-batches of size 5.
This approach will enable us to assess the performance of our Ridge Regression model in approximating the true relationship between the features and the target variable.
Caution 1: Note that in ridge regression, the penalty term is typically added to the loss function after the regression term has been computed across all available data. However, when training with mini-batches of 5 training examples (or any other number), we only want to incorporate a fraction of the penalty into each mini-batch update. As the partial derivative of the penalty with respect to a non-intercept coefficient is $2 \lambda \beta$, we adjust this term to $\frac{2}{\text{batch size}} \lambda \beta$ to scale the penalty appropriately for mini-batch SGD. No further adjustments are necessary.
Caution 2: In ridge regression, the intercept term is typically not penalized, as penalizing it might lead to undesirable consequences. Therefore, when implementing ridge regression with mini-batched stochastic gradient descent (SGD), it's essential to update the intercept separately from the other coefficients. To achieve this, it's important to ensure that the input to the `ridge_regression_sgd` function does not include the intercept feature. In other words, the design matrix passed to the function should exclude the column representing the intercept term. By doing so, we ensure that the intercept is updated correctly using its own partial derivative during the optimization process, while the regularization penalty is applied only to the other coefficients.
import numpy as np
import matplotlib.pyplot as plt
# Generate synthetic data
np.random.seed(0)
# Random values between 0 and 5
X = 5 * np.random.rand(100, 1)
# Underlying model 10 + 6x plus random noise
y = 10 + 6 * X + np.random.randn(100, 1)
# Add bias term to X
X_b = np.c_[np.ones((100, 1)), X]
# Compute closed-form solution for Ridge Regression
lambda_ = 0.1 # Regularization parameter
m, n = X_b.shape # Number of rows and columns of the design matrix X
I = np.eye(n) # Identity matrix
# We do not want to penalize the intercept (bias) term so we set the first
# cell in the identiy matrix to 0
I[0][0] = 0
# Equation for closed-form solution
theta_closed_form = np.linalg.inv(X_b.T.dot(X_b) + lambda_ * I).dot(X_b.T).dot(y)
# Optimal loss level according to closed-form solution
optimal_loss = np.mean((y - X_b.dot(theta_closed_form))**2) + lambda_ * (theta_closed_form[1][0]**2)
# Define function for computing gradients
def compute_gradients(X, y, a, b, lambda_, batch_size):
# Non-intercept coefficients
gradients_a = -2/batch_size * X.T.dot(y - (a * X + b)) + 2/batch_size * lambda_ * a
# Intercept coefficient
gradient_b = -2/batch_size * np.sum(y - (a * X + b))
return gradients_a, gradient_b
# Train Ridge Regression model with Stochastic Gradient Descent
def ridge_regression_sgd(X, y, alpha, lambda_, n_iterations, batch_size):
a = np.random.randn(1) # Random initialization of parameter a
b = np.random.randn(1) # Random initialization of parameter b
losses = [] # Keep track of loss as we progress
estimates_a = [] # Keep track of the estimate of a as we progress
estimates_b = [] # Keep track of the estimate of b as we progress
    m = len(X)
    iterations_per_epoch = m // batch_size  # Number of mini-batch updates per pass over the data (epoch)
    for iteration in range(n_iterations):
        # Shuffle the training examples at the start of every epoch so that the
        # sequential mini-batches below are drawn in a random order
        if iteration % iterations_per_epoch == 0:
            shuffled_idx = np.random.permutation(m)
            X, y = X[shuffled_idx], y[shuffled_idx]
# Mini-batch offset
idx_from, idx_to = (batch_size*iteration % m), (batch_size*iteration % m) + batch_size
X_batch = X[idx_from:idx_to]
y_batch = y[idx_from:idx_to]
# Compute the gradients
gradients_a, gradient_b = compute_gradients(X_batch, y_batch, a, b, lambda_, batch_size)
# Update step
a = a - alpha * gradients_a # Update parameter a
b = b - alpha * gradient_b # Update parameter b
# Compute loss at each iteration
loss = np.mean((y - (a * X + b))**2) + lambda_ * (a**2)
# Store the current loss and estimates of a and b
losses.append(loss)
estimates_a.append(a[0])
estimates_b.append(b[0])
return a, b, losses, estimates_a, estimates_b
# Hyperparameters
alpha = 0.01 # Learning rate
batch_size = 5 # Mini-batch size
n_iterations = 1000 # Number of iterations (update steps)
# Train Ridge Regression model
a_sgd, b_sgd, losses, estimates_a, estimates_b = ridge_regression_sgd(X, y, alpha, lambda_, n_iterations, batch_size)
# Plot the data and the fitted lines
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# First subplot: Data and Fitted lines
axes[0].scatter(X, y, label='Data')
axes[0].plot(X, X_b.dot(np.array([b_sgd[0], a_sgd[0].squeeze()])), color='red', label='SGD')
axes[0].plot(X, X_b.dot(theta_closed_form), color='blue', label='Closed-form')
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].set_title('Ridge Regression Comparison: SGD vs. Closed-form')
axes[0].legend()
# Second subplot: Loss over Iterations
axes[1].plot(np.arange(len(losses)) * batch_size, np.array(losses).squeeze())
axes[1].axhline(y=optimal_loss, color='green', linestyle='--', label='Optimal Loss (Closed-form)')
axes[1].set_xlabel('Training examples')
axes[1].set_ylabel('Loss')
axes[1].set_yscale('log')
axes[1].set_title('Loss over Training Examples')
axes[1].legend()
# Third subplot: Parameter Estimation over Iterations
axes[2].plot(np.arange(len(losses)) * batch_size, estimates_a, label='Estimate of a',color='blue')
axes[2].plot(np.arange(len(losses)) * batch_size, estimates_b, label='Estimate of b',color='orange')
axes[2].axhline(y=theta_closed_form[0][0], color='orange', linestyle='--', label='Optimal Intercept (Closed-form)')
axes[2].axhline(y=theta_closed_form[1][0], color='blue', linestyle='--', label='Optimal Coefficient (Closed-form)')
axes[2].set_xlabel('Training examples')
axes[2].set_ylabel('Parameter Value')
axes[2].set_title('Parameter Estimation over Training Examples')
axes[2].legend()
plt.tight_layout()
plt.show()
# Print the parameters
print("SGD - Intercept (b):", b_sgd[0])
print("SGD - Coefficient (a):", a_sgd[0])
print("Closed-form - Intercept (b):", theta_closed_form[0][0])
print("Closed-form - Coefficient (a):", theta_closed_form[1][0])
Parameter | SGD | Closed-form |
---|---|---|
Intercept | 10.267 | 10.229 |
Coefficient | 5.890 | 5.985 |
We see that our model trained with SGD, using the gradient derived above, is indeed converging on the optimal parameter values provided by the closed-form solution!
When we visualize the parameters $a$ and $b$ as coordinates $(x, y)$ and the loss function as the Z-axis, we obtain a 3D loss surface. Initializing our initial parameter guesses as $a=-10$ and $b=-10$, represented as a red ball, and then tracking the parameters as gradient descent progresses, we observe the ball rolling downhill until it reaches a stationary point. Depending on the complexity of the loss surface, this point could either be the optimal global solution or a locally optimal solution.
Excellent! We've successfully trained a Ridge Regression model using mini-batch stochastic gradient descent! While the examples presented so far may seem straightforward and boring, it's crucial to understand that the principles we've applied can be extended to tackle more intricate, real-world scenarios involving multiple variables and noisy data. We'll delve into a more realistic and complex real-world example shortly to illustrate the practical applications of these techniques.
Introducing the Lagrangian form and the Lagrange multiplier
Thus far we have discussed penalized regression simply as linear regression plus a penalty term. Although this is factually correct, it does not describe in a general sense why this works or what it really means. Let's take a step back. Suppose you have the function $f(x, y) = 3x + 2y$ that you want to optimize, but you want to limit the inputs $(x,y)$ such that $x^2 + y^2 = 1$. In other words, we want to find the point(s) on the constraint path defined by $g(x,y) = x^2 + y^2 = 1$ where $f(x,y)$ is smallest and largest. This is an example of a constrained optimization problem, where $x^2 + y^2 = 1$ is referred to as the constraint.
In our example, the feasible points $(x, y)$ lie on a circle centered at the origin. The circular shape arises because the sum of squares of $x$ and $y$ must be constant, since $x^2 + y^2 = 1$ in this case, leading to a constraint region whose boundary forms a circle. Here are a few example points:
- $x = \sqrt{0.5} = 0.707$, $y = \sqrt{0.5} = 0.707$: $\sqrt{0.5}^2 + \sqrt{0.5}^2 = 1$
- $x = 1$, $y = 0$: $1^2 + 0^2 = 1$
- $x = 0$, $y = 1$: $0^2 + 1^2 = 1$
- $x = \sqrt{0.5}$, $y = -\sqrt{0.5}$: $\sqrt{0.5}^2 + (-\sqrt{0.5})^2 = 1$
To solve this constrained optimization problem, we essentially want to walk along the constraint path $g(x,y)$ and find the points with the smallest and largest $f(x,y)$. But it's neither very accurate nor efficient to walk along this path while evaluating $f(x,y)$, especially in higher dimensions.
So then how can we actually include this constraint into the optimization of our model? Here's where the Lagrangian comes into play. In the context of optimization problems, the Lagrangian form refers to a formulation that combines the objective function (to be minimized) with the constraints using Lagrange multipliers. It is named after the French mathematician Joseph-Louis Lagrange, who introduced the method.
Formal definition of the Lagrangian
Let's briefly describe the formalities of the Lagrangian function $\mathcal{L}(x,y, \lambda)$ before we focus on the intuition (feel free to jump back and forth with the next section as necessary). The Lagrangian is defined as:
$$\mathcal{L}(x, y, \lambda) = f(x, y) - \lambda(g(x, y) - c)$$
where $g(x, y)$ represents the constraint function ($x^2 + y^2$ in our example case), and $c$ is the value it's constrained to (in this case, $c = 1$). These Lagrange multipliers are introduced to enforce the constraints while optimizing the objective function. In our example case, we have:
$$ \mathcal{L}(x,y,\lambda) = \underbrace{3x + 2y}_{f(x,y)} - \lambda(\underbrace{x^2 + y^2}_{g(x,y)} - 1)$$
and we want to find the "critical points" which refers to the points in the domain of the Lagrangian function where its gradient is zero with respect to both the decision variables and the Lagrange multiplier $\lambda$:
$$\nabla \mathcal{L}(x, y, \lambda) = \vec{0}$$
which means that at a critical point $(x_0, y_0)$, the gradient vectors $\nabla f(x_0, y_0)$ and $\nabla g(x_0, y_0)$ point in the same direction. They do not necessarily have the same magnitude, but we can scale one by $\lambda_0$ to match the magnitude of the other:
$$\nabla f(x_0, y_0) = \lambda_0 \nabla g(x_0, y_0)$$
Mental model and intuition
As a simple mental model, imagine you are walking along a path defined by a rule (constraint), like a hiking trail (in our case, around a circle). As you walk, you are carrying a flag that always points in the steepest uphill direction (the gradient). If this flag has any component pointing along the trail, it means you can keep walking and reach higher (or lower) ground. You'll keep walking until the flag points perpendicular to the trail.
So, when you reach that point where the flag ($\nabla f$) is perpendicular to the path (constraint), it's like finding a spot where you're neither climbing nor descending, a sort of peak or valley. In math terms, this means that at that specific point $(x_0,y_0)$, the gradient of $f$, $\nabla f(x_0,y_0)$, is directly proportional to the gradient of the path, $\nabla g(x_0,y_0)$. The two gradients may not have exactly the same magnitude so they are multiplied (scaled) by a constant value simply called the Lagrange multiplier, denoted as $\lambda_0$.
In simpler terms, at that point, it's as if the direction in which $f$ is changing matches perfectly with the direction of the path, but the speed of change is adjusted by the Lagrange multiplier to make sure everything fits together smoothly. This alignment ensures that you're at a spot where $f$ is neither increasing nor decreasing as you move along the path – it's like a balance point where everything lines up just right.
Solving the example $f(x, y) = 3x + 2y$ and $g(x,y) = x^2 + y^2 - 1$
As we just discussed, in order to solve this optimization problem and find the critical points we need to determine the gradient and set it to 0:
$$\nabla \mathcal{L}(x, y, \lambda) = \vec{0}$$
where the gradient $\nabla \mathcal{L}$ is the vector of all the partial derivatives of $\mathcal{L}$ with respect to $x$, $y$, and $\lambda$. Let's compute them one by one:
Partial derivative with respect to $x$:
$$\textcolor{red}{\begin{align*}\frac{\partial \mathcal{L}}{\partial x} &= 1*3x^{1-1} + \xcancel{2y} - \lambda\frac{\partial}{\partial x}(x^2 + y^2 - 1) \\ &= 3 - \lambda(2x^{2-1} + \xcancel{y^2} - \xcancel{-1}) \\ &= 3 - 2\lambda x\end{align*}}$$
Partial derivative with respect to $y$:
$$\textcolor{green}{\begin{align*}\frac{\partial \mathcal{L}}{\partial y} &= \xcancel{3x} + 1*2y^{1-1} - \lambda\frac{\partial}{\partial y}(x^2 + y^2 - 1) \\ &= 2 - \lambda(\xcancel{2x^2} + 2*y^{2-1} - \xcancel{-1}) \\ &= 2 - 2\lambda y\end{align*}}$$
Partial derivative with respect to $\lambda$:
$$\textcolor{blue}{\begin{align*}\frac{\partial \mathcal{L}}{\partial \lambda} &= \xcancel{3x} + \xcancel{2y} - \frac{\partial}{\partial \lambda}\left(\lambda(x^2 + y^2 - 1)\right) \\ &= -(x^2 + y^2 - 1)\end{align*}}$$
We have now computed all the partial derivatives for $\mathcal{L}(x,y,\lambda)$ and now want to find the critical points by setting the gradient to 0 ($\nabla \mathcal{L}(x, y, \lambda) = \vec{0}$).
$$\begin{aligned} \nabla \mathcal{L}(x, y, \lambda) = \left[ \begin{array}{c} \textcolor{red}{\dfrac{\partial}{\partial x}} \\ \textcolor{green}{\dfrac{\partial}{\partial y}} \\ \textcolor{blue}{\dfrac{\partial}{\partial \lambda}} \end{array} \right] = \left[ \begin{array}{c} \textcolor{red}{3 - 2\lambda x} \\ \textcolor{green}{2 - 2\lambda y} \\ \textcolor{blue}{-(x^2 + y^2 - 1)} \end{array} \right] = \underbrace{\left[ \begin{array}{c} 0 \\ 0 \\ 0 \end{array} \right]}_{\vec{0}} \end{aligned}$$
From the partial derivatives of $x$ and $y$, if we solve for $\lambda$ we end up with:
$$\lambda = \textcolor{red}{\frac{3}{2x}} = \textcolor{green}{\frac{1}{y}}$$
Combining these and solving for $y$ gives us:
$$\textcolor{red}{\frac{3}{2x}} = \textcolor{green}{\frac{1}{y}} \implies \textcolor{red}{\left(\frac{3}{2x}\right)^{-1}} = \textcolor{green}{\left( \frac{1}{y}\right)^{-1}} \implies \textcolor{green}{y} = \textcolor{red}{\frac{2}{3}x}$$
Substitute $y = \frac{2}{3}x$ into the constraint equation $x^2 + y^2 = 1$:
$$\begin{align*}x^2 + \left(\frac{2}{3}x\right)^2 &= 1 \\ x^2 + \frac{4}{9}x^2 &= 1 \\ \frac{13}{9}x^2 &= 1 \\ x^2 &= \frac{9}{13} \\ x &= \pm \frac{3}{\sqrt{13}}\end{align*}$$
If $x = \frac{3}{\sqrt{13}}$, then $y = \frac{2}{3} \cdot \frac{3}{\sqrt{13}} = \frac{6}{3 \sqrt{13}} =\frac{2}{\sqrt{13}}$.
If $x = -\frac{3}{\sqrt{13}}$, then $y = \frac{2}{3} \cdot -\frac{3}{\sqrt{13}} = -\frac{6}{3 \sqrt{13}} =-\frac{2}{\sqrt{13}}$
So, we have two critical points: $(\frac{3}{\sqrt{13}}, \frac{2}{\sqrt{13}})$ and $(-\frac{3}{\sqrt{13}}, -\frac{2}{\sqrt{13}})$.
These critical points correspond to the maximum and minimum values of $f(x, y)$ subject to the given constraint. To determine which one is the maximum and which one is the minimum, we need to evaluate $f(x, y)$ at these points and compare the values. We find that:
$$f\left(\frac{3}{\sqrt{13}}, \frac{2}{\sqrt{13}}\right) = 3 \left( \frac{3}{\sqrt{13}} \right) + 2 \left( \frac{2}{\sqrt{13}} \right) = \sqrt{13} \gt f\left(-\frac{3}{\sqrt{13}}, -\frac{2}{\sqrt{13}}\right) = 3 \left( -\frac{3}{\sqrt{13}} \right) + 2 \left( -\frac{2}{\sqrt{13}} \right) = -\sqrt{13}$$
If we plot these two critical points together with $f(x,y)$ and $g(x,y)$, we can see that these points indeed appear to be the points on the circle where $f(x,y)$ is the largest and smallest.
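If you prefer a numerical double-check of these critical points, here is a minimal sketch using `scipy.optimize.minimize` (assuming SciPy is installed) with the circle as an equality constraint:
import numpy as np
from scipy.optimize import minimize
f = lambda v: 3 * v[0] + 2 * v[1]
circle = {"type": "eq", "fun": lambda v: v[0]**2 + v[1]**2 - 1}
# Minimize f (and -f, i.e. maximize f) subject to x^2 + y^2 = 1
vmin = minimize(f, x0=[0.5, 0.5], constraints=[circle], method="SLSQP")
vmax = minimize(lambda v: -f(v), x0=[0.5, 0.5], constraints=[circle], method="SLSQP")
print(vmin.x)  # Close to (-3/sqrt(13), -2/sqrt(13))
print(vmax.x)  # Close to ( 3/sqrt(13),  2/sqrt(13))
print(3 / np.sqrt(13), 2 / np.sqrt(13))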
If you found that the linear example $f(x,y) = 3x + 2y$ was a bit too boring or unrealistic, here are the critical points for the saddle function $f(x,y) = x^2 - y^2$ and the monkey saddle function $f(x,y) = x^3 - xy^2$. I leave the derivation of the gradient up to you, the reader — it should be very easy for you at this point!
Geometric Insights: Interpreting the Lagrangian
Another formal approach to understanding the Lagrangian involves considering the relationship between gradient vectors and contour lines. Specifically, at any point $(x,y)$ on a contour line, the gradient vector points perpendicular to that line. Consequently, when the contour lines of two functions intersect, their gradient vectors are parallel. This concept is illustrated through examples such as the Saddle and Monkey Saddle functions:
Connecting Constrained Optimization to Penalized Regression
So what has the Lagrangian in constrained optimization to do with penalized regression? Let's explore this connection by examining the general forms of both approaches. The Lagrangian in constrained optimization typically takes the form:
$$\mathcal{L}(x_1, x_2, \ldots, x_n, \lambda) = f(x_1, x_2, \ldots, x_n) + \lambda (g(x_1, x_2, \ldots, x_n) - c)$$
Interestingly, this bears a striking resemblance to the general form of penalized regression:
$$\mathcal{L}(\beta) = \underbrace{f(\beta)}_{\text{Fitting term}} + \underbrace{\lambda (g(\beta))}_{\text{Penalty term}}$$
where we aim to minimize the following objective function:
$$\sum_{i=1}^N { ( y_i - (\beta_0 + \sum_{j=1}^p x_{ij} \beta_j) )^2 } + \lambda \sum_{j=1}^p { \beta_j^2 }$$
We can interpret the penalized optimization problem as a constrained optimization problem, where we seek to minimize $f(\beta)$ subject to the constraint:
$$\sum_{j=1}^p \beta_j^2 \leq t$$
By drawing parallels between these two optimization approaches, we can see that there is a one-to-one correspondence between $t$ and $\lambda$.
Changing the penalty to an L1 term
So far, we have modelled the penalty as the sum of squares of the coefficients, $\lambda \sum_{j=1}^{p} \beta_j^2$. By changing the penalty from the sum of the squared coefficients to the sum of the absolute coefficients, we end up with Lasso Regression:
$$\frac{1}{N} \sum_{i=1}^{N} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \underbrace{\sum_{j=1}^{p} |\beta_j|}_{\text{L1 penalty}}$$
which is otherwise identical to Ridge Regression. Even though the models are very similar they exhibit different behaviour.
Visualizing the L1 constraint region
For L1 regularization, the constraint region is defined by the sum of the absolute values of the parameters ($|a| + |b| \leq t$, here with $t = 1$).
Suppose we have the following combinations of $a$ and $b$:
- $a = 0.5$, $b = 0.5$: $|0.5| + |0.5| = 1$
- $a = 1$, $b = 0$: $|1| + |0| = 1$
- $a = 0$, $b = 1$: $|0| + |1| = 1$
- $a = 0.5$, $b = -0.5$: $|0.5| + |-0.5| = 1$
These points lie on the boundary of a diamond-shaped constraint region centered at the origin. The diamond shape arises because the sum of absolute values of $a$ and $b$ must be constant ($t = 1$ in this case), leading to a constraint region that forms a diamond.
The L1 penalty promotes sparser solutions
One consequence of the difference in penalty type is that Lasso Regression tends to yield sparse solutions, meaning it often results in some coefficients being exactly zero. This property arises from the nature of the L1 penalty, which encourages sparsity by "shrinking" coefficients all the way to zero, effectively performing feature selection. In contrast, Ridge Regression tends to shrink coefficients towards zero without necessarily setting them exactly to zero, leading to a more continuous shrinkage of coefficients. Why is this?
If we look at the L1 and L2 penalty surfaces, we can see that the L1 surface has sharp corners and forms "gutters" along the axes, where one of the coefficients is exactly zero. In contrast, the L2 penalty surface is smooth and bowl-shaped, with no corners.
As we crank up the regularization strength by increasing $\lambda$, we see that the contribution (and shape) of either surface grows increasingly extreme. For the L1 penalty, this results in sparser and sparser solutions (more and more coefficients that are exactly 0). For the L2 penalty, increasing $\lambda$ similarly results in more aggressively shrunken coefficients, but they will rarely, if ever, be exactly 0. We will explore this in more detail at the end of this chapter when we look at a real-world example with many features.
Let's look at the same example we used above $y = 20 + 10X_1 - 2X_2$, where one dominant feature $X_1$ is 5 times as important to the model as the other complementary feature $X_2$. For Ridge Regression, the coefficients are shrunken proportional to their importance, however they rarely, if ever, become 0. In contrast, in Lasso Regression, terms will increasingly become forced to 0 in their order of importance — and thus it promotes sparse solutions.
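The sketch below (assuming scikit-learn is installed) illustrates this behaviour on synthetic data drawn from $y = 20 + 10X_1 - 2X_2$. Note that scikit-learn's `alpha` plays the role of our $\lambda$ and its Lasso objective scales the fitting term differently than our formulas, so the exact numbers are not directly comparable, but the qualitative pattern matches the description above:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 20 + 10 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
for alpha in [0.1, 1, 5, 50]:
    # Ridge shrinks both coefficients gradually; Lasso zeroes out the weaker one first
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>5}: ridge={ridge.coef_.round(2)}, lasso={lasso.coef_.round(2)}")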
Deriving the gradient for Lasso Regression
Perhaps unsurprisingly, the only difference will be the partial derivative with respect to the penalty term (Step 7 above) which for L2 is:
$$\frac{\partial}{\partial a} \left( \lambda a^{\textcolor{red}{2}} \right) = \textcolor{red}{2}\lambda a$$
and for L1 is:
$$\frac{\partial}{\partial a} \left( \lambda |a_j| \right) = \lambda \cdot \text{sign}(a)$$
Why the sign function? Away from zero, the derivative of the absolute function $|x|$ with respect to $x$ is exactly the sign function ($\text{sign}(x)$); at $x = 0$ the absolute function is not differentiable. Let's explore why.
The absolute function $|x|$ is defined as follows:
$$|x| = \begin{cases} +x & \text{if } x \geq 0 \ \\ -x & \text{if } x < 0 \end{cases}$$
At $x = 0$, the function $|x|$ is not differentiable because its derivative would have to take on two different values depending on the direction from which $x$ approaches 0. However, at all other points, the gradient of the absolute function $|x|$ with respect to $x$ can be approximated by the sign function $\text{sign}(x)$:
$$\text{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0 \ \\ -1 & \text{if } x < 0 \end{cases}$$
Therefore, to handle the non-differentiability of the absolute function at $x = 0$, the gradient is typically defined as the sign function $\text{sign}(x)$. This ensures that the gradient descent optimization algorithm can still be applied effectively even when the absolute function is involved in the loss function.
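In NumPy this is simply `np.sign`, which we will use in the gradient code below. Note that NumPy returns 0 at exactly $x = 0$, which is also a perfectly valid (sub)gradient choice for $|x|$ at that point:
import numpy as np
print(np.sign(np.array([-3.2, 0.0, 1.7])))  # [-1.  0.  1.]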
Final partial derivatives and the gradient
We now have our partial derivatives and we can go ahead and train the model:
$$\frac{\partial \mathcal{L}}{\partial a} = -2 \sum_{i=1}^{n} x_i(y_i - (ax_i + b)) + \lambda \cdot \text{sign}(a)$$
$$\frac{\partial \mathcal{L}}{\partial b} = -2 \sum_{i=1}^{n} (y_i - (ax_i + b))$$
These partial derivatives provide the gradients of the loss function with respect to the parameters $a$ and $b$, which are used in the optimization process (e.g., gradient descent) to update the parameters during training. These partial derivatives thus form the gradient for the loss function:
$$\nabla \mathcal{L} = \left[ \begin{array}{l}-2 \sum_{i=1}^{n} x_i(y_i - (ax_i + b)) + \lambda \cdot \text{sign}(a) \\ -2 \sum_{i=1}^{n} (y_i - (ax_i + b)) \end{array} \right]$$
Training a Lasso Regression model from scratch
We will train a Lasso Regression model with the exact same data as in the Ridge Regression case above to make sure that our gradient calculations are correct. Once we know that we can trust our calculations for the simple case, we can move on to a real-world example with many features.
Unlike OLS and Ridge Regression, there is no closed-form solution for Lasso Regression: the L1 penalty is not differentiable at zero, so the problem has no direct analytical solution.
Using the code from above, we can simply replace the `compute_gradients` function with:
# Define function for computing gradients
def compute_gradients(X, y, a, b, lambda_, batch_size):
    # Non-intercept coefficient: fitting gradient plus the L1 penalty gradient lambda * sign(a)
    gradients_a = -2/batch_size * X.T.dot(y - (a * X + b)) + 1/batch_size * lambda_ * np.sign(a)
    # Intercept coefficient: not penalized
    gradient_b = -2/batch_size * np.sum(y - (a * X + b))
    return gradients_a, gradient_b
and remove the closed-form calculations.
Real-world example: predicting house prices
Let's move on to a real-world example and practice what we have learned so far by looking at a dataset consisting of 545 homes and their properties and the price they sold for. The dataset is available for free at Kaggle, a data science competition platform owned by Google. You will need a free account to download the dataset.
The dataset is a matrix of 545 rows and 13 columns and can be summarized as follows:
Feature name | Description | Data type |
---|---|---|
price | Price of the house in USD | Floating ($\mathbb{R}^+$) |
area | Area of a house in square feet | Integer ($\mathbb{Z}^+$) |
bedrooms | Number of bedrooms | Integer ($\mathbb{Z}^+$) |
bathrooms | Number of bathrooms | Integer ($\mathbb{Z}^+$) |
stories | Number of stories | Integer ($\mathbb{Z}^+$) |
mainroad | Is the house connected to the main road? | Boolean ($\mathbb{Z} \in [0,1]$) |
guestroom | Does it have a guest room? | Boolean ($\mathbb{Z} \in [0,1]$) |
basement | Does it have a basement? | Boolean ($\mathbb{Z} \in [0,1]$) |
hotwaterheating | Does it have a hot water heater? | Boolean ($\mathbb{Z} \in [0,1]$) |
airconditioning | Does it have air-conditioning? | Boolean ($\mathbb{Z} \in [0,1]$) |
parking | Number of parking spots | Integer ($\mathbb{Z}^+$) |
prefarea | Is the house in a preferred area? | Boolean ($\mathbb{Z} \in [0,1]$) |
furnishingstatus | The furnishing status (furnished/semi-furnished/unfurnished) | Categorical (String) |
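If you want to follow along, the CSV first has to be loaded into a Pandas dataframe. A minimal sketch, assuming the downloaded file is saved locally as `Housing.csv` (the filename is an assumption; adjust the path to wherever you saved it):
import pandas as pd
# Load the Kaggle housing dataset
df = pd.read_csv("Housing.csv")
print(df.shape)  # (545, 13)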
Looking concretely at the first five rows:
price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |
Encoding `no`/`yes` as `0`/`1`
The fields `mainroad`, `guestroom`, `basement`, `hotwaterheating`, `airconditioning`, and `prefarea` are all booleans but are currently encoded as strings in the form of `yes` and `no`. Since we cannot use strings directly, we have to recode `yes` to 1 and `no` to 0. Let's code this from scratch to understand how this is done.
# When given the key 'no' return 0, otherwise if the key is 'yes' return 1
remap = {'no': 0, 'yes': 1}
# An array to store the new recoded results
recoded = []
# Loop over a column (basement in this case) and recode the values
for value in df.basement:
recoded.append(remap[value])
If we look at the first few rows, we can indeed see that we have successfully recoded `yes` to `1` and `no` to `0`.
basement | basement_recoded |
---|---|
no | 0 |
no | 0 |
yes | 1 |
yes | 1 |
yes | 1 |
We will similarly have to recode the remaining fields `mainroad`, `guestroom`, `hotwaterheating`, `airconditioning`, and `prefarea`. Let's wrap our code into a function and explore a more efficient approach to applying this transformation.
def recode_yes_no(series: pd.Series) -> pd.Series:
# When given the key 'no' return 0, otherwise if the key is 'yes' return 1
remap = {'no': 0, 'yes': 1}
# An array to store the new recoded results
recoded = []
    # Loop over the values in the column and recode them
for value in series:
recoded.append(remap[value])
# Return the recoded array as a Pandas Series object
return pd.Series(recoded)
Pandas dataframes have the neat method `apply` that, as the name suggests, applies a function to each column. We select the columns of interest and apply our transformation:
# Specify the Boolean columns
target_columns = ['basement', 'mainroad', 'guestroom', 'hotwaterheating', 'airconditioning', 'prefarea']
# Use `apply` to apply our function to each column.
df[target_columns] = df[target_columns].apply(recode_yes_no)
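As a side note, the same recoding can be written as a one-liner with the built-in `replace` method, which maps every `'no'`/`'yes'` cell in the selected columns (a sketch equivalent to the `apply` approach above):
# Equivalent one-liner using DataFrame.replace
df[target_columns] = df[target_columns].replace({'no': 0, 'yes': 1})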
One-hot encode categorical values
The final column `furnishingstatus` is a categorical field with three possible answers: `furnished`, `semi-furnished`, and `unfurnished`. How do we model this categorical value for use in our regression model? We transform categorical values into new features corresponding to each of the possible answers. In this case, there are three possible answers, so we will create three new features called `furnished`, `semi-furnished`, and `unfurnished` and set a 1 in the matching column.
Categorical value | Furnished | Semi-furnished | Unfurnished |
---|---|---|---|
Furnished | 1 | 0 | 0 |
Semi-furnished | 0 | 1 | 0 |
Unfurnished | 0 | 0 | 1 |
This technique is known as one-hot encoding. In the current case, the three categories are exhaustive: every house is furnished, semi-furnished, or unfurnished, so all three columns can never be 0 and exactly one of them is always 1. In cases like this, it is customary to remove one of the features, since it is fully determined by the other two (keeping all three introduces a redundant, perfectly collinear column). However, we will keep all three features in this example because we want to see the Lasso Regression model address this redundancy automatically. So we end up with 14 features in the design matrix.
Let's code this from scratch to understand how this is done. We will use a `set`, which behaves just like a set in mathematics and keeps track of the unique values we encounter as we loop over `furnishingstatus`.
# A set is a data structure (and mathematical concept) that
# contains unique values.
keys = set()
# Loop over all the values in a column and add the value to the set
for value in df.furnishingstatus:
    keys.add(value)
`keys` now corresponds to the set `{'unfurnished', 'furnished', 'semi-furnished'}`. All we want now is a mapping from each key in the set to a number, ending up with the equivalent of `remap` above. By leveraging dictionary comprehensions and iterators in Python, we can simply write `remap = {k: i for i, k in enumerate(keys)}` to end up with `{'unfurnished': 0, 'furnished': 1, 'semi-furnished': 2}` (the exact numbering may vary between runs, since sets are unordered). That's it! And it works for cases with tens or even thousands of categories without having to sit down and write the mappings by hand.
# Store the recoded values
recoded = []
# Loop over the furnishingstatus column
for value in df.furnishingstatus:
    # Create an array of 0's that is as long as the number
    # of keys we have
    ret = np.zeros(len(remap))
    # Set the corresponding value in the array to 1
    ret[remap[value]] = 1
    # Store the recoded array
    recoded.append(ret)
We now have the one-hot encoded values for `furnishingstatus`. However, unlike the cases above, we want to map these keys to distinct columns. Luckily, the Pandas DataFrame constructor accepts row-oriented data by default, so all we need to do is specify that we want a new dataframe with the appropriate column names:
pd.DataFrame(recoded, columns=remap.keys())
and that's it!
unfurnished | furnished | semi-furnished |
---|---|---|
0 | 1 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
0 | 1 | 0 |
Now that you know how to code it yourself, we can instead use the built-in Pandas function `pd.get_dummies`, which returns this exact result. Try it yourself: `pd.get_dummies(df.furnishingstatus)`.
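For reference, a minimal sketch of that one-liner (assuming the same `df` as above; `dtype=int` and `drop_first` are optional arguments of `pd.get_dummies`, and note that it orders the columns alphabetically, so the column order may differ from our hand-rolled version):
# One-hot encode furnishingstatus with the built-in helper
furnishing_onehot = pd.get_dummies(df.furnishingstatus, dtype=int)
# If we wanted to drop the redundant column discussed above,
# drop_first=True keeps only two of the three indicator columns:
# furnishing_onehot = pd.get_dummies(df.furnishingstatus, drop_first=True, dtype=int)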
Summarizing the recoding preprocessing
Putting everything together, we now have
def recode_yes_no(series: pd.Series) -> pd.Series:
    # When given the key 'no' return 0, otherwise if the key is 'yes' return 1
    remap = {'no': 0, 'yes': 1}
    # An array to store the new recoded results
    recoded = []
    # Loop over the series and recode the values
    for value in series:
        recoded.append(remap[value])
    # Return the recoded array as a Pandas Series object
    return pd.Series(recoded)
# Specify the Boolean columns
target_columns = ['basement', 'mainroad', 'guestroom', 'hotwaterheating', 'airconditioning', 'prefarea']
# Use `apply` to apply our function to each column.
df[target_columns] = df[target_columns].apply(recode_yes_no)
# A set is a data structure (and mathematical concept) that
# contains unique values.
keys = set()
# Loop over all the values in a column and add the value to the set
for value in df.furnishingstatus:
    keys.add(value)
remap = {k: i for i,k in enumerate(keys)}
# Store the recoded values
recoded = []
# Loop over the furnishingstatus column
for value in df.furnishingstatus:
    # Create an array of 0's that is as long as the number
    # of keys we have
    ret = np.zeros(len(remap))
    # Set the corresponding value in the array to 1
    ret[remap[value]] = 1
    # Store the recoded array
    recoded.append(ret)
furnishing_df = pd.DataFrame(recoded, columns=remap.keys())
and all we need to do now is drop the original `furnishingstatus` column, which we have now converted into new columns, and concatenate the new one-hot columns.
# Drop the categorical furnishingstatus column and concatenate the new columns
df = pd.concat([df.drop('furnishingstatus', axis=1), furnishing_df], axis=1)
That's it! We're done with all our data transformations.
Feature normalization and train-test split
We also want to scale our features so that they have zero mean and unit standard deviation, because penalized regression is sensitive to the scale of the features. Let's think about why. The penalty acts on the magnitude of the coefficients, but those magnitudes depend on the units the features happen to be measured in. If one feature lives on a much larger scale than another, e.g. area in the thousands versus parking spots in the range of perhaps 0 to 3, their coefficients will differ in size simply because of the units, and the penalty will shrink them unevenly even though the choice of units is arbitrary; only the relationships between the coefficients are really of interest. Think about it: if we measure a variable in dollars (e.g. 500,000 to 10,000,000) but then divide by a million to measure it in millions of dollars (e.g. 0.5 to 10), have we changed anything? No. The coefficient still means exactly the same thing, it just takes on a different scale.

We address this problem in the general case by subtracting the mean of each feature and dividing by its standard deviation. Subtracting the mean centers the feature at 0, and dividing by the standard deviation (the amount of spread) makes the spread roughly 1. This normalization is exactly the standard score $z = \frac{X - \mu}{\sigma}$, which is why the procedure is often simply referred to as standardizing or normalizing. Although normalizing with the standard score is the most common choice, many other normalizations exist; the second most common, and more frequently seen in deep learning, is min-max normalization.
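To make the two normalizations concrete, here is a minimal sketch on a toy array (the values are purely illustrative):
import numpy as np

# A toy feature column, purely illustrative values
x = np.array([1500.0, 3200.0, 7420.0, 9960.0, 2400.0])

# Standard score (z-score): subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# Min-max normalization: rescale the values into the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

print(z.mean(), z.std())               # approximately 0 and 1
print(x_minmax.min(), x_minmax.max())  # 0.0 and 1.0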
Caution! We want to normalize our data based only on the training data to prevent information from leaking into the testing data. So let's first split our data into 80% training data and 20% testing data by randomly shuffling an array of row numbers and then selecting the corresponding rows of `X` and `y` for each split.
# Split data into 80% train and 20% test
X = df.iloc[:,1:] # Design matrix is all columns except the response variable
y = df.iloc[:,0] # Response variable (price)
# Randomly permute an array of row numbers 0,1,2,...,rows of X
shuffled_order = np.random.permutation(np.arange(len(X)))
# Training indices are then the first 80%
train_idx = shuffled_order[0:int(len(shuffled_order)*0.8)]
# Testing indices are the 20% following the first 80%
test_idx = shuffled_order[int(len(shuffled_order)*0.8):]
# Select train and test splits for X and y
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
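For reference, an equivalent split can be obtained with scikit-learn's `train_test_split` (an extra dependency we do not otherwise use in this chapter, shown here only as a sketch):
from sklearn.model_selection import train_test_split

# Equivalent 80/20 split; random_state fixes the shuffle so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)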
We can then compute the mean and standard deviation of the training data only and use those to normalize both the training and testing data.
# Normalize dataset using the train data
# For the design matrix
X_train_mean = X_train.mean(axis=0)
X_train_std = X_train.std(axis=0)
X_train = (X_train - X_train_mean) / X_train_std
X_test = (X_test - X_train_mean) / X_train_std
# For the response variable
y_mean = y_train.mean()
y_std = y_train.std()
y_train = (y_train - y_mean) / y_std
y_test = (y_test - y_mean) / y_std
Model training and interpretation of results
We can then train our Ridge and Lasso Regression models on the training split over a range of $\lambda$, e.g. between $10^{-2}$ and $10^5$ over 100 steps with `np.logspace(-2, 5, 100)`, and observe the exact behavior discussed in the simpler case above for the L1 and L2 penalties. As $\lambda$ increases, the Ridge Regression coefficients shrink in proportion to their importance, though they never quite reach zero. In Lasso Regression, by contrast, the coefficients are shrunk aggressively all the way to zero, also in proportion to their importance. Although the direct interpretation of these coefficients is not as straightforward as in ordinary linear regression, they still provide valuable insights into the predictive features. For instance, attributes like the area of the house and the number of bathrooms emerge as strong predictors of higher selling prices, whereas the absence of furnishings tends to predict lower selling prices. Notably, in Lasso Regression, the variables `furnished` and `semi-furnished` are swiftly driven to zero.
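As a sketch of what this experiment could look like using scikit-learn's `Ridge` and `Lasso` (the from-scratch implementations developed earlier in this chapter work just as well; note that scikit-learn calls the regularization strength `alpha`, and its exact scaling relative to our $\lambda$ differs slightly between the two estimators, so only the qualitative coefficient paths should be compared):
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Regularization strengths from 10^-2 to 10^5 over 100 steps
lambdas = np.logspace(-2, 5, 100)

ridge_coefs, lasso_coefs = [], []
for lam in lambdas:
    # Fit one model per regularization strength and store its coefficients
    ridge_coefs.append(Ridge(alpha=lam).fit(X_train, y_train).coef_)
    lasso_coefs.append(Lasso(alpha=lam, max_iter=10000).fit(X_train, y_train).coef_)

# Each row is the coefficient vector for one lambda; plotting the columns
# against lambdas gives the coefficient paths described above
ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)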
So now we have trained and evaluated Ridge and Lasso regression models across a range of $\lambda$. Which value of $\lambda$ do we pick? Why do we pick that one? These are topics for the next part.
Conclusions
In this exploration of regularization in machine learning, with linear regression as our framework, we delved into critical concepts such as overfitting, underfitting, and the bias-variance trade-off, providing a foundational understanding of model performance. By introducing the train-test split, we laid the groundwork for robust model evaluation. Through the incorporation of an L2 penalty term, we illustrated how regularization enhances a model's ability to generalize to unseen data, with practical demonstrations of computing the gradient manually and of the method of Lagrange multipliers for understanding the ridge regression constraint. Our optional derivation of the Normal Equation provided further insight into where that equation comes from. We then expanded our repertoire by substituting the L2 penalty with an L1 penalty to obtain Lasso regression, enriching our understanding of regularization techniques. Finally, we applied these methods to a real-world scenario, predicting house prices, emphasizing the practical relevance of the theory. Armed with a deeper understanding of explicit regularization, we are well placed to tackle diverse machine learning challenges. In a future section of this chapter, we will explore additional explicit regularization techniques, as well as implicit techniques that are much more common in deep learning.