june 2024
sick and tired of gentle introductions to *insert an ml concept*?
worry no more, for today, we will catapult you to the wolves of linear regression.
this was an unusually long and detailed post, please do not expect this quality \(\forall\) posts :]
ok, so you have a bunch of points \( (x_1, y_1), (x_2, y_2), ... , (x_n, y_n) \), and you'd like to find a linear model that describes the relationship between \(x\) and \(y\). in other words, you want a function \( y = mx + b \) derived from your training data, so that you can estimate what \(y\) is for an unseen \(x\).
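to make this concrete, here's a toy dataset we could fit. this is just a sketch for illustration; the true slope of 2, intercept of 1, and noise level are all made up, and later snippets will reuse this `x` and `y`:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n = 100                                          # number of samples
x = rng.uniform(-5, 5, size=n)                   # inputs
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=n)   # a "true" line y = 2x + 1, plus noise
```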
there are two ways you can go about this. the first is the analytical solution, where you solve for the optimal parameters with some hardcore linear algebra, which gives you the best™ solution. the second approach involves gradient descent, which is what this post uses.
you absolutely should not get used to the luxury of having an analytical solution to your problem; this happens never
the \(m\) in the equation \( y = mx + b \) is known commonly as the slope, and \(b\) as the y-intercept. you can basically think of them as little knobs to tune; increasing \(m\) makes the line steeper, and increasing \(b\) moves the line upwards without changing the slope. that equation can be generalized to \(k\) dimensions as \(y = m_1 x_1 + m_2 x_2 + m_3 x_3 + ... + m_k x_k + b \). it's the same idea as the 2d case, you've just got more knobs to tweak.
these knobs are formally known as parameters or weights in machine learning.
to get good knob settings, you want to minimize the error. the most common way to define error is known as `mean squared error`, which, you guessed it, takes the square and then the mean of all the errors.
\[ MSE = \frac{1}{n} \sum^n_{i = 1} (y_i - \hat{y}_i)^2\]
where \(\hat{y}_i\) is our estimate of \(y_i\) produced by our linear model, defined as \( \hat{y}_i = mx_i + b\), and the error \(e_i = y_i - \hat{y}_i\) is the difference between the correct answer and our best guess. MSE squares each of these errors and averages them.
functions that quantify how wrong a model is, like MSE, are known as loss functions.
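in numpy, MSE is nearly a one-liner. a quick sketch, reusing the toy data from above and an arbitrary guess at the parameters:

```python
def mse(y, y_hat):
    """mean squared error between targets and predictions."""
    return np.mean((y - y_hat) ** 2)

m, b = 0.0, 0.0          # an arbitrary (bad) first guess at the parameters
y_hat = m * x + b
print(mse(y, y_hat))     # big number -> our flat line fits the data poorly
```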
now onto gradient descent!
the below is a general explanation of gradient descent. there are many variations of gradient descent depending on how much of the training data is used for each parameter update; we focus here on stochastic gradient descent, which updates after every single sample.
the best way to think about gradient descent is in the shoes of an adventurer navigating some vast, hilly terrain, looking for the point with the lowest altitude. your current location will have a certain steepness in each direction you can walk in. to find the point with the lowest altitude as fast as possible, you'd want to walk off the steepest cliff you can find.
in formalized math jargon, this corresponds to finding the minimum of the loss function. the steepness in each direction corresponds to the partial derivative of the loss with respect to each of its independent variables, also known as the parameters \(m\) and \(b\). to get to the minimum of the loss, we calculate the partial derivatives, which give the direction of steepest ascent, and move our parameters in the opposite direction, the direction of steepest descent.
there is a lot more complexity surrounding local vs. global minima, but for simplicity we will keep one eye shut today.
the derivation of the derivatives is relatively simple calculus, which actually took me way too long to do:
(note: from here on the slope \(m\) is renamed \(w\), for weight, to match the convention used later.)
\[ MSE = \frac{1}{n} \sum^n_{i = 1} (y_i - \hat{y}_i)^2, \space e_i = y_i - (wx_i + b)\]
\[ \frac{\partial{e_i^2}}{\partial{b}} = \frac{\partial{(y_i - wx_i - b)^2}}{\partial{b}} \]
\[ = 2(y_i - wx_i - b) \times (-1) = -2(y_i - wx_i - b) \]
\[ \frac{\partial{MSE}}{\partial{b}} = \frac{1}{n} \sum^n_{i = 1} -2(y_i - wx_i - b)\]
\[ \frac{\partial{e_i^2}}{\partial{w}} = \frac{\partial{(y_i - wx_i - b)^2}}{\partial{w}} \]
\[ = 2(y_i - wx_i - b) \times (-x_i) = -2x_i(y_i - wx_i - b)\]
\[ \frac{\partial{MSE}}{\partial{w}} = \frac{1}{n} \sum^n_{i = 1} -2x_i(y_i - wx_i - b)\]
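those two sums translate almost symbol-for-symbol into numpy. a sketch, continuing with the toy data; note this computes the full-batch gradient, averaging over all \(n\) samples at once:

```python
def gradients(x, y, w, b):
    """partial derivatives of MSE with respect to w and b."""
    residual = y - (w * x + b)        # y_i - (w x_i + b), one entry per sample
    dw = np.mean(-2 * x * residual)   # dMSE/dw, averaged over all n samples
    db = np.mean(-2 * residual)       # dMSE/db, averaged over all n samples
    return dw, db
```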
wielding one partial derivative in each hand, we can now improve our parameters by nudging each one a little in the direction opposite its gradient:
\[ w \rightarrow w - \alpha \frac{\partial{MSE}}{\partial{w}}, \space b \rightarrow b - \alpha \frac{\partial{MSE}}{\partial{b}} \]
\(\alpha\) here is a constant called the learning rate, which controls the size of each step we take towards the minimum. constants like \(\alpha\) are known as hyperparameters, which are essentially parameters for the model training; it's a parameter of an algorithm that finds parameters, pretty meta!
the step of calculating the partial derivative of the loss with respect to each parameter is called backpropagation, short for backward propagation of errors; updating the weights to go down the loss gradient is the gradient descent step itself.
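a single update step then looks like this. a sketch; the learning rate value here is an arbitrary choice, and `gradients` is the helper from the snippet above:

```python
alpha = 0.01                       # learning rate; 0.01 is an arbitrary choice

w, b = 0.0, 0.0                    # start from a bad guess
dw, db = gradients(x, y, w, b)     # gradients from the sketch above
w -= alpha * dw                    # nudge each parameter opposite its gradient
b -= alpha * db
```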
as we move from the realm of pure math to the scary world of applied math, i introduce to you matrix notation. in essence, the big hunk of formula for linear regression in higher dimensions can be vastly simplified by writing \(x\) and \(y\) along with their parameters as matrices:
\[ y = mx + b \rightarrow \vec{y} = X\vec{w} + b\]
we make the notation switch here to represent the change from just one independent variable and one sample to multiple variables (\(k\)) and multiple samples (\(n\)). as with most machine learning, the shapes are important to note: \(X\) is \((n, k)\), \(\vec{w}\) is \((k, 1)\), and \(\vec{y}\) is \((n, 1)\), while \(b\) stays a scalar.
a good habit is always to check that the dimensions work out right: \((n, k)(k, 1) + (1) = (n, 1) \), which is the shape of \(\vec{y}\).
if you're wondering why \(b\), a scalar, can be added to a vector of a different shape, look up the concept of `broadcasting`.
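here's the shape check happening in code. a sketch with made-up dimensions:

```python
n, k = 100, 3                      # number of samples, number of features
X = rng.normal(size=(n, k))        # design matrix, shape (n, k)
w_vec = rng.normal(size=(k, 1))    # weight vector, shape (k, 1)
b = 0.5                            # bias stays a scalar

y_vec = X @ w_vec + b              # (n, k) @ (k, 1) -> (n, 1); b broadcasts over it
print(y_vec.shape)                 # (100, 1)
```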
alright! that covers how we quantify the correctness of a model, how we make it better iteratively, and simple matrix notation. now we need to put the pieces together...
here is an appropriate amount of machine learning vocabulary that you may need:
an epoch represents one round of training: one full pass over the training data.
a split is a portion of the data; the train split is the data used directly in training, and the test split is the data used to evaluate the resulting model.
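a minimal way to make those splits with plain numpy. a sketch, reusing the toy data; the 80/20 ratio is just a common choice, nothing magic:

```python
idx = rng.permutation(n)           # shuffle indices so the split is random
cut = int(0.8 * n)                 # 80% train, 20% test
x_train, y_train = x[idx[:cut]], y[idx[:cut]]
x_test, y_test = x[idx[cut:]], y[idx[cut:]]
```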
and... this is the actual algorithm for linear regression with stochastic gradient descent; it should be straightforward after all that buildup:
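the version below is a minimal sketch in python and numpy, stitching the earlier snippets together; the function name, epoch count, and learning rate are my own choices, not necessarily what the linked implementation uses:

```python
def fit_sgd(x, y, alpha=0.01, epochs=50, seed=0):
    """linear regression via stochastic gradient descent:
    update w and b after every single sample, one epoch at a time."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):        # visit samples in random order
            residual = y[i] - (w * x[i] + b)     # error on this one sample
            w -= alpha * (-2 * x[i] * residual)  # step down dMSE/dw
            b -= alpha * (-2 * residual)         # step down dMSE/db
    return w, b

w, b = fit_sgd(x_train, y_train)
print(w, b)                           # should land near the true slope and intercept
print(mse(y_test, w * x_test + b))    # evaluate on the held-out test split
```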
the actual implementation of the above is left as an exercise for the reader...
just kidding, you can find it here, written with just python and numpy.