
Linear Regression as Simple as It Gets
In this article, we will look at one of the simplest machine learning algorithms, namely — linear regression.
This article is part of a series on the fundamentals of machine learning.
In the previous article, we looked at machine learning in general, without going into the details of how it works. In this article, we'll start exploring specific algorithms. And we'll begin with what I consider the simplest model — linear regression.
Understanding the Terminology
First, let's figure out what problem we're solving and how. So, we're solving a regression problem. We discussed what this kind of problem is in the previous article, but just in case, here's a reminder. We have some data $x$. It can be a single number or several numbers (such a set of numbers is called a vector). For this $x$, we want to obtain a single value as the output, which we denote as $y$. This value is continuous and can be in any range.
As an example of such a problem, consider determining a person's weight based on their height. Then our input data is a single number — the person's height in centimeters. The output is also a single number — weight in kilograms. How do we calculate this value?
The second word in the method's name gives us a hint. Our regression is linear. What does this mean? It means that to determine the output value, we'll use the formula of a straight line. Here it's worth recalling the school curriculum, which states that a straight line is defined by the following expression:
$$y = kx + b$$
This formula is easy to adapt for our case. Simply denote $y$ as the person's weight, and $x$ as their height. Actually, we need to slightly adjust the notation. Usually, $y$ denotes the correct answer, that is, in our case, the person's actual weight. We, however, are trying to make a prediction, and our result, as mentioned above, will be denoted not as $y$, but as $\hat{y}$. But let's return from notation to the essence. All we need to do is find the coefficients $k$ and $b$, and we can approximately solve the problem.
But the solution won't be very accurate (and most likely will be very inaccurate). The thing is, height and weight are not directly related by a linear dependency. There are tall and thin people, and there are short and heavy people. Such a simple formula will work poorly. To increase accuracy, we need to make the model more complex. Let's add waist circumference. Now our input data is no longer a single number $x$, but a vector $(x_1, x_2)$: $x_1$ denotes height, and $x_2$ denotes waist circumference. Our output data remains the same: weight in kilograms. The formula changes, but not radically:
$$\hat{y} = k_1 x_1 + k_2 x_2 + b$$
For successful training, we now need to find not two, but three parameters: $k_1$, $k_2$, and $b$. To further improve accuracy, we can keep adding input data, for example, leg length, chest circumference, and so on. Moreover, each new number in the input data brings with it a new model parameter that we'll need to find. In general form, if we have many parameters (let's denote this "many" as $m$), the formula will look like this:
$$\hat{y} = k_1 x_1 + k_2 x_2 + \dots + k_m x_m + b$$
or
$$\hat{y} = \sum_{j=1}^{m} k_j x_j + b$$
All that's left is to figure out how to find the parameters $k_j$ and $b$.
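Before moving on, here is what this prediction formula looks like in code. This is a minimal sketch in plain Python; the function name and the coefficients are made up purely for illustration:

```python
def predict(x, k, b):
    """Compute the model's prediction: a weighted sum of the inputs plus the shift b."""
    return sum(k_j * x_j for k_j, x_j in zip(k, x)) + b

# Height (cm) and waist circumference (cm) with made-up coefficients.
print(predict([180, 90], k=[0.8, 0.5], b=-50))  # -> 139.0
```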
Learning to Learn
First, let's return to the simplest case where we have one parameter. In this form, our formula looks like this:
$$\hat{y} = kx + b$$
As mentioned above, training consists of finding the coefficients $k$ and $b$. What should these coefficients be? We want our answers $\hat{y}$ to be as close as possible to the correct answers $y$. What does "as close as possible" mean? The difference between $y$ and $\hat{y}$ should be minimal. This can be written as
$$\left|y - \hat{y}\right| \to \min$$
Working with absolute values is inconvenient, so it's better to use the square instead. Mathematically, this is equivalent (if the absolute value tends to zero, then the square tends to zero as well):
$$\left(y - \hat{y}\right)^2 \to \min$$
If we had only one example, the problem would be trivial: choose any pair of $k$ and $b$ such that $kx + b = y$. There are infinitely many such pairs, which follows from geometric considerations: infinitely many lines can be drawn through a single point. But we want our model to work not for just one pair of values, but for all possible ones. That is, we need to draw the line so that it is on average as close as possible to all our points:
$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \to \min$$
We can ignore the denominator, since dividing by the constant $n$ doesn't change where the minimum is. Now let's recall how $\hat{y}$ is calculated and substitute it in:
$$\sum_{i=1}^{n}\left(y_i - \left(kx_i + b\right)\right)^2 \to \min$$
Optimizing such a function is called the least squares method.

In the illustration, blue dots represent the actual values, red dots are the model's predictions, and the red line is the graph of the function $\hat{y} = kx + b$. We try to minimize the sum of the distances between the red and blue dots (dashed lines).
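To make this concrete in code, here is a small sketch that computes exactly this sum of squared distances for a candidate line. The function name and the numbers are purely illustrative:

```python
def squared_error(xs, ys, k, b):
    """Sum of squared differences between the correct answers and the predictions k*x + b."""
    return sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))

# The smaller this value, the better the candidate line fits the points.
print(squared_error([154, 196, 172], [43, 107, 73], k=1.3, b=-150))
```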
Optimizing
Let's look at our function:
$$L(k, b) = \sum_{i=1}^{n}\left(y_i - \left(kx_i + b\right)\right)^2$$
What does it depend on? The input data $x_i$ and the correct answers $y_i$ are fixed. Therefore, the sum depends only on our parameters $k$ and $b$. It turns out we have a function of two variables, and we need to find its minimum. If we draw the surface described by this function, we get the following picture:

If we recall school mathematics again, the extremum (minimum or maximum) of a function is found where the derivative equals zero (if such a point exists for the function). This is exactly our case: from the graph, we see that there is only one extremum, and it will be the minimum.
The most important thing is that we can compute the derivative separately for each of our variables. The derivative of a sum is the sum of the derivatives of its terms, so it's enough to differentiate each term. Let's recall the chain rule:
$$\left(f\left(g(x)\right)\right)' = f'\left(g(x)\right) \cdot g'(x)$$
First, we compute the derivative with respect to $b$:
$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n} 2\left(y_i - \left(kx_i + b\right)\right) \cdot (-1) = -2\sum_{i=1}^{n}\left(y_i - kx_i - b\right)$$
Let's remove the minus in front of the sum by simply swapping the expressions in parentheses:
$$\frac{\partial L}{\partial b} = 2\sum_{i=1}^{n}\left(kx_i + b - y_i\right)$$
Since we no longer have a square, we can expand the expression into several sums:
$$\frac{\partial L}{\partial b} = 2\left(k\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} b - \sum_{i=1}^{n} y_i\right)$$
Note that $\sum_{i=1}^{n} b$ is simply adding $b$ to itself $n$ times, that is, $nb$.
Now recall that we're looking for the point where the derivative equals zero. So let's set our expression equal to zero:
$$2\left(k\sum_{i=1}^{n} x_i + nb - \sum_{i=1}^{n} y_i\right) = 0$$
Cancel out the twos and rearrange the terms to express $b$:
$$nb = \sum_{i=1}^{n} y_i - k\sum_{i=1}^{n} x_i$$
Now let's compute the derivative with respect to $k$ in the same way:
$$\frac{\partial L}{\partial k} = -2\sum_{i=1}^{n} x_i\left(y_i - kx_i - b\right)$$
Rearrange the terms to remove the minus:
$$\frac{\partial L}{\partial k} = 2\sum_{i=1}^{n} x_i\left(kx_i + b - y_i\right)$$
Set it equal to zero and cancel the two:
$$\sum_{i=1}^{n} x_i\left(kx_i + b - y_i\right) = 0$$
As a result, we get a system of two equations:
$$\begin{cases} \displaystyle\sum_{i=1}^{n} x_i\left(kx_i + b - y_i\right) = 0 \\[2ex] \displaystyle nb = \sum_{i=1}^{n} y_i - k\sum_{i=1}^{n} x_i \end{cases}$$
This system can be simplified a bit. Let's look at the second equation and divide both sides by $n$:
$$b = \frac{\sum_{i=1}^{n} y_i}{n} - k\,\frac{\sum_{i=1}^{n} x_i}{n}$$
What is $\frac{\sum_{i=1}^{n} x_i}{n}$? It's the mean value of $x$ (because we're dividing the sum of all elements by their count). Let's denote it as $\bar{x}$. We'll do the same for $y$. As a result, we get:
$$b = \bar{y} - k\bar{x}$$
You must admit, without all those sums it looks much simpler. Now let's do what we did in school: substitute $b$ into the first equation.
$$\sum_{i=1}^{n} x_i\left(kx_i + \bar{y} - k\bar{x} - y_i\right) = 0$$
Let's simplify everything we can and expand the brackets:
$$k\sum_{i=1}^{n} x_i\left(x_i - \bar{x}\right) - \sum_{i=1}^{n} x_i\left(y_i - \bar{y}\right) = 0$$
In principle, this is already enough to calculate everything. We can simply move the second term to the other side and divide:
$$k = \frac{\sum_{i=1}^{n} x_i\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n} x_i\left(x_i - \bar{x}\right)}$$
The formula can be simplified further. I'll show exactly how at the end of the article, so as not to add even more formulas here. As a result, we get:
$$k = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad b = \bar{y} - k\bar{x}$$
Thus, for any set of $x$ and $y$, we can calculate our coefficients. After all, $k$ doesn't depend on $b$: we calculate $k$ using its formula, and then, knowing $k$, we find $b$. And now we have our first working machine learning algorithm.
Now that we have all our coefficients, we can easily calculate the answer for any input parameters using the same formula:
$$\hat{y} = kx + b$$
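If we translate these formulas directly into code, training boils down to a few sums. Below is a minimal sketch in plain Python; the function name is illustrative and not taken from any library:

```python
def fit_line(xs, ys):
    """Return (k, b) minimizing the sum of squared errors for y ≈ k*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # k = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
    k = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    # Knowing k, b follows directly from the means.
    b = y_mean - k * x_mean
    return k, b
```

Training here is literally a couple of passes over the data, which is a big part of why linear regression is so fast.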
Example Usage
Let's return to the problem of determining a person's weight based on their height. Suppose we have the following dataset:
| Height (x), cm | Weight (y), kg |
|---|---|
| 154 | 43 |
| 196 | 107 |
| 172 | 73 |
| 185 | 80 |
| 161 | 66 |
First, let's find the mean values. For $x$ it will be $\bar{x} = \frac{154 + 196 + 172 + 185 + 161}{5} = 173.6$, for $y$ it will be $\bar{y} = \frac{43 + 107 + 73 + 80 + 66}{5} = 73.8$. Now we need to find the coefficients. Let's start with $k$. For each element, we calculate the deviations $x_i - \bar{x}$ and $y_i - \bar{y}$:
| $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ |
|---|---|---|---|
| 154 | 43 | -19.6 | -30.8 |
| 196 | 107 | 22.4 | 33.2 |
| 172 | 73 | -1.6 | -0.8 |
| 185 | 80 | 11.4 | 6.2 |
| 161 | 66 | -12.6 | -7.8 |
Now we have everything to calculate $k$ using the formula:
$$k = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
As a result, we get
$$k = \frac{603.68 + 743.68 + 1.28 + 70.68 + 98.28}{384.16 + 501.76 + 2.56 + 129.96 + 158.76} = \frac{1517.6}{1177.2} \approx 1.29$$
Now let's calculate $b$:
$$b = \bar{y} - k\bar{x} = 73.8 - 1.29 \cdot 173.6 \approx -150.1$$
Now that we have all the coefficients, we can predict the weight for any height. Let's say we have a basketball player who is 210 centimeters tall. Let's try to predict their weight:
$$\hat{y} = 1.29 \cdot 210 - 150.1 = 120.8$$
Looks quite plausible.
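If you'd like to double-check the arithmetic, NumPy's `polyfit` with degree 1 performs exactly this kind of least-squares line fit (the decimals may differ slightly from the rounded values above):

```python
import numpy as np

heights = np.array([154, 196, 172, 185, 161])
weights = np.array([43, 107, 73, 80, 66])

k, b = np.polyfit(heights, weights, deg=1)  # least-squares fit of a straight line
print(k, b)          # approximately 1.29 and -150
print(k * 210 + b)   # predicted weight for a 210 cm person, roughly 121 kg
```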
Generalizing
Now that we know how to find coefficients for the one-dimensional case, let's try to do it for the multidimensional case. Then our main formula will be:
$$\hat{y} = k_1 x_1 + k_2 x_2 + \dots + k_m x_m + b$$
And the function we'll optimize:
$$L = \sum_{i=1}^{n}\left(y_i - \left(k_1 x_{i,1} + k_2 x_{i,2} + \dots + k_m x_{i,m} + b\right)\right)^2$$
The optimization itself follows the same principle. We compute a derivative with respect to each coefficient $k_j$, and one derivative with respect to $b$. As a result, we get a system of $m + 1$ equations with $m + 1$ unknowns.
Solving such a system is already beyond the scope of school mathematics, but it's entirely feasible. As a result, we'll find all our coefficients in the same way.
If in the case of one argument our function defined a straight line, in the case of two arguments it will be a plane, and with more arguments — a hyperplane.
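For the multidimensional case, solving the system by hand quickly becomes tedious, so in practice one would hand it to a linear-algebra routine. Here's a sketch using NumPy's least-squares solver; the two-feature dataset is made up purely for illustration:

```python
import numpy as np

# Each row is one person: [height_cm, waist_cm]; the numbers are made up for illustration.
X = np.array([[154, 70], [196, 95], [172, 80], [185, 88], [161, 75]], dtype=float)
y = np.array([43, 107, 73, 80, 66], dtype=float)

# Append a column of ones so that the last coefficient plays the role of b.
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of X_b @ coeffs ≈ y.
coeffs, *_ = np.linalg.lstsq(X_b, y, rcond=None)
*k, b = coeffs
print(k, b)  # one coefficient per feature, plus the shift b
```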
Not Everything Is So Rosy
It would seem that if training is so fast and simple (just calculate a few formulas), why isn't linear regression used everywhere? The answer lies in the word "linear." And now let's try to figure out why.
Let's consider the simplest case again. The dependence of $\hat{y}$ on $x$ is linear. This means we can make good predictions only if the real-world dependency between the data is linear or close to it. If there are significant deviations from a linear relationship, the linear regression model will still produce some result, but it will differ substantially from reality.

In the image above, we see a nonlinear distribution of real data (blue dots). The model was trained, trying to minimize the distances between the real data and the predictions (red dots). But even though the distance is the minimum possible for this case, the model's prediction can be either relatively accurate (2nd and 4th points) or very inaccurate depending on the region.
Variants with two or more arguments are also subject to this limitation (but instead of a straight line, it's a plane or hyperplane).
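A quick way to see the problem is to fit a line to data that is deliberately nonlinear; the synthetic data below is invented just for this demonstration:

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = x ** 2                      # a deliberately nonlinear relationship

k, b = np.polyfit(x, y, deg=1)  # the "best" straight line still exists...
print(np.abs(y - (k * x + b)).max())  # ...but in places it misses the data badly
```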
It is precisely because of this characteristic that linear regression is used relatively rarely. And it's only applied after verifying that the data indeed has a linear relationship. However, when such a relationship exists, we get a very fast tool — much faster than any other machine learning model.
For example, linear regression is very often used in economics, where the linearity of the relationship is well known: forecasting sales, assets, GDP, and so on. It is also used by banks and insurance companies. In this field, it's important not only to make a prediction but also to explain it. And linear regression is the best fit for this. All of its parameters are visible and understandable. Unlike, for example, neural networks, which, as we'll see in the following articles of this series, are a "black box" to an outside observer.
Afterword
As promised, I'm showing how the coefficient formulas can be simplified. For this, we'll need a couple of mathematical tricks and some knowledge of statistics.
Let me remind you of the original expression:
$$k\sum_{i=1}^{n} x_i\left(x_i - \bar{x}\right) - \sum_{i=1}^{n} x_i\left(y_i - \bar{y}\right) = 0$$
We'll consider only its first term. Let's add the mean value $\bar{x}$ to $x_i$, and immediately subtract it for compensation:
$$k\sum_{i=1}^{n}\left(x_i - \bar{x} + \bar{x}\right)\left(x_i - \bar{x}\right)$$
Expand the brackets:
$$k\sum_{i=1}^{n}\left[\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right) + \bar{x}\left(x_i - \bar{x}\right)\right]$$
Divide the sum into two parts:
$$k\left(\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right) + \sum_{i=1}^{n}\bar{x}\left(x_i - \bar{x}\right)\right)$$
In the first part, we see the formula for the square of a difference:
$$k\left(\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 + \sum_{i=1}^{n}\bar{x}\left(x_i - \bar{x}\right)\right)$$
Now let's deal with the second part. We factor $\bar{x}$ out of the sum:
$$k\left(\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 + \bar{x}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\right)$$
And now comes that very trick from statistics. $x_i - \bar{x}$ is the deviation from the mean. And the most interesting thing is that the sum of such deviations is always zero:
$$\sum_{i=1}^{n}\left(x_i - \bar{x}\right) = \sum_{i=1}^{n} x_i - n\bar{x} = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i = 0$$
So in this part of the equation, only one term remains:
$$k\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
Let's return to the original equation and process its second part in the same way:
$$\sum_{i=1}^{n}\left(x_i - \bar{x} + \bar{x}\right)\left(y_i - \bar{y}\right) = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right) + \bar{x}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)$$
The second term will also become zero, and we get the final equation:
$$k\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 - \sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right) = 0$$
From here, moving the second term to the right, changing its sign, and dividing, we get:
$$k = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
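If the algebra doesn't feel convincing, the two forms of the formula can also be compared numerically. Here's a small sketch using the dataset from the example above:

```python
xs = [154, 196, 172, 185, 161]
ys = [43, 107, 73, 80, 66]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n

# The form before the simplification.
k_raw = sum(x * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum(x * (x - x_mean) for x in xs)

# The simplified form with deviations from the means.
k_simple = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
           / sum((x - x_mean) ** 2 for x in xs)

print(abs(k_raw - k_simple) < 1e-9)  # True: both formulas give the same k
```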
Questions and Answers
What does the word "linear" in the name mean?
It means the model assumes a linear (straight-line) relationship between the input data and the output. The prediction formula is the equation of a straight line (or a plane/hyperplane equation for multiple features). If the real data follows a more complex pattern, the model will produce inaccurate results.
Why do we minimize the square of the difference instead of the absolute value?
The absolute value is mathematically inconvenient: it has no derivative at zero, which prevents using standard optimization methods. The squared difference has a derivative everywhere and serves the same purpose: the larger the error, the larger the function's value. Additionally, the square penalizes large errors more heavily, which is often useful in practice.
What is the least squares method?
It's a way of fitting model parameters by minimizing the sum of squared differences between the correct answers and the model's predictions. Geometrically, it means drawing a straight line (or plane) so that the total distance from it to all data points is as small as possible.
Why does setting the derivatives to zero give the minimum?
From school mathematics, we know that the extremum (minimum or maximum) of a function is found where the derivative equals zero. The loss function of linear regression is a paraboloid, a surface with a single minimum. Therefore, the point where the derivative equals zero is guaranteed to give us a minimum, not a maximum.
How does the model work when there is more than one input feature?
The principle remains the same, only the formula expands. Instead of $\hat{y} = kx + b$, we get $\hat{y} = k_1 x_1 + k_2 x_2 + \dots + k_m x_m + b$, where each feature $x_j$ has its own coefficient $k_j$. For two features, the model builds a plane; for more features, a hyperplane. The training process is analogous: compute derivatives with respect to each coefficient, set them equal to zero, and solve the resulting system of equations.
Why is linear regression used relatively rarely?
The main limitation is that the model can only describe linear relationships well. In real-world data, the connection between features and the target is often nonlinear, and a straight line (or plane) won't be able to capture it. That's why linear regression is used only where the relationship is indeed close to linear, and always after verifying the data. But in such cases, it works significantly faster than more complex models.
What is a vector in this context?
A vector is simply an ordered set of numbers. For example, if we're determining a person's weight from their height and waist circumference, the input data consists of two numbers: $(x_1, x_2)$, where $x_1$ is height and $x_2$ is waist circumference. This is a two-element vector. The more features we use, the more elements the vector has and the more coefficients we need to find during training.
Why does the model need the coefficient $b$?
Without $b$, the formula would look like $\hat{y} = kx$, meaning the line would always pass through the origin. This severely limits the model, since not all relationships pass through the point $(0, 0)$. The coefficient $b$ allows shifting the line up or down, making the model much more flexible and accurate.