Artificial Intelligence
Machine Learning
April 17
15 minutes

Linear Regression as Simple as It Gets

In this article, we will look at one of the simplest machine learning algorithms, namely — linear regression.


Vyacheslav Gorash

3D Graphics and Machine Learning Developer with 6 years of experience

This article is part of a series on the fundamentals of machine learning.

In the previous article, we looked at machine learning in general, without going into the details of how it works. In this article, we'll start exploring specific algorithms. And we'll begin with what I consider the simplest model — linear regression.

Understanding the Terminology

First, let's figure out what problem we're solving and how. So, we're solving a regression problem. We discussed what this kind of problem is in the previous article, but just in case, here's a reminder. We have some data $X$. It can be a single number or several (such a set of numbers is called a vector). As the output, we want to obtain a single value for $X$, which we denote as $\widehat{y}$. This value is continuous and can lie in any range.

As an example of such a problem, consider determining a person's weight based on their height. Then our input data is a single number — the person's height in centimeters. The output is also a single number — weight in kilograms. How do we calculate this value?

The first word in the method's name gives us a hint. Our regression is linear. What does this mean? It means that to determine the output value, we'll use the formula of a straight line. Here it's worth recalling the school curriculum, which states that a straight line is defined by the following expression:

$$y = kx + b$$

This formula is easy to adapt for our case. Simply let $y$ denote the person's weight and $x$ their height. Actually, we need to adjust the notation slightly. Usually, $y$ denotes the correct answer, that is, in our case, the person's actual weight. We, however, are trying to make a prediction, and our result, as mentioned above, will be denoted not as $y$ but as $\widehat{y}$. But let's return from notation to the essence. All we need to do is find the coefficients $k$ and $b$, and we can approximately solve the problem.

But the solution won't be very accurate (and most likely will be very inaccurate). The thing is, height and weight are not related by an exact linear dependency. There are tall and thin people, and there are short and heavy people. Such a simple formula will work poorly. To increase accuracy, we need to make the model more complex. Let's add waist circumference. Now our input data is no longer a single number $x$ but a vector $\left\lbrack x_{1},\ x_{2} \right\rbrack$: $x_{1}$ denotes height, and $x_{2}$ denotes waist circumference. Our output data remains the same: weight in kilograms. The formula changes, but not radically:

$$\widehat{y} = k_{1}x_{1} + k_{2}x_{2} + b$$

For successful training, we now need to find not two but three parameters: $k_{1}$, $k_{2}$, and $b$. To further improve accuracy, we can keep adding input data, for example, leg length, chest circumference, and so on. Moreover, each new number in the input data brings with it a new model parameter that we'll need to find. In general form, if we have many parameters (let's denote this "many" as $m$), the formula will look like this:

$$\widehat{y} = k_{1}x_{1} + k_{2}x_{2} + \ldots + k_{m}x_{m} + b$$

or

$$\widehat{y} = \sum_{i = 1}^{m} k_{i}x_{i} + b$$
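Before we get to training, here's a minimal Python sketch of what computing this formula looks like in code (the coefficients and the function name are made up purely for illustration):

```python
# A sketch of the prediction formula: y_hat = k_1*x_1 + ... + k_m*x_m + b.

def linear_output(x, k, b):
    """Compute the linear model's output for one input vector x."""
    assert len(x) == len(k), "one coefficient per input number"
    return sum(k_i * x_i for k_i, x_i in zip(k, x)) + b

# Two inputs (say, height in cm and waist circumference in cm) with made-up coefficients:
print(linear_output([172, 80], k=[0.9, 0.3], b=-100))  # -> 78.8
```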

All that's left is to figure out how to find the parameters $k_{i}$ and $b$.

Learning to Learn

First, let's return to the simplest case, where the input is a single number. In this form, our formula looks like this:

$$\widehat{y} = kx + b$$

As mentioned above, training consists of finding the coefficients $k$ and $b$. What should these coefficients be? We want our answers $\widehat{y}$ to be as close as possible to the correct answers $y$. What does "as close as possible" mean? The difference between $y$ and $\widehat{y}$ should be minimal. This can be written as

$$\left| y - \widehat{y} \right| \rightarrow 0$$

Working with absolute values is inconvenient, so it's better to use the square instead. Mathematically, this is equivalent (if the absolute value tends to zero, then the square tends to zero as well):

$$\left( y - \widehat{y} \right)^{2} \rightarrow 0$$

If we had only one example, the problem would be trivial: choose any pair of $k$ and $b$ such that $kx + b = y$. There are infinitely many such pairs, which follows from geometric considerations: infinitely many lines can be drawn through a single point. But we want our model to work not for just one pair of values, but for all possible ones. That is, we need to draw the line so that it is on average as close as possible to all our points:

$$\frac{\sum_{i = 1}^{n} \left( y_{i} - \widehat{y_{i}} \right)^{2}}{n} \rightarrow 0$$

We can drop the denominator, since dividing by the constant $n$ doesn't change where the minimum is. Now let's recall how $\widehat{y}$ is calculated:

$$\sum_{i = 1}^{n}\left( y_{i} - \left( kx_{i} + b \right) \right)^{2} \rightarrow 0$$

Minimizing a function of this form is called the method of least squares.

Illustration of the least squares method

In the illustration, blue dots represent the actual values, red dots the model's predictions, and the red line the graph of the function $y = kx + b$. We try to minimize the sum of the (squared) vertical distances between the red and blue dots (dashed lines).
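To make this objective concrete, here's a minimal Python sketch (the data and parameters are made up) of the quantity we're trying to minimize:

```python
# The sum of squared differences between the correct answers y_i
# and the line's predictions k*x_i + b (the least squares objective).

def sum_of_squared_errors(xs, ys, k, b):
    return sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up data and two candidate lines: the smaller the value, the better the fit.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]
print(sum_of_squared_errors(xs, ys, k=2.0, b=0.0))  # a good line, 0.07
print(sum_of_squared_errors(xs, ys, k=1.0, b=1.0))  # a worse one, 15.27
```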

Optimizing

Let's look at our function:

$$\sum_{i = 1}^{n}\left( y_{i} - \left( kx_{i} + b \right) \right)^{2}$$

What does it depend on? The input data $x_{i}$ and the correct answers $y_{i}$ are fixed. Therefore, the sum depends only on our parameters $k$ and $b$. It turns out we have a function of two variables, and we need to find its minimum. If we draw the surface described by this function, we get the following picture:

Surface described by a quadratic function

If we recall school mathematics again, the extremum (minimum or maximum) of a function is found where the derivative equals zero (if such a point exists for the function). This is exactly our case: from the graph, we see that there is only one extremum, and it will be the minimum.

The most important thing is that we can compute the derivative with respect to each of our variables separately. The sum itself is just an addition of terms, so the derivative of the sum is the sum of the derivatives of its terms. To differentiate each term, let's recall the chain rule:

$$\frac{d}{dx} f(g(x)) = \frac{d}{dg} f(g(x)) \cdot \frac{d}{dx} g(x)$$

First, we compute the derivative with respect to $b$:

$$\frac{d}{db}\sum_{i = 1}^{n}\left( y_{i} - \left( kx_{i} + b \right) \right)^{2} = -2\sum_{i = 1}^{n}\left( y_{i} - \left( kx_{i} + b \right) \right)$$

Let's remove the minus in front of the sum by simply swapping the expressions in parentheses:

$$-2\sum_{i = 1}^{n}\left( y_{i} - \left( kx_{i} + b \right) \right) = 2\sum_{i = 1}^{n}\left( \left( kx_{i} + b \right) - y_{i} \right)$$

Since we no longer have a square, we can expand the expression into several sums:

$$2\sum_{i = 1}^{n}\left( \left( kx_{i} + b \right) - y_{i} \right) = 2\sum_{i = 1}^{n} kx_{i} + 2\sum_{i = 1}^{n} b - 2\sum_{i = 1}^{n} y_{i}$$

Note that $2\sum_{i = 1}^{n} b$ is simply $b$ added up $n$ times and then doubled, that is, $2nb$.

Now recall that we're looking for where the derivative equals zero. So let's set our expression equal to zero:

$$2\sum_{i = 1}^{n} kx_{i} + 2nb - 2\sum_{i = 1}^{n} y_{i} = 0$$

Cancel out the twos and rearrange the terms to express $b$:

$$\sum_{i = 1}^{n} kx_{i} + nb - \sum_{i = 1}^{n} y_{i} = 0$$

$$nb = \sum_{i = 1}^{n} y_{i} - k\sum_{i = 1}^{n} x_{i}$$

$$b = \frac{\sum_{i = 1}^{n} y_{i} - k\sum_{i = 1}^{n} x_{i}}{n}$$

Now let's compute the derivative with respect to $k$ in the same way:

$$\frac{d}{dk} \sum_{i = 1}^{n} \left( y_{i} - \left( kx_{i} + b \right) \right)^{2} = -2\sum_{i = 1}^{n}\left( \left( y_{i} - \left( kx_{i} + b \right) \right) \cdot x_{i} \right)$$

Rearrange the terms to remove the minus:

$$-2\sum_{i = 1}^{n}\left( \left( y_{i} - \left( kx_{i} + b \right) \right) \cdot x_{i} \right) = 2\sum_{i = 1}^{n}\left( \left( kx_{i} + b - y_{i} \right) \cdot x_{i} \right)$$

Set it equal to zero and cancel the two:

$$2\sum_{i = 1}^{n} \left( \left( kx_{i} + b - y_{i} \right) \cdot x_{i} \right) = 0$$

$$\sum_{i = 1}^{n}\left( \left( kx_{i} + b - y_{i} \right) \cdot x_{i} \right) = 0$$

As a result, we get a system of two equations:

$$\begin{cases} \sum_{i = 1}^{n} \left( \left( kx_{i} + b - y_{i} \right) \cdot x_{i} \right) = 0 \\ b = \dfrac{\sum_{i = 1}^{n} y_{i} - k\sum_{i = 1}^{n} x_{i}}{n} \end{cases}$$

This system can be simplified a bit. Let's rewrite its second equation as:

$$b = \frac{\sum_{i = 1}^{n} y_{i}}{n} - k \cdot \frac{\sum_{i = 1}^{n} x_{i}}{n}$$

What is $\frac{\sum_{i = 1}^{n} y_{i}}{n}$? It's the mean value of $y$ (because we're dividing the sum of all elements by their count). Let's denote it as $\overline{y}$. We'll do the same for $x$. As a result, we get:

$$b = \overline{y} - k\overline{x}$$

You have to admit, without all those sums it looks much simpler. Now let's do what we did in school: substitute $b$ into the first equation.

$$\sum_{i = 1}^{n} \left( \left( kx_{i} + b - y_{i} \right) \cdot x_{i} \right) = 0$$

$$\sum_{i = 1}^{n} \left( \left( kx_{i} + \overline{y} - k\overline{x} - y_{i} \right) \cdot x_{i} \right) = 0$$

Let's simplify everything we can and expand the brackets:

$$\sum_{i = 1}^{n} \left( \left( k\left( x_{i} - \overline{x} \right) + \overline{y} - y_{i} \right) \cdot x_{i} \right) = 0$$

$$k \sum_{i = 1}^{n} \left( \left( x_{i} - \overline{x} \right) \cdot x_{i} \right) + \sum_{i = 1}^{n} \left( \left( \overline{y} - y_{i} \right) \cdot x_{i} \right) = 0$$

In principle, this is already enough to calculate everything. We can simply move the second term to the other side and divide:

$$k \sum_{i = 1}^{n} \left( \left( x_{i} - \overline{x} \right) \cdot x_{i} \right) = -\sum_{i = 1}^{n} \left( \left( \overline{y} - y_{i} \right) \cdot x_{i} \right)$$

$$k = -\frac{\sum_{i = 1}^{n} \left( \left( \overline{y} - y_{i} \right) \cdot x_{i} \right)}{\sum_{i = 1}^{n} \left( \left( x_{i} - \overline{x} \right) \cdot x_{i} \right)}$$

The formula can be simplified further. I'll show exactly how at the end of the article, so as not to add even more formulas here. As a result, we get:

$$\begin{cases} k = \dfrac{\sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)\left( y_{i} - \overline{y} \right)}{\sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)^{2}} \\ b = \overline{y} - k\overline{x} \end{cases}$$

Thus, for any set of $x$ and $y$, we can calculate our coefficients. After all, $k$ doesn't depend on $b$: we calculate it using the formula, and then, knowing $k$, we find $b$. And now we have our first working machine learning algorithm.

Now that we have all our coefficients, we can easily calculate the answer for any input using the same formula:

$$\widehat{y} = kx + b$$
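Here's a minimal Python sketch of this training procedure (the function names are our own; the code simply transcribes the two formulas above):

```python
# Closed-form training of simple linear regression:
#   k = sum((x_i - x_mean) * (y_i - y_mean)) / sum((x_i - x_mean)^2)
#   b = y_mean - k * x_mean

def fit(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    k = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - k * x_mean
    return k, b

def predict(x, k, b):
    return k * x + b
```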

Example Usage

Let's return to the problem of determining a person's weight based on their height. Suppose we have the following dataset:

| Height ($x$), cm | Weight ($y$), kg |
| --- | --- |
| 154 | 43 |
| 196 | 107 |
| 172 | 73 |
| 185 | 80 |
| 161 | 66 |

First, let's find the mean values. For $x$ it is $\overline{x} = 173.6$, for $y$ it is $\overline{y} = 73.8$. Now we need to find the coefficients. Let's start with $k$. For each element, we calculate:

| $x$ | $y$ | $x - \overline{x}$ | $y - \overline{y}$ |
| --- | --- | --- | --- |
| 154 | 43 | -19.6 | -30.8 |
| 196 | 107 | 22.4 | 33.2 |
| 172 | 73 | -1.6 | -0.8 |
| 185 | 80 | 11.4 | 6.2 |
| 161 | 66 | -12.6 | -7.8 |

Now we have everything to calculate $k$ using the formula:

$$k = \frac{\sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)\left( y_{i} - \overline{y} \right)}{\sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)^{2}}$$

As a result, we get $k = \frac{1517.6}{1177.2} \approx 1.29$.

Now let's calculate $b$:

$$b = \overline{y} - k \cdot \overline{x} = 73.8 - 1.29 \cdot 173.6 = -150.144$$

Now that we have all the coefficients, we can predict the weight for any height. Let's say we have a basketball player who is 210 centimeters tall. Let's try to predict their weight:

$$210 \cdot 1.29 - 150.144 \approx 120.75$$

Looks quite plausible.
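The same numbers are easy to reproduce in code. Here's a small sketch (the exact decimals differ slightly because above we rounded $k$ to 1.29):

```python
# Reproducing the height/weight example with the closed-form formulas.

heights = [154, 196, 172, 185, 161]
weights = [43, 107, 73, 80, 66]

n = len(heights)
x_mean = sum(heights) / n   # 173.6
y_mean = sum(weights) / n   # 73.8

k = sum((x - x_mean) * (y - y_mean) for x, y in zip(heights, weights)) \
    / sum((x - x_mean) ** 2 for x in heights)   # ~1.289
b = y_mean - k * x_mean                          # ~-150.0

print(k, b)
print(k * 210 + b)   # predicted weight for a 210 cm basketball player, ~120.7 kg
```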

Generalizing

Now that we know how to find coefficients for the one-dimensional case, let's try to do it for the multidimensional case. Then our main formula will be:

$$\widehat{y} = k_{1}x_{1} + k_{2}x_{2} + \ldots + k_{m}x_{m} + b$$

And the function we'll optimize:

$$\sum_{i = 1}^{n} \left( y_{i} - \left( k_{1}x_{1i} + k_{2}x_{2i} + \ldots + k_{m}x_{mi} + b \right) \right)^{2}$$

The optimization itself follows the same principle. We compute $m$ derivatives, one with respect to each coefficient $k_{i}$, and one derivative with respect to $b$. As a result, we get a system of $m + 1$ equations with $m + 1$ unknowns.

Solving such a system is already beyond the scope of school mathematics, but it's entirely feasible. As a result, we'll find all our coefficients in the same way.
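In practice, nobody solves this system by hand; it's handled numerically. Here's a minimal sketch, assuming NumPy is available (the data and the function name are invented for illustration):

```python
import numpy as np

def fit_linear_regression(X, y):
    """X: (n, m) matrix of inputs, y: (n,) vector of correct answers.
    Returns the coefficient vector k (length m) and the intercept b."""
    n = X.shape[0]
    # Append a column of ones so that b becomes just another coefficient.
    X1 = np.hstack([X, np.ones((n, 1))])
    # Least-squares solution of X1 @ w ~= y (the same minimum we found
    # by setting the derivatives to zero).
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w[:-1], w[-1]

# Made-up data: height and waist circumference -> weight.
X = np.array([[172.0, 80.0], [185.0, 95.0], [161.0, 70.0], [196.0, 102.0]])
y = np.array([73.0, 88.0, 66.0, 107.0])
k, b = fit_linear_regression(X, y)
print(k, b)
```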

If in the case of one argument our function defined a straight line, in the case of two arguments it will be a plane, and with more arguments — a hyperplane.

Not Everything Is So Rosy

It would seem that if training is so fast and simple (just calculate a few formulas), why isn't linear regression used everywhere? The answer lies in the word "linear." And now let's try to figure out why.

Let's consider the simplest case again. The dependence of $y$ on $x$ is linear. This means we can make good predictions only if the real-world dependency between the data is linear or close to it. If there are significant deviations from a linear relationship, the linear regression model will still produce some result, but it will differ substantially from reality.

The problem with nonlinear dependency

In the image above, we see a nonlinear distribution of real data (blue dots). The model was trained to minimize the distances between the real data and the predictions (red dots). But even though that distance is as small as it can be in this case, the model's prediction is relatively accurate in some regions (the 2nd and 4th points) and very inaccurate in others.

Variants with two or more arguments are also subject to this limitation (but instead of a straight line, it's a plane or hyperplane).

It is precisely because of this characteristic that linear regression is used relatively rarely. And it's only applied after verifying that the data indeed has a linear relationship. However, when such a relationship exists, we get a very fast tool — much faster than any other machine learning model.

For example, linear regression is very often used in economics, where the linearity of the relationship is well known: forecasting sales, assets, GDP, and so on. It is also used by banks and insurance companies. In this field, it's important not only to make a prediction but also to explain it. And linear regression is the best fit for this. All of its parameters are visible and understandable. Unlike, for example, neural networks, which, as we'll see in the following articles of this series, are a "black box" to an outside observer.
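If you want to try it in practice without coding the formulas by hand, a library implementation will do. As a sketch, assuming scikit-learn is installed, the same height/weight data can be fitted like this:

```python
from sklearn.linear_model import LinearRegression

X = [[154], [196], [172], [185], [161]]   # heights, one feature per sample
y = [43, 107, 73, 80, 66]                 # weights

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)   # roughly k ~ 1.29 and b ~ -150
print(model.predict([[210]]))          # prediction for a 210 cm person
```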

Afterword

As promised, I'm showing how the coefficient formulas can be simplified. For this, we'll need a couple of mathematical tricks and some knowledge of statistics.

Let me remind you of the original expression:

$$k \sum_{i = 1}^{n} \left( \left( x_{i} - \overline{x} \right) \cdot x_{i} \right) + \sum_{i = 1}^{n} \left( \left( \overline{y} - y_{i} \right) \cdot x_{i} \right) = 0$$

We'll consider only its first term. Let's add the mean value to $x_{i}$ and immediately subtract it to compensate:

$$k \sum_{i = 1}^{n} \left( \left( x_{i} - \overline{x} \right) \cdot x_{i} \right) = k \sum_{i = 1}^{n} \left( \left( x_{i} - \overline{x} \right) \cdot \left( x_{i} + \overline{x} - \overline{x} \right) \right)$$

Expand the brackets:

$$k \sum_{i = 1}^{n} \left( \left( x_{i} - \overline{x} \right) \cdot \left( x_{i} + \overline{x} - \overline{x} \right) \right) = k \sum_{i = 1}^{n} \left( x_{i}^{2} + x_{i}\overline{x} - x_{i}\overline{x} - x_{i}\overline{x} - {\overline{x}}^{2} + {\overline{x}}^{2} \right)$$

Divide the sum into two parts:

$$k \sum_{i = 1}^{n} \left( x_{i}^{2} + x_{i}\overline{x} - x_{i}\overline{x} - x_{i}\overline{x} - {\overline{x}}^{2} + {\overline{x}}^{2} \right) = k \sum_{i = 1}^{n} \left( x_{i}^{2} - x_{i}\overline{x} - x_{i}\overline{x} + {\overline{x}}^{2} \right) + k \sum_{i = 1}^{n} \left( x_{i}\overline{x} - {\overline{x}}^{2} \right)$$

In the first part, we see the formula for the square of a difference:

$$k \sum_{i = 1}^{n} \left( x_{i}^{2} - x_{i}\overline{x} - x_{i}\overline{x} + {\overline{x}}^{2} \right) = k \sum_{i = 1}^{n} \left( x_{i}^{2} - 2x_{i}\overline{x} + {\overline{x}}^{2} \right) = k \sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)^{2}$$

Now let's deal with the second part. We factor out x\overline{x} from the sum:

$$k \sum_{i = 1}^{n} \left( x_{i}\overline{x} - {\overline{x}}^{2} \right) = k\overline{x}\sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)$$

And now comes that very trick from statistics. $\left( x_{i} - \overline{x} \right)$ is the deviation from the mean. And the most interesting thing is that the sum of such deviations is always zero (this follows directly from the definition of the mean: $n\overline{x} = \sum_{i = 1}^{n} x_{i}$):

$$\sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right) = 0$$

So this whole piece vanishes, and of the first term of the equation only $k \sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)^{2}$ remains.

Let's return to the original equation and process its second part in the same way:

$$\sum_{i = 1}^{n} \left( \left( \overline{y} - y_{i} \right) \cdot x_{i} \right) = \sum_{i = 1}^{n} \left( \left( \overline{y} - y_{i} \right) \cdot \left( x_{i} - \overline{x} + \overline{x} \right) \right) = \sum_{i = 1}^{n} \left( \overline{y} - y_{i} \right)\left( x_{i} - \overline{x} \right) + \overline{x}\sum_{i = 1}^{n} \left( \overline{y} - y_{i} \right)$$

The second term will also become zero, and we get the final equation:

$$k \sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)^{2} + \sum_{i = 1}^{n} \left( \overline{y} - y_{i} \right)\left( x_{i} - \overline{x} \right) = 0$$

From here, moving the second term to the right, changing its sign, and dividing, we get:

$$k \sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)^{2} = \sum_{i = 1}^{n} \left( y_{i} - \overline{y} \right)\left( x_{i} - \overline{x} \right)$$

$$k = \frac{\sum_{i = 1}^{n} \left( y_{i} - \overline{y} \right)\left( x_{i} - \overline{x} \right)}{\sum_{i = 1}^{n} \left( x_{i} - \overline{x} \right)^{2}}$$
