
Linear Regression as Simple as It Gets
In this article, we will look at one of the simplest machine learning algorithms, namely — linear regression.
This article is part of a series on the fundamentals of machine learning.
In the previous article, we looked at machine learning in general, without going into the details of how it works. In this article, we'll start exploring specific algorithms. And we'll begin with what I consider the simplest model — linear regression.
Understanding the Terminology
First, let's figure out what problem we're solving and how. So, we're solving a regression problem. We discussed what this kind of problem is in the previous article, but just in case, here's a reminder. We have some data $x$. It can be a single number or several numbers (such a set of numbers is called a vector). For this $x$, we want to obtain a single value as the output, which we denote as $y$. This value is continuous and can be in any range.
As an example of such a problem, consider determining a person's weight based on their height. Then our input data is a single number — the person's height in centimeters. The output is also a single number — weight in kilograms. How do we calculate this value?
The second word in the method's name gives us a hint. Our regression is linear. What does this mean? It means that to determine the output value, we'll use the formula of a straight line. Here it's worth recalling the school curriculum, which states that a straight line is defined by the following expression:
$$y = kx + b$$
This formula is easy to adapt for our case. Simply denote $y$ as the person's weight, and $x$ as their height. Actually, we need to slightly adjust the notation. Usually, $y$ denotes the correct answer, that is, in our case, the person's actual weight. We, however, are trying to make a prediction, and our result, as mentioned above, will be denoted not as $y$, but as $\hat{y}$. But let's return from notation to the essence. All we need to do is find the coefficients $k$ and $b$, and we can approximately solve the problem.
But the solution won't be very accurate (and most likely will be very inaccurate). The thing is, height and weight are not directly related by a linear dependency. There are tall and thin people, and there are short and heavy people. Such a simple formula will work poorly. To increase accuracy, we need to make the model more complex. Let's add waist circumference. Now our input data is no longer a single number $x$, but a vector $(x_1, x_2)$: $x_1$ denotes height, and $x_2$ denotes waist circumference. Our output data remains the same: weight in kilograms. The formula changes, but not radically:
$$\hat{y} = k_1 x_1 + k_2 x_2 + b$$
For successful training, we now need to find not two, but three parameters: $k_1$, $k_2$, and $b$. To further improve accuracy, we can keep adding input data, for example, leg length, chest circumference, and so on. Moreover, each new number in the input data brings with it a new model parameter that we'll need to find. In general form, if we have many parameters (let's denote this "many" as $m$), the formula will look like this:
$$\hat{y} = k_1 x_1 + k_2 x_2 + \dots + k_m x_m + b$$
or
$$\hat{y} = \sum_{j=1}^{m} k_j x_j + b$$
All that's left is to figure out how to find the parameters $k_j$ and $b$.
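Before moving on, here is what this prediction formula looks like in code. This is a minimal sketch in plain Python; the function name and the coefficients are made up purely for illustration:

```python
def predict(x, k, b):
    """Compute the model's prediction: a weighted sum of the inputs plus the shift b."""
    return sum(k_j * x_j for k_j, x_j in zip(k, x)) + b

# Height (cm) and waist circumference (cm) with made-up coefficients.
print(predict([180, 90], k=[0.8, 0.5], b=-50))  # -> 139.0
```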
Learning to Learn
First, let's return to the simplest case where we have one parameter. In this form, our formula looks like this:
$$\hat{y} = kx + b$$
As mentioned above, training consists of finding the coefficients $k$ and $b$. What should these coefficients be? We want our answers $\hat{y}$ to be as close as possible to the correct answers $y$. What does "as close as possible" mean? The difference between $y$ and $\hat{y}$ should be minimal. This can be written as
$$\left|y - \hat{y}\right| \to \min$$
Working with absolute values is inconvenient, so it's better to use the square instead. Mathematically, this is equivalent (if the absolute value tends to zero, then the square tends to zero as well):
$$\left(y - \hat{y}\right)^2 \to \min$$
If we had only one example, the problem would be trivial: choose any pair of $k$ and $b$ such that $kx + b = y$. There are infinitely many such pairs, which follows from geometric considerations: infinitely many lines can be drawn through a single point. But we want our model to work not for just one pair of values, but for all possible ones. That is, we need to draw the line so that it is on average as close as possible to all our points:
$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \to \min$$
We can ignore the denominator, since dividing by the constant $n$ doesn't change where the minimum is. Now let's recall how $\hat{y}$ is calculated and substitute it in:
$$\sum_{i=1}^{n}\left(y_i - \left(kx_i + b\right)\right)^2 \to \min$$
Optimizing such a function is called the least squares method.

In the illustration, blue dots represent the actual values, red dots are the model's predictions, and the red line is the graph of the function $\hat{y} = kx + b$. We try to minimize the sum of the distances between the red and blue dots (dashed lines).
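To make this concrete in code, here is a small sketch that computes exactly this sum of squared distances for a candidate line. The function name and the numbers are purely illustrative:

```python
def squared_error(xs, ys, k, b):
    """Sum of squared differences between the correct answers and the predictions k*x + b."""
    return sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))

# The smaller this value, the better the candidate line fits the points.
print(squared_error([154, 196, 172], [43, 107, 73], k=1.3, b=-150))
```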
Optimizing
Let's look at our function:
$$L(k, b) = \sum_{i=1}^{n}\left(y_i - \left(kx_i + b\right)\right)^2$$
What does it depend on? The input data $x_i$ and the correct answers $y_i$ are fixed. Therefore, the sum depends only on our parameters $k$ and $b$. It turns out we have a function of two variables, and we need to find its minimum. If we draw the surface described by this function, we get the following picture:

If we recall school mathematics again, the extremum (minimum or maximum) of a function is found where the derivative equals zero (if such a point exists for the function). This is exactly our case: from the graph, we see that there is only one extremum, and it will be the minimum.
The most important thing is that we can compute the derivative separately for each of our variables. The derivative of a sum is the sum of the derivatives of its terms, so it's enough to differentiate each term. Let's recall the chain rule:
$$\left(f\left(g(x)\right)\right)' = f'\left(g(x)\right) \cdot g'(x)$$
First, we compute the derivative with respect to $b$:
$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n} 2\left(y_i - \left(kx_i + b\right)\right) \cdot (-1) = -2\sum_{i=1}^{n}\left(y_i - kx_i - b\right)$$
Let's remove the minus in front of the sum by simply swapping the expressions in parentheses:
$$\frac{\partial L}{\partial b} = 2\sum_{i=1}^{n}\left(kx_i + b - y_i\right)$$
Since we no longer have a square, we can expand the expression into several sums:
$$\frac{\partial L}{\partial b} = 2\left(k\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} b - \sum_{i=1}^{n} y_i\right)$$
Note that $\sum_{i=1}^{n} b$ is simply adding $b$ to itself $n$ times, that is, $nb$.
Now recall that we're looking for the point where the derivative equals zero. So let's set our expression equal to zero:
$$2\left(k\sum_{i=1}^{n} x_i + nb - \sum_{i=1}^{n} y_i\right) = 0$$
Cancel out the twos and rearrange the terms to express $b$:
$$nb = \sum_{i=1}^{n} y_i - k\sum_{i=1}^{n} x_i$$
Now let's compute the derivative with respect to $k$ in the same way:
$$\frac{\partial L}{\partial k} = -2\sum_{i=1}^{n} x_i\left(y_i - kx_i - b\right)$$
Rearrange the terms to remove the minus:
$$\frac{\partial L}{\partial k} = 2\sum_{i=1}^{n} x_i\left(kx_i + b - y_i\right)$$
Set it equal to zero and cancel the two:
$$\sum_{i=1}^{n} x_i\left(kx_i + b - y_i\right) = 0$$
As a result, we get a system of two equations:
$$\begin{cases} \displaystyle\sum_{i=1}^{n} x_i\left(kx_i + b - y_i\right) = 0 \\[2ex] \displaystyle nb = \sum_{i=1}^{n} y_i - k\sum_{i=1}^{n} x_i \end{cases}$$
This system can be simplified a bit. Let's look at the second equation and divide both sides by $n$:
$$b = \frac{\sum_{i=1}^{n} y_i}{n} - k\,\frac{\sum_{i=1}^{n} x_i}{n}$$
What is $\frac{\sum_{i=1}^{n} x_i}{n}$? It's the mean value of $x$ (because we're dividing the sum of all elements by their count). Let's denote it as $\bar{x}$. We'll do the same for $y$. As a result, we get:
$$b = \bar{y} - k\bar{x}$$
You must admit, without all those sums it looks much simpler. Now let's do what we did in school: substitute $b$ into the first equation.
$$\sum_{i=1}^{n} x_i\left(kx_i + \bar{y} - k\bar{x} - y_i\right) = 0$$
Let's simplify everything we can and expand the brackets:
$$k\sum_{i=1}^{n} x_i\left(x_i - \bar{x}\right) - \sum_{i=1}^{n} x_i\left(y_i - \bar{y}\right) = 0$$
In principle, this is already enough to calculate everything. We can simply move the second term to the other side and divide:
$$k = \frac{\sum_{i=1}^{n} x_i\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n} x_i\left(x_i - \bar{x}\right)}$$
The formula can be simplified further. I'll show exactly how at the end of the article, so as not to add even more formulas here. As a result, we get:
$$k = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad b = \bar{y} - k\bar{x}$$
Thus, for any set of $x$ and $y$, we can calculate our coefficients. After all, $k$ doesn't depend on $b$: we calculate $k$ using its formula, and then, knowing $k$, we find $b$. And now we have our first working machine learning algorithm.
Now that we have all our coefficients, we can easily calculate the answer for any input parameters using the same formula:
$$\hat{y} = kx + b$$
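If we translate these formulas directly into code, training boils down to a few sums. Below is a minimal sketch in plain Python; the function name is illustrative and not taken from any library:

```python
def fit_line(xs, ys):
    """Return (k, b) minimizing the sum of squared errors for y ≈ k*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # k = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
    k = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    # Knowing k, b follows directly from the means.
    b = y_mean - k * x_mean
    return k, b
```

Training here is literally a couple of passes over the data, which is a big part of why linear regression is so fast.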
Example Usage
Let's return to the problem of determining a person's weight based on their height. Suppose we have the following dataset:
| Height (x), cm | Weight (y), kg |
|---|---|
| 154 | 43 |
| 196 | 107 |
| 172 | 73 |
| 185 | 80 |
| 161 | 66 |
First, let's find the mean values. For $x$ it will be $\bar{x} = \frac{154 + 196 + 172 + 185 + 161}{5} = 173.6$, for $y$ it will be $\bar{y} = \frac{43 + 107 + 73 + 80 + 66}{5} = 73.8$. Now we need to find the coefficients. Let's start with $k$. For each element, we calculate the deviations $x_i - \bar{x}$ and $y_i - \bar{y}$:
| $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ |
|---|---|---|---|
| 154 | 43 | -19.6 | -30.8 |
| 196 | 107 | 22.4 | 33.2 |
| 172 | 73 | -1.6 | -0.8 |
| 185 | 80 | 11.4 | 6.2 |
| 161 | 66 | -12.6 | -7.8 |
Now we have everything to calculate $k$ using the formula:
$$k = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
As a result, we get
$$k = \frac{603.68 + 743.68 + 1.28 + 70.68 + 98.28}{384.16 + 501.76 + 2.56 + 129.96 + 158.76} = \frac{1517.6}{1177.2} \approx 1.29$$
Now let's calculate $b$:
$$b = \bar{y} - k\bar{x} = 73.8 - 1.29 \cdot 173.6 \approx -150.1$$
Now that we have all the coefficients, we can predict the weight for any height. Let's say we have a basketball player who is 210 centimeters tall. Let's try to predict their weight:
$$\hat{y} = 1.29 \cdot 210 - 150.1 = 120.8$$
Looks quite plausible.
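If you'd like to double-check the arithmetic, NumPy's `polyfit` with degree 1 performs exactly this kind of least-squares line fit (the decimals may differ slightly from the rounded values above):

```python
import numpy as np

heights = np.array([154, 196, 172, 185, 161])
weights = np.array([43, 107, 73, 80, 66])

k, b = np.polyfit(heights, weights, deg=1)  # least-squares fit of a straight line
print(k, b)          # approximately 1.29 and -150
print(k * 210 + b)   # predicted weight for a 210 cm person, roughly 121 kg
```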
Generalizing
Now that we know how to find coefficients for the one-dimensional case, let's try to do it for the multidimensional case. Then our main formula will be:
$$\hat{y} = k_1 x_1 + k_2 x_2 + \dots + k_m x_m + b$$
And the function we'll optimize:
$$L = \sum_{i=1}^{n}\left(y_i - \left(k_1 x_{i,1} + k_2 x_{i,2} + \dots + k_m x_{i,m} + b\right)\right)^2$$
The optimization itself follows the same principle. We compute a derivative with respect to each coefficient $k_j$, and one derivative with respect to $b$. As a result, we get a system of $m + 1$ equations with $m + 1$ unknowns.
Solving such a system is already beyond the scope of school mathematics, but it's entirely feasible. As a result, we'll find all our coefficients in the same way.
If in the case of one argument our function defined a straight line, in the case of two arguments it will be a plane, and with more arguments — a hyperplane.
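For the multidimensional case, solving the system by hand quickly becomes tedious, so in practice one would hand it to a linear-algebra routine. Here's a sketch using NumPy's least-squares solver; the two-feature dataset is made up purely for illustration:

```python
import numpy as np

# Each row is one person: [height_cm, waist_cm]; the numbers are made up for illustration.
X = np.array([[154, 70], [196, 95], [172, 80], [185, 88], [161, 75]], dtype=float)
y = np.array([43, 107, 73, 80, 66], dtype=float)

# Append a column of ones so that the last coefficient plays the role of b.
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of X_b @ coeffs ≈ y.
coeffs, *_ = np.linalg.lstsq(X_b, y, rcond=None)
*k, b = coeffs
print(k, b)  # one coefficient per feature, plus the shift b
```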
Not Everything Is So Rosy
It would seem that if training is so fast and simple (just calculate a few formulas), why isn't linear regression used everywhere? The answer lies in the word "linear." And now let's try to figure out why.
Let's consider the simplest case again. The dependence of $\hat{y}$ on $x$ is linear. This means we can make good predictions only if the real-world dependency between the data is linear or close to it. If there are significant deviations from a linear relationship, the linear regression model will still produce some result, but it will differ substantially from reality.

In the image above, we see a nonlinear distribution of real data (blue dots). The model was trained, trying to minimize the distances between the real data and the predictions (red dots). But even though the distance is the minimum possible for this case, the model's prediction can be either relatively accurate (2nd and 4th points) or very inaccurate depending on the region.
Variants with two or more arguments are also subject to this limitation (but instead of a straight line, it's a plane or hyperplane).
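A quick way to see the problem is to fit a line to data that is deliberately nonlinear; the synthetic data below is invented just for this demonstration:

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = x ** 2                      # a deliberately nonlinear relationship

k, b = np.polyfit(x, y, deg=1)  # the "best" straight line still exists...
print(np.abs(y - (k * x + b)).max())  # ...but in places it misses the data badly
```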
It is precisely because of this characteristic that linear regression is used relatively rarely. And it's only applied after verifying that the data indeed has a linear relationship. However, when such a relationship exists, we get a very fast tool — much faster than any other machine learning model.
For example, linear regression is very often used in economics, where the linearity of the relationship is well known: forecasting sales, assets, GDP, and so on. It is also used by banks and insurance companies. In this field, it's important not only to make a prediction but also to explain it. And linear regression is the best fit for this. All of its parameters are visible and understandable. Unlike, for example, neural networks, which, as we'll see in the following articles of this series, are a "black box" to an outside observer.
Afterword
As promised, I'm showing how the coefficient formulas can be simplified. For this, we'll need a couple of mathematical tricks and some knowledge of statistics.
Let me remind you of the original expression:
$$k\sum_{i=1}^{n} x_i\left(x_i - \bar{x}\right) - \sum_{i=1}^{n} x_i\left(y_i - \bar{y}\right) = 0$$
We'll consider only its first term. Let's add the mean value $\bar{x}$ to $x_i$, and immediately subtract it for compensation:
$$k\sum_{i=1}^{n}\left(x_i - \bar{x} + \bar{x}\right)\left(x_i - \bar{x}\right)$$
Expand the brackets:
$$k\sum_{i=1}^{n}\left[\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right) + \bar{x}\left(x_i - \bar{x}\right)\right]$$
Divide the sum into two parts:
$$k\left(\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right) + \sum_{i=1}^{n}\bar{x}\left(x_i - \bar{x}\right)\right)$$
In the first part, we see the formula for the square of a difference:
$$k\left(\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 + \sum_{i=1}^{n}\bar{x}\left(x_i - \bar{x}\right)\right)$$
Now let's deal with the second part. We factor $\bar{x}$ out of the sum:
$$k\left(\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 + \bar{x}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\right)$$
And now comes that very trick from statistics. $x_i - \bar{x}$ is the deviation from the mean. And the most interesting thing is that the sum of such deviations is always zero:
$$\sum_{i=1}^{n}\left(x_i - \bar{x}\right) = \sum_{i=1}^{n} x_i - n\bar{x} = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i = 0$$
So in this part of the equation, only one term remains:
$$k\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
Let's return to the original equation and process its second part in the same way:
$$\sum_{i=1}^{n}\left(x_i - \bar{x} + \bar{x}\right)\left(y_i - \bar{y}\right) = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right) + \bar{x}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)$$
The second term will also become zero, and we get the final equation:
$$k\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 - \sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right) = 0$$
From here, moving the second term to the right, changing its sign, and dividing, we get:
$$k = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
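If the algebra doesn't feel convincing, the two forms of the formula can also be compared numerically. Here's a small sketch using the dataset from the example above:

```python
xs = [154, 196, 172, 185, 161]
ys = [43, 107, 73, 80, 66]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n

# The form before the simplification.
k_raw = sum(x * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum(x * (x - x_mean) for x in xs)

# The simplified form with deviations from the means.
k_simple = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
           / sum((x - x_mean) ** 2 for x in xs)

print(abs(k_raw - k_simple) < 1e-9)  # True: both formulas give the same k
```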
Questions and Answers
What does the word "linear" in the name mean?
It means the model assumes a linear (straight-line) relationship between the input data and the output. The prediction formula is the equation of a straight line (or a plane/hyperplane equation for multiple features). If the real data follows a more complex pattern, the model will produce inaccurate results.
Why do we minimize the square of the difference instead of the absolute value?
The absolute value is mathematically inconvenient: it has no derivative at zero, which prevents using standard optimization methods. The squared difference has a derivative everywhere and serves the same purpose: the larger the error, the larger the function's value. Additionally, the square penalizes large errors more heavily, which is often useful in practice.
What is the least squares method?
It's a way of fitting model parameters by minimizing the sum of squared differences between the correct answers and the model's predictions. Geometrically, it means drawing a straight line (or plane) so that the total distance from it to all data points is as small as possible.
Why does setting the derivatives to zero give the minimum?
From school mathematics, we know that the extremum (minimum or maximum) of a function is found where the derivative equals zero. The loss function of linear regression is a paraboloid, a surface with a single minimum. Therefore, the point where the derivative equals zero is guaranteed to give us a minimum, not a maximum.
How does the model work when there is more than one input feature?
The principle remains the same, only the formula expands. Instead of $\hat{y} = kx + b$, we get $\hat{y} = k_1 x_1 + k_2 x_2 + \dots + k_m x_m + b$, where each feature $x_j$ has its own coefficient $k_j$. For two features, the model builds a plane; for more features, a hyperplane. The training process is analogous: compute derivatives with respect to each coefficient, set them equal to zero, and solve the resulting system of equations.
Why is linear regression used relatively rarely?
The main limitation is that the model can only describe linear relationships well. In real-world data, the connection between features and the target is often nonlinear, and a straight line (or plane) won't be able to capture it. That's why linear regression is used only where the relationship is indeed close to linear, and always after verifying the data. But in such cases, it works significantly faster than more complex models.
What is a vector in this context?
A vector is simply an ordered set of numbers. For example, if we're determining a person's weight from their height and waist circumference, the input data consists of two numbers: $(x_1, x_2)$, where $x_1$ is height and $x_2$ is waist circumference. This is a two-element vector. The more features we use, the more elements the vector has and the more coefficients we need to find during training.
Why does the model need the coefficient $b$?
Without $b$, the formula would look like $\hat{y} = kx$, meaning the line would always pass through the origin. This severely limits the model, since not all relationships pass through the point $(0, 0)$. The coefficient $b$ allows shifting the line up or down, making the model much more flexible and accurate.