
What Do Trees Think About?
In this article, we will start exploring the classification problem. And the first algorithm will be the most intuitively understandable one — decision trees.
This article is part of a series on the fundamentals of machine learning.
- Part 1. About Machine Learning in Simple Terms
- Part 2. Linear Regression as Simple as It Gets
- Part 3. What Do Trees Think About?
In the previous article, we started exploring the simplest machine learning methods using linear regression as an example. Today we will look at another model, but this time for a classification task.
Classification vs Regression
Let's start by recalling the first article and formulate the problem for ourselves. The input data is the same as before: a number or a vector of numbers, which we will denote as . But the output data — — will now change. Previously, it was simply a number. For linear regression, there were no restrictions on either the values or the range. Now, will be able to take not just any value, but only one of the predefined ones. Such values will be called classes. We also have known correct answers for our data, which we will denote as (without the hat).
Let's return to the example from the previous article about predicting weight from height. If before we were predicting the exact weight in kilograms, now we would rather be in the role of judges before boxing matches. That is, we don't care about the exact weight, but rather about the category (light, middle, heavy, and so on).
Another option is binary classification, where there are only two classes. This usually means we are answering some question with "yes" or "no". For example, using the same height and weight data, we could try to answer the question "Is this person taller/heavier than the city average?". This is exactly the binary classification we will focus on.
How to Classify?
So, we have an input vector and only two possible answers. Let's denote the answer "yes" as one, and the answer "no" as zero.
Let's take the simplest case where is a single number. Linear regression won't work here because it produces an answer from a continuous set (spoiler: there is a way to make regression solve our current task, but that's a topic for the next article). So let's come up with something else for now.
Let's visualize our data. It's one-dimensional, so it's easy to display on a number line. Let's mark with blue dots the data for which the correct answer is 0, and with red dots those for which the answer is 1.

In this case, everything is simple. We can choose some value on the line, and if our input number is less than this value, the answer is 0, and if it's greater — 1. How do we choose such a number? Let's try all possible ways to split the data. Usually, the split is made at the midpoint between two points. This gives us 8 options:

Now let's check how many correct answers each option yields. Obviously, in this particular case, we have an option that gives all correct answers:

But this doesn't always happen. Let's change the initial conditions:

In this case, we cannot perfectly separate our data. But we can minimize the error. If we split the data at the same point as before, we get 7 correct answers out of 8. And this is the best we can achieve with a single split.
Building Trees
But there are cases where it's impossible to achieve reasonable quality with just one split. For example:

Here, with one split, we get at most 5 correct answers out of 8, which is not very good. So we need to come up with something else.
The solution seems intuitive: split not once, but multiple times. For example, we can first split like this:

After this split, we consider that all objects below this value have class 1. The remaining part is split once more (the points filled in white were already classified in the previous step, so they can be ignored):

As a result, we get an algorithm that we need to execute to classify any element:

This algorithm is called a decision tree. A tree has a root — this is where the algorithm starts. The points where it ends, that is, the places where we get a class (in the figure above, these are the colored numbers — class labels) are called the leaves of the tree. And all intermediate points (rectangles in the figure) are called nodes. It is the node that describes the splitting of values into two parts.
Now we need to understand how to build such a tree. So, at the very beginning, there are no nodes or leaves. We need to make the first split of the data, that is, create a node. Again, we select points — candidates for splitting. These will be the midpoints of the segments between neighboring data points (just like in the simplest case example):

Now we need to determine which candidate is the most suitable. And simply counting the number of correct answers won't work here. The thing is, if we try to maximize the number of correct answers at the first step (in programming, this is called a greedy algorithm), we might make the task harder at subsequent steps. That is, we need to optimize the entire solution, not just the first step.
So we'll have to take a slightly more complex approach. We will still iterate through all candidates one by one, but the metric will be different. First, we count how many elements end up to the right and to the left of our split:
| Point | ||
|---|---|---|
| A | 1 | 7 |
| B | 2 | 6 |
| C | 3 | 5 |
| D | 4 | 4 |
| E | 5 | 3 |
| F | 6 | 2 |
| G | 7 | 1 |
Let's set these values aside for now, they will be useful a bit later. Now we need to calculate how well the data is separated on the left and on the right. In other words, how effectively we've divided the objects into two classes. For this, we'll use the Gini criterion. It's calculated using the formula
where and are the proportions of classes zero and one in our split.
Suppose we have three zeros and one one on the left. Then
The lower the criterion, the better. For example, if there are only ones or only zeros on the left, the criterion value will be zero. This means that on this side, we have perfectly separated one of the classes.
Let's calculate the criteria for all our candidates:
| Point | ||
|---|---|---|
| A | 0 | 0.41 |
| B | 0 | 0.44 |
| C | 0 | 0.48 |
| D | 0.375 | 0.375 |
| E | 0.48 | 0 |
| F | 0.44 | 0 |
| G | 0.41 | 0 |
Now we just need to calculate the final metric for each candidate. It's calculated as a weighted sum. Remember that we counted the number of elements on the right and left. The weights will be exactly these counts divided by the total number of elements . We have 8 elements in total, so . The final formula:
Calculating for all our candidates:
| Point | M |
|---|---|
| A | 0.36 |
| B | 0.33 |
| C | 0.3 |
| D | 0.375 |
| E | 0.3 |
| F | 0.33 |
| G | 0.36 |
The best metric is for points C and E. We can split at either one, let's split at C.

Now let's see what we got. On the left, there's only class 1. This means we have a leaf here and we don't split further. On the right, there are different classes. This means it's not a leaf, but another node. We repeat the splitting for this node:

After calculations, we'll see that we need to split at E. On the left, we'll have class 0. On the right, class 1. That is, all objects are assigned to classes. We add two leaves. The tree is built.
The final algorithm (initially the tree is empty):
- Add a node at the root
- Perform a split at the node
- If all objects on the left are of the same class, create a leaf, otherwise create a node
- If all objects on the right are of the same class, create a leaf, otherwise create a node
- Repeat steps 2-4 as long as there is at least one node in the tree
Once the tree is built, for each new object we simply traverse it from the root until we reach a leaf. The class in the leaf will be the class for our object.
More Dimensions
We've learned how to build trees for one-dimensional input data, that is, for single numbers. For example, using the tree described in the previous chapter, we could try to determine whether we're looking at an apple or a pear based on a single parameter. For instance, by diameter. However, such classification would be very inaccurate. Diameter alone says almost nothing about whether it's a pear or an apple.

In all the figures below, we'll mark apples in red and pears in blue. From the figure, you can see that it's impossible to split the data into groups of more than one element.
The situation changes dramatically if we add a second dimension. Now each fruit will be described not by one parameter, but by two: width (or diameter) and height. Adding a second parameter will allow us to distinguish apples from pears much more effectively.

Notice that nothing has changed along the X axis. But now the data is very well separated along the Y axis.
If a human were doing the classification, they would look at the elongation (the difference between height and width). To do something similar with machine learning, we can again use decision trees. We just need to figure out how to build a tree for our case.
In fact, the algorithm will be almost the same. At each step, we split the data into two parts. The main difference is that before, we didn't need to choose which parameter to split by, since there was only one. Now there are two parameters, and first we need to decide which one to use for splitting. After that, everything reduces to choosing the split point, which we already learned to do in the previous chapter.
What will such splits look like on a graph? Before, we had only one axis, and we simply placed a point, to the left of which all objects go to one branch of the tree, and to the right — to another. Now we have two axes. And the first step is choosing which axis to split by. If we choose the X axis, it will be a vertical line; if Y — a horizontal line.

Orange and green colors show splits along different features (axes).
At the second stage, we choose exactly where such a line will pass. After that, objects on different sides of the line (above/below or left/right) go to different branches of the tree.
It remains to figure out how exactly to choose the axis. In fact, everything is very simple. We now iterate not only through splitting candidates for one feature, but for all of them. And we choose the best among all options. And when at the second step we need to split along the chosen axis, we already know where to do it, since we calculated that at the first step. If we visualize the tree's operation on the graph, we get this result:

The final algorithm will look like this:
- Add a node at the root
- Find the best split for each feature
- Perform the split at the node using the best feature
- If all objects on the left are of the same class, create a leaf, otherwise create a node
- If all objects on the right are of the same class, create a leaf, otherwise create a node
- Repeat steps 2-5 as long as there is at least one node in the tree
If you look at this algorithm, it becomes clear that nothing prevents us from working with higher-dimensional data. For example, for three-dimensional data, we would separate it not with a line, but with a plane. This can still be visualized. But for four or more dimensions, there is no clear visualization, since the data would be separated by a hyperplane.
Overfitting and What to Do About It
So far, we've been building the tree until all objects are correctly classified. At first glance, this seems like the only right approach — after all, we don't want to make errors. But this is only at first glance. Perfectionism in machine learning does more harm than good.
Let's return to the apple and pear example and imagine that a defective elongated apple made it into our training set. The decision tree, if left as is, will build branches to correctly classify this object as an apple, even though it's surrounded by pears. As a result, when we use this tree for classification, there's a risk that a normal pear will be classified as an apple.

What can we do about this? We need to make the tree ignore single unusual examples. There are several ways to do this:
- Limit the tree depth. For example, we can allow no more than three splits. After three splits, the outermost nodes become leaves.
- Limit the minimum number of data points in a leaf. For example, if a node has four or fewer objects, it automatically becomes a leaf and is not split further.
- Limit the total number of leaves. If the limit is reached, no new splits occur, and all outermost nodes become leaves.
Each of these methods can help combat overfitting. But there is no universal recipe. Usually, a combination of methods is applied, and parameters are tuned experimentally.
For such experiments (and not only for decision trees, but for many other machine learning models as well), having a test set is very useful. What is it? Before training begins, we set aside 10-20 percent of the data as the test set. And we don't use it during training. This is a very important point — the test data must never be seen by the model during training.
When training is complete, we run the test data through the model and see how well it performs. If the results on the training data are very good but poor on the test data, this is most likely overfitting, and measures need to be taken. Usually, the goal is for the percentage of correct answers on the training and test sets to be roughly the same.
Questions and Answers
In regression, the model predicts a continuous number from an arbitrary range (for example, exact weight in kilograms). In classification, the model chooses from predefined options — classes. If there are only two classes, it's binary classification (a "yes" or "no" answer).
It's an algorithm that represents a sequence of checks. A tree consists of three types of elements: a root — the starting point where the algorithm begins; nodes — intermediate points where the data is split into two parts; leaves — endpoints that contain the final class. To classify a new object, you traverse from the root through the nodes, performing checks, until you reach a leaf.
The Gini criterion shows how "pure" the data split is at a node. It's calculated using the formula , where and are the proportions of classes 0 and 1 among the objects. The lower the value, the better the split: means all objects on one side belong to the same class.
This approach is called a greedy algorithm — it optimizes only the current step without considering subsequent ones. The best split at the first stage may lead to poor quality at later stages. That's why a weighted sum of Gini criteria is used: it accounts not only for the "purity" of the split but also for the number of objects on each side.
The algorithm is almost the same. The only difference: at each step, we need to not only choose a split point but also determine which feature to split by. For two features, a split looks like a vertical or horizontal line on a graph. For three features — a plane, and for more — a hyperplane. Candidates are evaluated across all features, and the best option is selected.
Overfitting occurs when a model adapts too closely to the training data, including noise and outliers. For example, if an atypical object appears in the training set, the tree will create separate branches to classify it. As a result, the model will make errors on new data because it has learned the peculiarities of a specific dataset rather than the general pattern.
The main approaches are: limiting tree depth (maximum number of splits), limiting the minimum number of objects in a leaf (a node with few elements is not split further), and limiting the total number of leaves. In practice, several methods are usually combined, and parameters are tuned experimentally using a test set.
A test set is a portion of the data (usually 10–20%) that the model never sees during training. After training, we evaluate the model's quality on this data. If the model performs well on the training data but poorly on the test data, this is a sign of overfitting. Ideally, the quality on both sets should be roughly the same.
Real-Time Character Animation with Machine Learning - An Overview of PFNN, MANN, and LMM
An in-depth look at modern machine learning-based real-time character animation technologies. We examine the PFNN, MANN, and LMM architectures, their principles and pipelines, data requirements, and network structures.
Linear Regression as Simple as It Gets
In this article, we will look at one of the simplest machine learning algorithms, namely — linear regression.