[{"data":1,"prerenderedAt":1431},["ShallowReactive",2],{"article-id-en-ml-basic-3":3},{"id":4,"title":5,"body":6,"description":1195,"extension":1415,"meta":1416,"navigation":1425,"path":37,"seo":1426,"stem":1429,"__hash__":1430},"content/en/blog/ml-basic-3.mdx","Ml Basic 3",{"type":7,"value":8,"toc":1407},"minimark",[9,1184],[10,11,12,16,39,42,47,139,142,145,149,166,183,186,193,196,202,205,211,214,220,223,227,230,236,239,242,248,251,256,259,265,280,283,289,292,295,462,465,519,560,563,657,660,663,792,840,955,958,1019,1022,1028,1031,1037,1040,1043,1061,1064,1068,1071,1077,1080,1083,1089,1092,1095,1098,1101,1107,1110,1113,1116,1122,1125,1142,1145,1149,1152,1155,1161,1164,1175,1178,1181],"section-md",{},[13,14,15],"p",{},"This article is part of a series on the fundamentals of machine learning.",[17,18,19,27,33],"ul",{},[20,21,22],"li",{},[23,24,26],"a",{"href":25},"/en/blog/ml-basic-1","Part 1. About Machine Learning in Simple Terms",[20,28,29],{},[23,30,32],{"href":31},"/en/blog/ml-basic-2","Part 2. Linear Regression as Simple as It Gets",[20,34,35],{},[23,36,38],{"href":37},"/en/blog/ml-basic-3","Part 3. What Do Trees Think About?",[13,40,41],{},"In the previous article, we started exploring the simplest machine learning\nmethods using linear regression as an example. Today we will look at another\nmodel, but this time for a classification task.",[43,44,46],"h2",{"id":45},"classification-vs-regression","Classification vs Regression",[13,48,49,50,73,74,97,98,115,116,120,121,124,125,138],{},"Let's start by recalling the first article and formulate the problem for\nourselves. The input data is the same as before: a number or a vector of\nnumbers, which we will denote as ",[51,52,55],"span",{"className":53},[54],"katex",[56,57,59],"math",{"xmlns":58},"http://www.w3.org/1998/Math/MathML",[60,61,62,69],"semantics",{},[63,64,65],"mrow",{},[66,67,68],"mi",{},"X",[70,71,68],"annotation",{"encoding":72},"application/x-tex",". 
But the output data — ",[51,75,77],{"className":76},[54],[56,78,79],{"xmlns":58},[60,80,81,94],{},[63,82,83],{},[84,85,87,90],"mover",{"accent":86},"true",[66,88,89],{},"y",[91,92,93],"mo",{"stretchy":86},"^",[70,95,96],{"encoding":72},"\\widehat{y}"," —\nwill now change. Previously, it was simply a number. For linear regression,\nthere were no restrictions on either the values or the range. Now,\n",[51,99,101],{"className":100},[54],[56,102,103],{"xmlns":58},[60,104,105,113],{},[63,106,107],{},[84,108,109,111],{"accent":86},[66,110,89],{},[91,112,93],{"stretchy":86},[70,114,96],{"encoding":72}," will be able to take not just any value, but only ",[117,118,119],"strong",{},"one of\nthe predefined ones",". Such values will be called ",[117,122,123],{},"classes",". We also have\nknown correct answers for our data, which we will denote as ",[51,126,128],{"className":127},[54],[56,129,130],{"xmlns":58},[60,131,132,136],{},[63,133,134],{},[66,135,89],{},[70,137,89],{"encoding":72}," (without the\nhat).",[13,140,141],{},"Let's return to the example from the previous article about predicting weight\nfrom height. If before we were predicting the exact weight in kilograms, now\nwe would rather be in the role of judges before boxing matches. That is, we\ndon't care about the exact weight, but rather about the category (light,\nmiddle, heavy, and so on).",[13,143,144],{},"Another option is binary classification, where there are only two classes.\nThis usually means we are answering some question with \"yes\" or \"no\". For\nexample, using the same height and weight data, we could try to answer the\nquestion \"Is this person taller/heavier than the city average?\". 
This is\nexactly the binary classification we will focus on.",[43,146,148],{"id":147},"how-to-classify","How to Classify?",[13,150,151,152,165],{},"So, we have an input vector ",[51,153,155],{"className":154},[54],[56,156,157],{"xmlns":58},[60,158,159,163],{},[63,160,161],{},[66,162,68],{},[70,164,68],{"encoding":72}," and only two possible answers. Let's denote\nthe answer \"yes\" as one, and the answer \"no\" as zero.",[13,167,168,169,182],{},"Let's take the simplest case where ",[51,170,172],{"className":171},[54],[56,173,174],{"xmlns":58},[60,175,176,180],{},[63,177,178],{},[66,179,68],{},[70,181,68],{"encoding":72}," is a single number. Linear regression\nwon't work here because it produces an answer from a continuous set (spoiler:\nthere is a way to make regression solve our current task, but that's a topic\nfor the next article). So let's come up with something else for now.",[13,184,185],{},"Let's visualize our data. It's one-dimensional, so it's easy to display on a\nnumber line. Let's mark with blue dots the data for which the correct answer\nis 0, and with red dots those for which the answer is 1.",[13,187,188],{},[189,190],"img",{"alt":191,"src":192},"Classification data visualization 1","/img/blog/ml-basic-3/image1.png",[13,194,195],{},"In this case, everything is simple. We can choose some value on the line, and\nif our input number is less than this value, the answer is 0, and if it's\ngreater — 1. How do we choose such a number? Let's try all possible ways to\nsplit the data. Usually, the split is made at the midpoint between two\npoints. This gives us 8 options:",[13,197,198],{},[189,199],{"alt":200,"src":201},"Classification data visualization 2","/img/blog/ml-basic-3/image2.png",[13,203,204],{},"Now let's check how many correct answers each option yields. 
Obviously, in\nthis particular case, we have an option that gives all correct answers:",[13,206,207],{},[189,208],{"alt":209,"src":210},"Classification data visualization 3","/img/blog/ml-basic-3/image3.png",[13,212,213],{},"But this doesn't always happen. Let's change the initial conditions:",[13,215,216],{},[189,217],{"alt":218,"src":219},"Classification data with different initial conditions","/img/blog/ml-basic-3/image4.png",[13,221,222],{},"In this case, we cannot perfectly separate our data. But we can minimize the\nerror. If we split the data at the same point as before, we get 7 correct\nanswers out of 8. And this is the best we can achieve with a single split.",[43,224,226],{"id":225},"building-trees","Building Trees",[13,228,229],{},"But there are cases where it's impossible to achieve reasonable quality with\njust one split. For example:",[13,231,232],{},[189,233],{"alt":234,"src":235},"Data where a single split is ineffective","/img/blog/ml-basic-3/image5.png",[13,237,238],{},"Here, with one split, we get at most 5 correct answers out of 8, which is\nnot very good. So we need to come up with something else.",[13,240,241],{},"The solution seems intuitive: split not once, but multiple times. For example,\nwe can first split like this:",[13,243,244],{},[189,245],{"alt":246,"src":247},"Data with multiple splits","/img/blog/ml-basic-3/image6.png",[13,249,250],{},"After this split, we consider that all objects below this value have class 1.\nThe remaining part is split once more (the points filled in white were\nalready classified in the previous step, so they can be ignored):",[13,252,253],{},[189,254],{"alt":246,"src":255},"/img/blog/ml-basic-3/image7.png",[13,257,258],{},"As a result, we get an algorithm that we need to execute to classify any\nelement:",[13,260,261],{},[189,262],{"alt":263,"src":264},"Data splitting algorithm visualization","/img/blog/ml-basic-3/image8.png",[13,266,267,268,271,272,275,276,279],{},"This algorithm is called a decision tree. 
A tree has a ",[117,269,270],"root"," — this is\nwhere the algorithm starts. The points where it ends, that is, the places\nwhere we get a class (in the figure above, these are the colored numbers —\nclass labels) are called the ",[117,273,274],"leaves"," of the tree. And all intermediate\npoints (rectangles in the figure) are called ",[117,277,278],"nodes",". It is the node that\ndescribes the splitting of values into two parts.",[13,281,282],{},"Now we need to understand how to build such a tree. So, at the very beginning,\nthere are no nodes or leaves. We need to make the first split of the data,\nthat is, create a node. Again, we select points — candidates for splitting.\nThese will be the midpoints of the segments between neighboring data points\n(just like in the simplest case example):",[13,284,285],{},[189,286],{"alt":287,"src":288},"Splitting candidates visualization","/img/blog/ml-basic-3/image9.png",[13,290,291],{},"Now we need to determine which candidate is the most suitable. Simply\ncounting the number of correct answers won't work here. If we only\nmaximize the number of correct answers at the first step (in programming,\nthis is called a greedy choice), we might make the task harder at\nsubsequent steps. We need a metric that rewards splits which leave purer\ngroups for the later steps.",[13,293,294],{},"So we'll have to take a slightly more complex approach. We will still iterate\nthrough all candidates one by one, but the metric will be different. 
First,\nwe count how many elements end up to the right and to the left of our split:",[296,297,298,388],"table",{},[299,300,301],"thead",{},[302,303,304,310,350],"tr",{},[305,306,307],"th",{},[117,308,309],{},"Point",[305,311,312],{},[51,313,315],{"className":314},[54],[56,316,317],{"xmlns":58},[60,318,319,347],{},[63,320,321,324,345],{},[91,322,323],{"fence":86},"∣",[325,326,327,331],"msub",{},[66,328,330],{"mathvariant":329},"bold","S",[63,332,333,336,339,342],{},[66,334,335],{"mathvariant":329},"l",[66,337,338],{"mathvariant":329},"e",[66,340,341],{"mathvariant":329},"f",[66,343,344],{"mathvariant":329},"t",[91,346,323],{"fence":86},[70,348,349],{"encoding":72},"\\left| \\mathbf{S}_{\\mathbf{left}} \\right|",[305,351,352],{},[51,353,355],{"className":354},[54],[56,356,357],{"xmlns":58},[60,358,359,385],{},[63,360,361,363,383],{},[91,362,323],{"fence":86},[325,364,365,367],{},[66,366,330],{"mathvariant":329},[63,368,369,372,375,378,381],{},[66,370,371],{"mathvariant":329},"r",[66,373,374],{"mathvariant":329},"i",[66,376,377],{"mathvariant":329},"g",[66,379,380],{"mathvariant":329},"h",[66,382,344],{"mathvariant":329},[91,384,323],{"fence":86},[70,386,387],{"encoding":72},"\\left| \\mathbf{S}_{\\mathbf{right}} \\right|",[389,390,391,403,414,425,435,444,453],"tbody",{},[302,392,393,397,400],{},[394,395,396],"td",{},"A",[394,398,399],{},"1",[394,401,402],{},"7",[302,404,405,408,411],{},[394,406,407],{},"B",[394,409,410],{},"2",[394,412,413],{},"6",[302,415,416,419,422],{},[394,417,418],{},"C",[394,420,421],{},"3",[394,423,424],{},"5",[302,426,427,430,433],{},[394,428,429],{},"D",[394,431,432],{},"4",[394,434,432],{},[302,436,437,440,442],{},[394,438,439],{},"E",[394,441,424],{},[394,443,421],{},[302,445,446,449,451],{},[394,447,448],{},"F",[394,450,413],{},[394,452,410],{},[302,454,455,458,460],{},[394,456,457],{},"G",[394,459,402],{},[394,461,399],{},[13,463,464],{},"Let's set these values aside for now; they will be useful a bit later. 
Now\nwe need to calculate how well the data is separated on the left and on the\nright. In other words, how effectively we've divided the objects into two\nclasses. For this, we'll use the Gini criterion. It's calculated using the\nformula",[13,466,467],{},[51,468,470],{"className":469},[54],[56,471,472],{"xmlns":58},[60,473,474,516],{},[63,475,476,478,481,484,487],{},[66,477,457],{},[91,479,480],{},"=",[482,483,399],"mn",{},[91,485,486],{},"−",[63,488,489,492,502,505,513],{},[91,490,491],{"fence":86},"(",[493,494,495,497,500],"msubsup",{},[66,496,13],{},[482,498,499],{},"0",[482,501,410],{},[91,503,504],{},"+",[493,506,507,509,511],{},[66,508,13],{},[482,510,399],{},[482,512,410],{},[91,514,515],{"fence":86},")",[70,517,518],{"encoding":72},"G = 1 - \\left( p_{0}^{2} + p_{1}^{2} \\right)",[13,520,521,522,540,541,559],{},"where ",[51,523,525],{"className":524},[54],[56,526,527],{"xmlns":58},[60,528,529,537],{},[63,530,531],{},[325,532,533,535],{},[66,534,13],{},[482,536,499],{},[70,538,539],{"encoding":72},"p_{0}"," and ",[51,542,544],{"className":543},[54],[56,545,546],{"xmlns":58},[60,547,548,556],{},[63,549,550],{},[325,551,552,554],{},[66,553,13],{},[482,555,399],{},[70,557,558],{"encoding":72},"p_{1}"," are the proportions of classes zero and one in\nour split.",[13,561,562],{},"Suppose we have three zeros and one one on the left. 
Then",[13,564,565],{},[51,566,568],{"className":567},[54],[56,569,570],{"xmlns":58},[60,571,572,654],{},[63,573,574,588,590,592,594,640,642,644,646,649,651],{},[325,575,576,578],{},[66,577,457],{},[63,579,580,582,584,586],{},[66,581,335],{},[66,583,338],{},[66,585,341],{},[66,587,344],{},[91,589,480],{},[482,591,399],{},[91,593,486],{},[63,595,596,598,616,618,622,638],{},[91,597,491],{"fence":86},[599,600,601,614],"msup",{},[63,602,603,605,612],{},[91,604,491],{"fence":86},[606,607,608,610],"mfrac",{},[482,609,421],{},[482,611,432],{},[91,613,515],{"fence":86},[482,615,410],{},[91,617,504],{},[619,620,621],"mtext",{}," ",[599,623,624,636],{},[63,625,626,628,634],{},[91,627,491],{"fence":86},[606,629,630,632],{},[482,631,399],{},[482,633,432],{},[91,635,515],{"fence":86},[482,637,410],{},[91,639,515],{"fence":86},[91,641,480],{},[482,643,399],{},[91,645,486],{},[482,647,648],{},"0.625",[91,650,480],{},[482,652,653],{},"0.375",[70,655,656],{"encoding":72},"G_{left} = 1 - \\left( \\left( \\frac{3}{4} \\right)^{2} + \\ \\left( \\frac{1}{4} \\right)^{2} \\right) = 1 - 0.625 = 0.375",[13,658,659],{},"The lower the criterion, the better. For example, if there are only ones or\nonly zeros on the left, the criterion value will be zero. 
This means that on\nthis side, we have perfectly separated one of the classes.",[13,661,662],{},"Let's calculate the criteria for all our candidates:",[296,664,665,731],{},[299,666,667],{},[302,668,669,673,701],{},[305,670,671],{},[117,672,309],{},[305,674,675],{},[51,676,678],{"className":677},[54],[56,679,680],{"xmlns":58},[60,681,682,698],{},[63,683,684],{},[325,685,686,688],{},[66,687,457],{"mathvariant":329},[63,689,690,692,694,696],{},[66,691,335],{"mathvariant":329},[66,693,338],{"mathvariant":329},[66,695,341],{"mathvariant":329},[66,697,344],{"mathvariant":329},[70,699,700],{"encoding":72},"\\mathbf{G}_{\\mathbf{left}}",[305,702,703],{},[51,704,706],{"className":705},[54],[56,707,708],{"xmlns":58},[60,709,710,728],{},[63,711,712],{},[325,713,714,716],{},[66,715,457],{"mathvariant":329},[63,717,718,720,722,724,726],{},[66,719,371],{"mathvariant":329},[66,721,374],{"mathvariant":329},[66,723,377],{"mathvariant":329},[66,725,380],{"mathvariant":329},[66,727,344],{"mathvariant":329},[70,729,730],{"encoding":72},"\\mathbf{G}_{\\mathbf{right}}",[389,732,733,742,751,760,768,776,784],{},[302,734,735,737,739],{},[394,736,396],{},[394,738,499],{},[394,740,741],{},"0.41",[302,743,744,746,748],{},[394,745,407],{},[394,747,499],{},[394,749,750],{},"0.44",[302,752,753,755,757],{},[394,754,418],{},[394,756,499],{},[394,758,759],{},"0.48",[302,761,762,764,766],{},[394,763,429],{},[394,765,653],{},[394,767,653],{},[302,769,770,772,774],{},[394,771,439],{},[394,773,759],{},[394,775,499],{},[302,777,778,780,782],{},[394,779,448],{},[394,781,750],{},[394,783,499],{},[302,785,786,788,790],{},[394,787,457],{},[394,789,741],{},[394,791,499],{},[13,793,794,795,815,816,839],{},"Now we just need to calculate the final metric for each candidate. It's\ncalculated as a weighted sum. Remember that we counted the number of elements\non the right and left. 
The weights will be exactly these counts divided by\nthe total number of elements ",[51,796,798],{"className":797},[54],[56,799,800],{"xmlns":58},[60,801,802,812],{},[63,803,804,808,810],{},[66,805,807],{"mathvariant":806},"normal","∣",[66,809,330],{},[66,811,807],{"mathvariant":806},[70,813,814],{"encoding":72},"|S|",". We have 8 elements in total, so\n",[51,817,819],{"className":818},[54],[56,820,821],{"xmlns":58},[60,822,823,836],{},[63,824,825,827,829,831,833],{},[66,826,807],{"mathvariant":806},[66,828,330],{},[66,830,807],{"mathvariant":806},[91,832,480],{},[482,834,835],{},"8",[70,837,838],{"encoding":72},"|S| = 8",". The final formula:",[13,841,842],{},[51,843,845],{"className":844},[54],[56,846,847],{"xmlns":58},[60,848,849,952],{},[63,850,851,854,856,858,888,902,904,936],{},[66,852,853],{},"M",[91,855,480],{},[619,857,621],{},[606,859,860,880],{},[63,861,862,864,878],{},[91,863,807],{"fence":86},[325,865,866,868],{},[66,867,330],{},[63,869,870,872,874,876],{},[66,871,335],{},[66,873,338],{},[66,875,341],{},[66,877,344],{},[91,879,807],{"fence":86},[63,881,882,884,886],{},[66,883,807],{"mathvariant":806},[66,885,330],{},[66,887,807],{"mathvariant":806},[325,889,890,892],{},[66,891,457],{},[63,893,894,896,898,900],{},[66,895,335],{},[66,897,338],{},[66,899,341],{},[66,901,344],{},[91,903,504],{},[606,905,906,928],{},[63,907,908,910,926],{},[91,909,807],{"fence":86},[325,911,912,914],{},[66,913,330],{},[63,915,916,918,920,922,924],{},[66,917,371],{},[66,919,374],{},[66,921,377],{},[66,923,380],{},[66,925,344],{},[91,927,807],{"fence":86},[63,929,930,932,934],{},[66,931,807],{"mathvariant":806},[66,933,330],{},[66,935,807],{"mathvariant":806},[325,937,938,940],{},[66,939,457],{},[63,941,942,944,946,948,950],{},[66,943,371],{},[66,945,374],{},[66,947,377],{},[66,949,380],{},[66,951,344],{},[70,953,954],{"encoding":72},"M = \\ \\frac{\\left| S_{left} \\right|}{|S|}G_{left} + \\frac{\\left| S_{right} \\right|}{|S|}G_{right}",[13,956,957],{},"Calculating for all 
our candidates:",[296,959,960,972],{},[299,961,962],{},[302,963,964,968],{},[305,965,966],{},[117,967,309],{},[305,969,970],{},[117,971,853],{},[389,973,974,981,988,995,1001,1007,1013],{},[302,975,976,978],{},[394,977,396],{},[394,979,980],{},"0.36",[302,982,983,985],{},[394,984,407],{},[394,986,987],{},"0.33",[302,989,990,992],{},[394,991,418],{},[394,993,994],{},"0.3",[302,996,997,999],{},[394,998,429],{},[394,1000,653],{},[302,1002,1003,1005],{},[394,1004,439],{},[394,1006,994],{},[302,1008,1009,1011],{},[394,1010,448],{},[394,1012,987],{},[302,1014,1015,1017],{},[394,1016,457],{},[394,1018,980],{},[13,1020,1021],{},"The best metric is for points C and E. We can split at either one, let's\nsplit at C.",[13,1023,1024],{},[189,1025],{"alt":1026,"src":1027},"Best candidate for the first split","/img/blog/ml-basic-3/image10.png",[13,1029,1030],{},"Now let's see what we got. On the left, there's only class 1. This means\nwe have a leaf here and we don't split further. On the right, there are\ndifferent classes. This means it's not a leaf, but another node. We repeat\nthe splitting for this node:",[13,1032,1033],{},[189,1034],{"alt":1035,"src":1036},"Best candidate for the second split","/img/blog/ml-basic-3/image11.png",[13,1038,1039],{},"After calculations, we'll see that we need to split at E. On the left, we'll\nhave class 0. On the right, class 1. That is, all objects are assigned to\nclasses. We add two leaves. 
The tree is built.",[13,1041,1042],{},"The final algorithm (initially the tree is empty):",[1044,1045,1046,1049,1052,1055,1058],"ol",{},[20,1047,1048],{},"Add a node at the root",[20,1050,1051],{},"Perform a split at the node",[20,1053,1054],{},"If all objects on the left are of the same class, create a leaf,\notherwise create a node",[20,1056,1057],{},"If all objects on the right are of the same class, create a leaf,\notherwise create a node",[20,1059,1060],{},"Repeat steps 2-4 for each newly created node until no unsplit nodes\nremain in the tree",[13,1062,1063],{},"Once the tree is built, for each new object we simply traverse it from the\nroot until we reach a leaf. The class in the leaf will be the class for our\nobject.",[43,1065,1067],{"id":1066},"more-dimensions","More Dimensions",[13,1069,1070],{},"We've learned how to build trees for one-dimensional input data, that is,\nfor single numbers. For example, using the tree described in the previous\nsection, we could try to determine whether we're looking at an apple or a\npear based on a single parameter, for instance, diameter. However, such\nclassification would be very inaccurate. Diameter alone says almost nothing\nabout whether it's a pear or an apple.",[13,1072,1073],{},[189,1074],{"alt":1075,"src":1076},"Data that cannot be effectively separated in one dimension","/img/blog/ml-basic-3/image12.png",[13,1078,1079],{},"In all the figures below, we'll mark apples in red and pears in blue. From\nthe figure, you can see that it's impossible to split the data into\nsingle-class groups of more than one element.",[13,1081,1082],{},"The situation changes dramatically if we add a second dimension. Now each\nfruit will be described not by one parameter, but by two: width (or diameter)\nand height. 
Adding a second parameter will allow us to distinguish apples\nfrom pears much more effectively.",[13,1084,1085],{},[189,1086],{"alt":1087,"src":1088},"Moving data to 2 dimensions","/img/blog/ml-basic-3/image13.png",[13,1090,1091],{},"Notice that nothing has changed along the X axis. But now the data is very\nwell separated along the Y axis.",[13,1093,1094],{},"If a human were doing the classification, they would look at the elongation\n(the difference between height and width). To do something similar with\nmachine learning, we can again use decision trees. We just need to figure\nout how to build a tree for our case.",[13,1096,1097],{},"In fact, the algorithm will be almost the same. At each step, we split the\ndata into two parts. The main difference is that before, we didn't need to\nchoose which parameter to split by, since there was only one. Now there are\ntwo parameters, and first we need to decide which one to use for splitting.\nAfter that, everything reduces to choosing the split point, which we already\nlearned to do in the previous chapter.",[13,1099,1100],{},"What will such splits look like on a graph? Before, we had only one axis,\nand we simply placed a point, to the left of which all objects go to one\nbranch of the tree, and to the right — to another. Now we have two axes.\nAnd the first step is choosing which axis to split by. If we choose the X\naxis, it will be a vertical line; if Y — a horizontal line.",[13,1102,1103],{},[189,1104],{"alt":1105,"src":1106},"Splitting candidate in 2 dimensions","/img/blog/ml-basic-3/image14.png",[13,1108,1109],{},"Orange and green colors show splits along different features (axes).",[13,1111,1112],{},"At the second stage, we choose exactly where such a line will pass. After\nthat, objects on different sides of the line (above/below or left/right)\ngo to different branches of the tree.",[13,1114,1115],{},"It remains to figure out how exactly to choose the axis. In fact, everything\nis very simple. 
We now iterate through the splitting candidates of every feature, not\njust one, and choose the best among all of them. So when at the second\nstep we need to split along the chosen axis, we already know where to do\nit, since we calculated that at the first step. If we\nvisualize the tree's operation on the graph, we get this result:",[13,1117,1118],{},[189,1119],{"alt":1120,"src":1121},"Splitting result in 2 dimensions","/img/blog/ml-basic-3/image15.png",[13,1123,1124],{},"The final algorithm will look like this:",[1044,1126,1127,1129,1132,1135,1137,1139],{},[20,1128,1048],{},[20,1130,1131],{},"Find the best split for each feature",[20,1133,1134],{},"Perform the split at the node using the best feature",[20,1136,1054],{},[20,1138,1057],{},[20,1140,1141],{},"Repeat steps 2-5 for each newly created node until no unsplit nodes\nremain in the tree",[13,1143,1144],{},"If you look at this algorithm, it becomes clear that nothing prevents us\nfrom working with higher-dimensional data. For example, for three-dimensional\ndata, we would separate it not with a line, but with a plane. This can still\nbe visualized. But for four or more dimensions, there is no clear\nvisualization, since the data would be separated by a hyperplane.",[43,1146,1148],{"id":1147},"overfitting-and-what-to-do-about-it","Overfitting and What to Do About It",[13,1150,1151],{},"So far, we've been building the tree until all objects are correctly\nclassified. At first glance, this seems like the only right approach — after\nall, we don't want to make errors. But this is only at first glance.\nPerfectionism in machine learning does more harm than good.",[13,1153,1154],{},"Let's return to the apple and pear example and imagine that a defective\nelongated apple made it into our training set. The decision tree, if left\nas is, will build branches to correctly classify this object as an apple,\neven though it's surrounded by pears. 
As a result, when we use this tree\nfor classification, there's a risk that a normal pear will be classified as\nan apple.",[13,1156,1157],{},[189,1158],{"alt":1159,"src":1160},"Overfitting in two dimensions","/img/blog/ml-basic-3/image16.png",[13,1162,1163],{},"What can we do about this? We need to make the tree ignore single unusual\nexamples. There are several ways to do this:",[1044,1165,1166,1169,1172],{},[20,1167,1168],{},"Limit the tree depth. For example, we can allow no more than three\nsplits. After three splits, the outermost nodes become leaves.",[20,1170,1171],{},"Limit the minimum number of data points in a leaf. For example, if a\nnode has four or fewer objects, it automatically becomes a leaf and\nis not split further.",[20,1173,1174],{},"Limit the total number of leaves. If the limit is reached, no new\nsplits occur, and all outermost nodes become leaves.",[13,1176,1177],{},"Each of these methods can help combat overfitting. But there is no universal\nrecipe. Usually, a combination of methods is applied, and parameters are\ntuned experimentally.",[13,1179,1180],{},"For such experiments (and not only for decision trees, but for many other\nmachine learning models as well), having a test set is very useful. What is\nit? Before training begins, we set aside 10-20 percent of the data as the\ntest set. And we don't use it during training. This is a very important\npoint — the test data must never be seen by the model during training.",[13,1182,1183],{},"When training is complete, we run the test data through the model and see\nhow well it performs. If the results on the training data are very good but\npoor on the test data, this is most likely overfitting, and measures need\nto be taken. 
Usually, the goal is for the percentage of correct answers on\nthe training and test sets to be roughly the same.",[1185,1186,1188,1207,1229,1342,1355,1368,1381,1394],"faq",{"title":1187},"Questions and Answers",[1189,1190,1192,1199],"faq-item",{"value":1191},"item-1",[1193,1194,1196],"template",{"v-slot:question":1195},"",[13,1197,1198],{},"How does classification differ from regression?",[1193,1200,1201],{"v-slot:answer":1195},[13,1202,1203,1204,1206],{},"In regression, the model predicts a continuous number from an arbitrary\nrange (for example, exact weight in kilograms). In classification, the\nmodel chooses from predefined options — ",[117,1205,123],{},". If there are only\ntwo classes, it's binary classification (a \"yes\" or \"no\" answer).",[1189,1208,1210,1215],{"value":1209},"item-2",[1193,1211,1212],{"v-slot:question":1195},[13,1213,1214],{},"What is a decision tree and what does it consist of?",[1193,1216,1217],{"v-slot:answer":1195},[13,1218,1219,1220,1222,1223,1225,1226,1228],{},"It's an algorithm that represents a sequence of checks. A tree consists\nof three types of elements: a ",[117,1221,270],{}," — the starting point where the\nalgorithm begins; ",[117,1224,278],{}," — intermediate points where the data is split\ninto two parts; ",[117,1227,274],{}," — endpoints that contain the final class. To\nclassify a new object, you traverse from the root through the nodes,\nperforming checks, until you reach a leaf.",[1189,1230,1232,1237],{"value":1231},"item-3",[1193,1233,1234],{"v-slot:question":1195},[13,1235,1236],{},"What is the Gini criterion and why is it needed?",[1193,1238,1239],{"v-slot:answer":1195},[13,1240,1241,1242,1285,1286,540,1304,1322,1323,1341],{},"The Gini criterion shows how \"pure\" the data split is at a node. 
It's\ncalculated using the formula\n",[51,1243,1245],{"className":1244},[54],[56,1246,1247],{"xmlns":58},[60,1248,1249,1282],{},[63,1250,1251,1253,1255,1257,1259,1262,1270,1272,1280],{},[66,1252,457],{},[91,1254,480],{},[482,1256,399],{},[91,1258,486],{},[91,1260,491],{"stretchy":1261},"false",[493,1263,1264,1266,1268],{},[66,1265,13],{},[482,1267,499],{},[482,1269,410],{},[91,1271,504],{},[493,1273,1274,1276,1278],{},[66,1275,13],{},[482,1277,399],{},[482,1279,410],{},[91,1281,515],{"stretchy":1261},[70,1283,1284],{"encoding":72},"G = 1 - (p_0^2 + p_1^2)",", where ",[51,1287,1289],{"className":1288},[54],[56,1290,1291],{"xmlns":58},[60,1292,1293,1301],{},[63,1294,1295],{},[325,1296,1297,1299],{},[66,1298,13],{},[482,1300,499],{},[70,1302,1303],{"encoding":72},"p_0",[51,1305,1307],{"className":1306},[54],[56,1308,1309],{"xmlns":58},[60,1310,1311,1319],{},[63,1312,1313],{},[325,1314,1315,1317],{},[66,1316,13],{},[482,1318,399],{},[70,1320,1321],{"encoding":72},"p_1"," are the proportions\nof classes 0 and 1 among the objects. The lower the value, the better\nthe split: ",[51,1324,1326],{"className":1325},[54],[56,1327,1328],{"xmlns":58},[60,1329,1330,1338],{},[63,1331,1332,1334,1336],{},[66,1333,457],{},[91,1335,480],{},[482,1337,499],{},[70,1339,1340],{"encoding":72},"G = 0"," means all objects on one side belong to the same class.",[1189,1343,1345,1350],{"value":1344},"item-4",[1193,1346,1347],{"v-slot:question":1195},[13,1348,1349],{},"Why can't we just choose the split with the maximum number of correct\nanswers?",[1193,1351,1352],{"v-slot:answer":1195},[13,1353,1354],{},"This approach is called a greedy algorithm — it optimizes only the\ncurrent step without considering subsequent ones. The best split at the\nfirst stage may lead to poor quality at later stages. 
That's why a\nweighted sum of Gini criteria is used: it accounts not only for the\n\"purity\" of the split but also for the number of objects on each side.",[1189,1356,1358,1363],{"value":1357},"item-5",[1193,1359,1360],{"v-slot:question":1195},[13,1361,1362],{},"How does a decision tree work with data that has more than one feature?",[1193,1364,1365],{"v-slot:answer":1195},[13,1366,1367],{},"The algorithm is almost the same. The only difference: at each step, we\nneed to not only choose a split point but also determine which feature\nto split by. For two features, a split looks like a vertical or\nhorizontal line on a graph. For three features — a plane, and for more\n— a hyperplane. Candidates are evaluated across all features, and the\nbest option is selected.",[1189,1369,1371,1376],{"value":1370},"item-6",[1193,1372,1373],{"v-slot:question":1195},[13,1374,1375],{},"What is overfitting and why is it dangerous?",[1193,1377,1378],{"v-slot:answer":1195},[13,1379,1380],{},"Overfitting occurs when a model adapts too closely to the training data,\nincluding noise and outliers. For example, if an atypical object appears\nin the training set, the tree will create separate branches to classify\nit. 
As a result, the model will make errors on new data because it has\nlearned the peculiarities of a specific dataset rather than the general\npattern.",[1189,1382,1384,1389],{"value":1383},"item-7",[1193,1385,1386],{"v-slot:question":1195},[13,1387,1388],{},"What methods exist to combat overfitting in decision trees?",[1193,1390,1391],{"v-slot:answer":1195},[13,1392,1393],{},"The main approaches are: limiting tree depth (maximum number of splits),\nlimiting the minimum number of objects in a leaf (a node with few\nelements is not split further), and limiting the total number of leaves.\nIn practice, several methods are usually combined, and parameters are\ntuned experimentally using a test set.",[1189,1395,1397,1402],{"value":1396},"item-8",[1193,1398,1399],{"v-slot:question":1195},[13,1400,1401],{},"Why is a test set needed?",[1193,1403,1404],{"v-slot:answer":1195},[13,1405,1406],{},"A test set is a portion of the data (usually 10–20%) that the model\nnever sees during training. After training, we evaluate the model's\nquality on this data. If the model performs well on the training data\nbut poorly on the test data, this is a sign of overfitting. Ideally,\nthe quality on both sets should be roughly the same.",{"title":1195,"searchDepth":1408,"depth":1408,"links":1409},2,[1410,1411,1412,1413,1414],{"id":45,"depth":1408,"text":46},{"id":147,"depth":1408,"text":148},{"id":225,"depth":1408,"text":226},{"id":1066,"depth":1408,"text":1067},{"id":1147,"depth":1408,"text":1148},"mdx",{"readTime":1417,"image":1418,"date":1419,"tags":1420,"authors":1423},"13 minutes","/img/blog/ml-basic-3/preview.png","2026-05-15",[1421,1422],"Artificial Intelligence","Machine Learning",[1424],"vgorash",true,{"title":1427,"description":1428},"What Do Trees Think About?","In this article, we will start exploring the classification problem. 
The first algorithm we look at is also the most intuitive one: decision trees.","en/blog/ml-basic-3","aI6iZL_1crrMP1OJPukD83Wb4Y9RByDfpvqIoGhC_JA",1780489474885]