Explanation of Various Statistical Models in Machine Learning

Raghav Agarwal
17 min read · Jun 1, 2021


This blog covers the statistical models listed below:

  1. ML Models
  • KNN
  • Naïve Bayes
  • Logistic Regression
  • Linear Regression
  • SVM
  • Decision Trees

2. Ensembles

  • Bagging and Random Forest Classifier
  • Boosting and Gradient Boosting Classifier

Important note: not every topic is covered in full depth; this blog provides a mathematical and intuitive explanation of the models.

Understanding Machine Learning

Looking from a 1,000-foot view, an ML model can be described as an algorithm, or an optimization problem, used to solve a machine learning problem statement.

What is a Machine Learning problem, and how do these models help in solving an ML problem?

A machine learning problem is one that demands answers from an algorithm without being explicitly programmed, and an ML model is designed in such a way that it learns the behavior of the data it is trained on and predicts the answer for new data points.

A machine learning problem can broadly be categorized into two types:

  • Supervised Learning
  • Unsupervised Learning
Types of ML learning and approaches to solve

This blog covers supervised learning, as all the models listed above fall under this approach. In simple terms, supervised learning is when we know the labels of the data and which class each point belongs to, and unsupervised learning is simply the opposite; there are also cases where supervised and unsupervised techniques are used together to solve a problem statement.

Diagrammatic Representation

Note: I made all these diagrams using simple Paint; if anything is confusing, do search it up and get your answer.

As you can see in the diagram, in supervised learning the data can have subgroups of red and blue points, but in unsupervised learning this label information is missing.

In supervised learning, as shown in the figure, there are broadly two types of problems:

  • Classification

Given data and its labels, the task is to decide which class (subset of labels) any new data point will lie in.

  • Regression

Given data, the task is to build a function that fits the data nicely and can give a definite answer for any new point.

Rather than statements, let's look at a fun way of understanding these types.

Lets say we have a Venn diagram something like this:

We have two superhumans, one a classifier and one a regressor. You rely on the classifier to tell you which food you want to eat at which time; for that, it will throw a dart at this Venn diagram, and for every dart the regressor can tell how fast that dart travelled to the Venn diagram. Here the classifier has met a lot of people like you and has learnt your kind of behavior, and the regressor has met a lot of people like the classifier and has made itself capable of predicting the speed.

Hence in graph terminology

Now, when it was said that the classifier has met a lot of people like you and understands your behavior towards food choices at each time, it was training itself to know and give the classification answer. The same follows for the regressor: it has met a lot of people like the classifier, seen their dart-throwing skills, and is efficient at guessing the speed of a dart thrown by these people.

Set form

This is how, very intuitively, a machine learning algorithm creates models and is able to learn from data without being explicitly programmed.

Note: training includes concepts like hyperparameter tuning, solving an optimization problem, and finding maxima or minima, which are not discussed in this blog.

Machine Learning Models

Now, after understanding what a machine learning problem and a machine learning algorithm are, let's start understanding the models used for such tasks.

K-NN / K-Nearest Neighbors

A K-NN model, or algorithm, is a non-parametric technique that uses the "K" nearest neighbors of a point and classifies the point by the majority of their labels (aggregation).

Graphical Understanding

Let's say we have data something like this:

According to K-NN, it will look at the K nearest neighbors of the data point and will classify the point as per the majority of the labels present in its vicinity, corresponding to the value of K.

In the figure, the "green point"'s 7 nearest neighbors are all red points, hence K-NN will classify it as a red point; similarly, the "brown point"'s 7 nearest neighbors are all blue points, hence it will be classified as a blue point; for the "yellow point", the 7 nearest neighbors are 6 red and 1 blue, so it will be classified as a red point.

Algorithm (K-NN for binary classification)

Input: query point x_q, training data D = {(x, y)}, y in {0, 1}, K

def knn_classify(x_q, D, K):
    # labels of the K nearest training points to x_q
    neighbors = get_k_neighbours(x_q, D, K)

    cnt_0 = 0
    cnt_1 = 0
    for y in neighbors:
        if y == 1:
            cnt_1 += 1
        else:
            cnt_0 += 1

    if cnt_1 > cnt_0:
        return 1
    else:
        return 0

The above algorithm is K-NN for binary classification, where y = {0, 1}; for multi-class classification, only the counting inside the neighbors loop needs more conditions, the rest remains the same. As for the get_k_neighbours() function, there are many types of distances used in K-NN to define the neighbors of a point.

Types of distances:

Euclidean Distance

Manhattan Distance

Cosine Distance

There are many more; the above three are the most commonly used in practice.
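As a rough illustration, here is a minimal NumPy sketch of the three distances (the function names are my own, not from any particular library):

import numpy as np

def euclidean(a, b):
    # square root of the sum of squared coordinate differences (L2 distance)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute coordinate differences (L1 distance)
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity; measures the angle between vectors, ignores magnitude
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))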

Ways to find the K nearest neighbors:

The brute-force algorithm above has a time complexity of O(nd), which is the same at training and inference time. There are other, more efficient methods to find the K nearest neighbors, such as K-D trees, LSH, and others; in practice the built-in libraries inspect the data and then apply the appropriate algorithm for the K-NN computation.

Snippet from the sklearn KNN documentation

Covering K-D trees and LSH would be a little out of scope for this blog, because going into them requires an understanding of training and testing, which would then make room for understanding why K-D trees and LSH exist in the first place. For now, understand that it is due to time complexity that we go for approaches other than brute force to find the K nearest neighbors.
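As referenced in the sklearn documentation snippet above, the algorithm parameter lets the library pick the neighbor-search strategy; a minimal usage sketch on toy data of my own:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])  # toy 2-D points
y = np.array([0, 0, 0, 1, 1, 1])                                # their labels

# algorithm='auto' lets sklearn choose between brute force, KD-tree and ball-tree
knn = KNeighborsClassifier(n_neighbors=3, algorithm='auto')
knn.fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]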

Failure Cases of K-NN

KNN will not perform well every time.

In the first and the second image, the data point is far away from both classes and will give conflicting results depending on the value of "K" in K-NN.

Curse of Dimensionality

The curse of dimensionality indicates that as the dimensionality increases, the number of points needed to cover the space increases exponentially.

Taking Euclidean distance as the distance measure for 1-D, 2-D, and 3-D data,

but as the dimensionality increases, the ratio (dist-max − dist-min) / dist-min tends to "0", which makes dist-max and dist-min effectively equal, so K-NN is of no use on such data. A simple solution to this problem "can" be cosine distance, but it is preferable to be careful with K-NN on high-dimensional data.
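A rough way to see this numerically (a toy experiment of my own, not from the blog): as the dimensionality d grows, the relative gap between the farthest and nearest random points shrinks.

import numpy as np

rng = np.random.default_rng(0)

# Compare (dist_max - dist_min) / dist_min for random points as dimensionality grows;
# this ratio shrinking toward 0 is what makes nearest-neighbor distances uninformative.
for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))          # 500 random points in [0, 1]^d
    q = rng.random(d)                 # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    print(d, (dists.max() - dists.min()) / dists.min())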

KNN for Classification and Regression

Evidently, KNN can be used for both classification and regression. The methodology above defines how it is used as a classification algorithm; for a regression problem, after finding the K nearest neighbors of a point, rather than taking the majority label we take the mean/median of the neighbors' values, which easily converts its purpose from classification to regression. In practice, many models can be used for both classification and regression with a little tweak, as in the sketch below.
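A minimal sklearn sketch of this "same model, different aggregation" idea (the toy data here is my own): the classifier takes a majority vote over the neighbors, while the regressor takes their mean.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_class = np.array([0, 0, 1, 1, 1])              # labels for classification
y_reg = np.array([1.1, 1.9, 3.2, 3.9, 5.1])      # real-valued targets for regression

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)  # majority vote of 3 neighbors
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)     # mean of the 3 neighbors' targets
print(clf.predict([[2.5]]), reg.predict([[2.5]]))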

Naive Bayes Model

The Naive Bayes classifier is an approach to solving an ML problem using the Naive Bayes algorithm: forming an optimization problem with it and solving for each label in the data.

The above definition will make sense shortly; let's first understand the Bayes' theorem approach.

According to conditional probability,

For Naive Bayes,

from (1) and (2),

Converting this to an ML-related problem,

x is n-dimensional, hence,

Note: correction in the figure above: recursively, it should be P(Ck) and not P(Ck | x) after the "=" sign.
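For reference, a reconstruction of the standard equations the figures above refer to (Bayes' rule and the Naive Bayes factorization under the conditional-independence assumption):

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}    (1)

P(C_k \mid x) = \frac{P(x \mid C_k)\, P(C_k)}{P(x)}    (2)

P(C_k \mid x) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)

\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)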

The main assumption the Naive Bayes model makes is conditional independence between the features xi, where "i" goes from 1 to n. In practice this is generally not true, and this assumption gives the NB model less predictive power; nevertheless, the Naive Bayes classifier is the go-to baseline algorithm for the task of text classification.

Naive Bayes for Text Classification

Laplace Smoothing

At inference time, the Naive Bayes classifier uses the train-data corpus to get the likelihood probabilities, but in many cases the new data point has some xi that is not present in the train data, and the whole prediction becomes 0. To avoid this, we add a value called "alpha" to every count: alpha in the numerator and "number of distinct feature values * alpha" in the denominator. This is called Laplace smoothing.
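A tiny sketch of a smoothed likelihood, assuming a word-count (multinomial) setup; the function and variable names are my own, purely for illustration:

# P(word | class) with Laplace smoothing in a word-count setup
def smoothed_likelihood(word, class_word_counts, alpha, vocab_size):
    # class_word_counts: dict mapping word -> count of that word in the class
    count = class_word_counts.get(word, 0)
    total = sum(class_word_counts.values())
    return (count + alpha) / (total + alpha * vocab_size)

counts = {"good": 3, "movie": 2}
print(smoothed_likelihood("bad", counts, alpha=1.0, vocab_size=5))  # unseen word, still > 0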

An interesting thing about the distribution of likelihoods in the Naive Bayes classifier

Likelihoods are the conditional probabilities of xi being present given the label; with Laplace smoothing, if we keep increasing the alpha value, the distribution of these likelihoods tends to move toward a uniform distribution, i.e., it gets smoothed out.

Why Laplace “smoothing”

This brings in another assumption: assuming a predefined distribution for the likelihoods and getting the probability from that distribution's probability density function. Hence we see many types of Naive Bayes classifiers, e.g., Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, etc.

(Link to Sklearn doc of Naive Bayes)
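For illustration, sklearn ships these variants directly (the toy count matrix below is my own); alpha is the Laplace smoothing parameter discussed above:

import numpy as np
from sklearn.naive_bayes import MultinomialNB  # GaussianNB and BernoulliNB live in the same module

X_counts = np.array([[2, 0, 1],
                     [0, 3, 0],
                     [1, 1, 0],
                     [0, 0, 4]])   # e.g. per-document word counts
y = np.array([0, 1, 0, 1])

# MultinomialNB -> count features (text); GaussianNB -> continuous features;
# BernoulliNB -> binary present/absent features
model = MultinomialNB(alpha=1.0).fit(X_counts, y)
print(model.predict([[1, 0, 2]]))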

Logistic Regression

The Logistic Regression model attempts to draw a decision boundary between the points and classify them on the basis of the decision boundary generated.

The assumption the Logistic Regression model makes is that the data is linearly separable, hence it draws a decision boundary on it. Using this decision boundary, every point is classified into a label.

# for the diagram
if W.T * xi is +ve:
    point is "RED"
else:
    point is "BLUE"

Considering this, let's try to form an optimization problem out of it. An optimization problem is the problem of finding the best solution among all feasible solutions, i.e., in a way it is a heuristic approach to solving a problem. Understanding the formation of the optimization problem will make the understanding of Logistic Regression very thorough and logical.

# for yi = +1, W.T * xi is +ve, hence yi * (W.T * xi) is +ve
# for yi = -1, W.T * xi is -ve, hence yi * (W.T * xi) is +ve

So, summed over all the points in the data, the product yi * (W.T * xi) should be maximized; this gives us the simple optimization problem for Logistic Regression.

The problem with this optimization problem is that it is sensitive to outliers: the decision boundary can vary very significantly due to just a single outlier, and this happens when the outlier has a huge value. To handle this, we need a function that keeps small values small and squashes large values into a particular range, hence we use the sigmoid function.

Sigmoid function

Along with this, we also exploit monotonicity in the optimization problem using log: a monotonic function preserves the ordering of its input, so it does not change where the maximum lies, and log is the best example of it.

Log Function

If x is increasing, log(x) is also increasing, which makes log a monotonic function. Combining all of this, we can reach the optimization problem of the logistic regression model.
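Putting the pieces together, the optimization problem (written here in its standard form, which should match what the figure showed) is:

\sigma(z) = \frac{1}{1 + e^{-z}}

W^{*} = \arg\max_{W} \sum_{i=1}^{n} \log \sigma\!\left(y_i\, W^{T} x_i\right)
      = \arg\min_{W} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, W^{T} x_i}\right)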

Solving this optimization problem (typically with a gradient-based method such as gradient descent) gives the weight vector "W", through which the decision boundary is made.

Significance of the weight vector

The weight vector we get from our optimization problem defines the decision boundary we go for.

Probabilistic interpretation:
W = <W1, W2, ..., Wd> is the weight vector
if Wi is +ve -> Wi * xi is +ve -> probability of y'i being 1 is higher
if Wi is -ve -> Wi * xi is -ve -> probability of y'i being -1 is higher

So, for a feature Fi in the feature vector, if the corresponding Wi is positive, the probability of y'i being 1 is higher, and vice versa. This is the significance of, and the relation between, the weight vector and the feature vector; a small sketch follows.
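A minimal sklearn sketch of this interpretation (toy data of my own); coef_ holds the learned weight vector W:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # label depends positively on F1, negatively on F2

model = LogisticRegression().fit(X, y)
print(model.coef_)                   # W1 should come out positive, W2 negative
print(model.predict([[2.0, -1.0]]))  # -> [1]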

Note: after understanding the optimization problem, it is important to understand "why regularization", "types of regularization", "how to solve an optimization problem", and "hyperparameter search", which answer the question of how to actually get the weight vector. This blog talks about the concepts in each model, hence these topics are not explained here.

Linear Regression

A linear regression model is a model that tries to fit the data using a line/plane.

Prediction = W.T * xi + b, where W is the vector normal to the plane (its direction).

Using the same approach as logistic regression, let's form the optimization problem.

Looking at the diagram above, it can be seen that among all the data points, the regression plane gives an error on three of the red points, and this is in fact the plane that gives the minimum error in regression for this data. This idea of minimizing the error over the whole train data forms the optimization problem of Linear Regression.

The error we minimize in this problem is the "mean squared error", which is clearly visible in the optimization problem below.
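Written out (a standard reconstruction, since the figure may not render), the problem is:

(W^{*}, b^{*}) = \arg\min_{W,\, b} \; \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (W^{T} x_i + b)\right)^{2}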

SVM ( Support Vector Machines)

SVM, or Support Vector Machines, are a type of model that follows the approach of a "large-margin classifier", i.e., it maximizes the margin between the +ve plane and the -ve plane around the decision plane.

Using the margin-maximization approach, we can actually form the optimization equation for SVM.

The diagram shows the equations of the pi, pi+ve and pi-ve planes. All the points lying on the +ve and -ve planes will have yi * (W.T * X) = 1, and all the points beyond them will have yi * (W.T * X) > 1; hence our optimization reads: construct a margin such that yi * (W.T * X) ≥ 1 while maximizing the gap between the planes.

The optimization problem constructed above is a hard-margin problem; if the data is not linearly separable, the optimization problem will not be able to build the margin classifier.

Introducing Hinge loss in the optimization problem

The first thing to note in the diagram is that the data is not linearly separable; for whichever points are wrongly classified, the value of the loss increases. Looking at this behavior, we introduce a "Zeta" variable.

Zeta = 0 , if yi * (W.T * X) >= 1
Zeta > 0 , if yi * (W.T * X) < 1

Intuitively, "Zeta" tells us whether a point is on the correct side of the plane or not, and if not, how wrong it is; for our margin classifier this should be minimized, hence we introduce the "Hinge loss" into the optimization problem.

Using this optimization problem, SVM can fit data that is not linearly separable. In the optimization problem below, "C" controls the trade-off between maximizing the margin and minimizing the slack (misclassification) and acts as a "Hyperparameter".
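For reference, the soft-margin form described above, in its standard notation (ξ is the "Zeta" slack variable):

\min_{W,\, b,\, \xi} \;\; \frac{1}{2} \lVert W \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i \left(W^{T} x_i + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0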

Loss-Minimization of Hinge Loss

Let zi = yi * (W.T * xi + b)
    zi > 0 : xi is correctly classified
    zi < 0 : xi is incorrectly classified

Hinge loss:
    zi >= 1 : hinge-loss = 0
    zi <  1 : hinge-loss = 1 - zi

Therefore,
    hinge-loss = max(0, 1 - zi)
               = max(0, 1 - yi*(W.T*xi + b))

Both the soft-margin problem and the loss-minimization problem solve the same linear problem and are equivalent up to constants; the loss-minimization form is preferred when the trade-off parameter needs to sit next to the regularizer. Hence the two optimization problems are equal.
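A tiny NumPy sketch of the hinge-loss computation above (the names are mine, for illustration only):

import numpy as np

def hinge_loss(W, b, X, y):
    # y in {-1, +1}; z_i = y_i * (W.x_i + b); per-point loss = max(0, 1 - z_i)
    z = y * (X @ W + b)
    return np.mean(np.maximum(0.0, 1.0 - z))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
W, b = np.array([0.5, 0.5]), 0.0
print(hinge_loss(W, b, X, y))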

Note: there is a dual form of SVM, which has the interesting property of kernelization, allowing SVM to fit non-linear data; the dual form is not discussed in this blog.

Decision Trees

The Decision Tree is exactly what its name suggests: it builds an "if-else" statement tree over the whole data and uses it to perform the task.

How does the Decision Tree decide how to build the tree, given the data?

To answer this question, we have to understand some more theory related to Decision Trees.

Entropy

Entropy controls how a Decision Tree decides to split the data. It actually affects how a Decision Tree draws its boundaries.

Intuition on Entropy

For balanced data the entropy is high, whereas for imbalanced data the entropy is low. Intuitively, entropy measures the randomness of the data, in the subtle sense of how the label values change for each probability value; hence a uniform distribution has the highest entropy, e.g., for a throw of a die, every throw has probability 1/6 but a different outcome.

Probability Distribution Function in Entropy

Kullback–Leibler (KL) divergence

Information Gain

It is a synonym for Kullback–Leibler divergence: the amount of information gained about a random variable or signal from observing another random variable. However, in the context of decision trees, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one.
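For reference, the standard formulas behind these three quantities (a reconstruction of what the figures above presumably showed):

H(Y) = -\sum_{k} p_k \log_2 p_k

D_{KL}(P \,\|\, Q) = \sum_{k} p_k \log_2 \frac{p_k}{q_k}

IG(Y, A) = H(Y) - \sum_{v \in \text{values}(A)} \frac{|Y_v|}{|Y|}\, H(Y_v)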

Gini Impurity

Gini Impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset.

Gini impurity behaves similarly to entropy and is cheaper to compute, hence it can be used as a substitute for entropy.
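A quick sketch comparing the two impurity measures on the same label distributions (pure NumPy, function names mine):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

balanced = [0, 0, 1, 1]        # 50/50 split -> highest impurity
imbalanced = [0, 0, 0, 1]      # mostly one class -> lower impurity
print(entropy(balanced), gini(balanced))      # 1.0, 0.5
print(entropy(imbalanced), gini(imbalanced))  # ~0.811, 0.375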

Constructing a Decision Tree

The classic example for making a decision tree is given below

This is the data we have for building the Decision Tree; the algorithm goes as follows:

Load the learning set S and create the decision tree root node (rootNode); add learning set S into rootNode as its subset
For rootNode, compute Entropy(rootNode.subset) first
If Entropy(rootNode.subset) == 0 (subset is homogeneous)
    return a leaf node

If Entropy(rootNode.subset) != 0 (subset is not homogeneous)
    compute Information Gain for each attribute left (not yet used for splitting)
    Find attribute A with Maximum(Gain(S, A))
    Create child nodes for this root node and add them to rootNode in the decision tree

For each child of the rootNode
    Apply ID3(S, A, V)
    Continue until a node with Entropy close to 0 or a leaf node is reached

(Note : Understanding of the algorithm)

Following the algorithm, the decision tree for this problem looks like this:
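Separately, in practice a library runs this kind of algorithm for you; a minimal sklearn sketch on a toy, already-encoded weather-style dataset (the encoding is my own, not the exact table above):

from sklearn.tree import DecisionTreeClassifier

# toy, already-encoded features: [outlook, temperature, humidity, windy]
X = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [2, 1, 0, 0], [2, 2, 1, 0], [2, 2, 1, 1]]
y = [0, 0, 1, 1, 1, 0]   # play / don't play

# criterion can be 'entropy' (information gain) or 'gini' (Gini impurity)
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3).fit(X, y)
print(tree.predict([[1, 2, 1, 0]]))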

2. Ensembles

Ensembling is the idea of using multiple models to create a more powerful model for the task.

Bagging and Random Forest Classifier

In bagging, we split the data into multiple bootstrap samples (sampled with repetition), train a model on each sample, and in the end stage take the aggregation of all the models' outputs to perform the task.

Now, if we change a few points in Dn, only some samples, and hence some of the models, will be affected, so the overall model does not lose its predictive power; bagging also reduces the variance of the models without impacting the bias.

Random Forest Classifier

Random Forest Classifier : Decision Trees (base learners) +
                           Bagging (row sampling) +
                           Column Sampling (column bagging)

A data sample is generated by selecting examples from the data randomly with replacement, and this sample is used to learn a decision tree. Further, while choosing the next feature to split on, only a subset of randomly selected features is considered. The ensemble model is produced by using all the decision trees generated, as in the sketch below.
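A minimal sklearn sketch of the same recipe (toy data of my own): n_estimators is the number of trees, max_features controls the column sampling, and bootstrap row sampling is on by default.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of bagged decision trees
    max_features='sqrt',   # column sampling at each split
    bootstrap=True,        # row sampling with replacement
    random_state=0,
).fit(X, y)
print(rf.score(X, y))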

Boosting and Gradient Boosting Classifier

Boosting tends to reduce bias while keeping the variance stable for the base learners in the ensemble.

Boosting : base learners with low variance & high bias + additively combine

Algorithm for boosting

Source

Form a large set of simple features
Initialize weights for the training images
For T rounds:
    Normalize the weights
    For the available features from the set, train a classifier using a single feature and evaluate the training error
    Choose the classifier with the lowest error
    Update the weights of the training images: increase if classified wrongly by this classifier, decrease if correctly
Form the final strong classifier as the linear combination of the T classifiers (coefficient larger if training error is small)

Gradient Boosting Classifier

In gradient boosting, the residual error is simply replaced with the negative gradient of the loss function; so if we have a differentiable loss function and high-bias base models, applying gradient boosting is a good choice.

Residual and Negative Gradient

Algorithm(Source):
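A minimal sklearn sketch of a gradient boosting classifier (toy data of my own); each new shallow tree is fit additively to the negative gradient of the loss of the current ensemble.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=100,     # number of boosting stages (trees added additively)
    learning_rate=0.1,    # shrinks each tree's contribution
    max_depth=3,          # shallow, high-bias base learners
    random_state=0,
).fit(X, y)
print(gbc.score(X, y))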

Summary

In summary, it is important to understand how these statistical models differ from each other in terms of methodology and assumptions, and what the deciding factor between them is for use in practical, real-time applications.

Hope this blog gave a mathematical and intuitive understanding of these models.

Thank you.

LinkedIn : https://www.linkedin.com/in/raghav-agarwal-127539170


Raghav Agarwal

Data Scientist with experience at Microsoft, Cummins, and George Washington University.