Explanation of Various Statistical Models in Machine Learning
This blog covers explanations of the statistical models listed below:
1. ML Models
- KNN
- Naïve Bayes
- Logistic Regression
- Linear Regression
- SVM
- Decision Trees
2. Ensembles
- Bagging and Random Forest Classifier
- Boosting and Gradient Boosting Classifier
Important Note: Some topics are not covered in full detail; this blog provides a mathematical and intuitive explanation of the models.
Understanding Machine Learning
Looking from a 1,000-foot view, an ML model can be described as an algorithm, or an optimization problem, designed to solve a machine learning problem statement.
What is a machine learning problem, and how do these models help in solving one?
A machine learning problem is a problem that demands answers from an algorithm without being explicitly programmed. An ML model is designed so that it learns the behaviour of the data it is trained on and predicts answers for new data points.
A machine learning problem can broadly be categorized into two types:
- Supervised Learning
- Unsupervised Learning
This blog covers supervised learning, since all the models listed above fall under this approach. In simple terms, supervised learning is when we know the labels of the data and which class each point belongs to, and unsupervised learning is simply the opposite. There are also cases where supervised and unsupervised techniques are used together to solve a problem statement.
Note: I am making all these diagrams with simple Paint; if anything is confusing, do search it up and get your answer.
As you can see in the diagram, in supervised learning the data can have subgroups of red and blue points, but in unsupervised learning this information is missing.
In supervised learning, as shown in the figure, there are broadly two types of problems:
- Classification
Given data and its labels, the task is to decide which class, or subset of labels, any new data point belongs to.
- Regression
Given data, build a function that fits the data nicely and can give a definite answer for any new point.
Rather than statements, let's look at a fun way of understanding these types.
Let's say we have a Venn diagram something like this:
We have two superhumans, a classifier and a regressor. You rely on the classifier to tell you which food you want to eat at which time; for that, it throws a dart at this Venn diagram. For every dart, the regressor can tell how fast the dart travelled to the Venn diagram. The classifier has met a lot of people like you and has learnt your kind of behaviour, and the regressor has met a lot of people like the classifier and has made itself capable of predicting the speed.
Hence in graph terminology
Now, recall the point that the classifier has met a lot of people like you and understands your behaviour towards food choices at a given time: that was it training itself to give the classification answer. The same goes for the regressor: it has met a lot of people like the classifier and seen their dart-throwing skills, and so it is able to guess the speed of a dart thrown by such people.
This is, very intuitively, how a machine learning algorithm creates a model and is able to learn from data without being explicitly programmed.
Note: Training includes concepts like hyperparameter tuning, solving an optimization problem, and finding maxima or minima, which are not discussed in this blog.
Machine Learning Models
Now that we understand what a machine learning problem and a machine learning algorithm are, let's start understanding the models for such tasks.
K-NN / K-Nearest Neighbors
A K-NN model, or algorithm, is a non-parametric technique that uses the "K" nearest neighbours of a point and classifies the point by a majority vote among those neighbours.
Graphical Understanding
Let's say we have data something like this:
According to K-NN, the model looks at the K nearest neighbours of a data point and classifies the point according to the label that appears most often in its vicinity, for the chosen value of K.
In the figure, the "green point"'s 7 nearest neighbours are all red points, hence for K-NN it will be classified as a red point. Similarly, the "brown point"'s 7 nearest neighbours are all blue points, hence it will be classified as a blue point. For the "yellow point", the 7 nearest neighbours are 6 red and 1 blue, so it will be classified as a red point.
Algorithm
Input: D = {(x, y)}, y ∈ {0, 1}, K, and a query point x_q

def knn_classify(x_q, D, K):
    neighbors = get_k_neighbours(x_q, D, K)   # the K points of D closest to x_q
    cnt_0, cnt_1 = 0, 0
    for _, y in neighbors:                    # neighbours come back as (x, y) pairs
        if y == 1:
            cnt_1 += 1
        else:
            cnt_0 += 1
    if cnt_1 > cnt_0:
        return 1
    else:
        return 0
The above algorithm is K-NN for binary classification, when y = {0, 1}; in multi-class classification, only the counting inside the neighbours loop gets more conditions, and the rest remains the same. As for the get_k_neighbours() function, there are many types of distance used in K-NN to define the neighbours of a point.
Types of distances:
- Euclidean Distance
- Manhattan Distance
- Cosine Distance
There are many more; the above three are the most commonly used in practice.
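As a quick sketch, these three distances can each be written in a line or two of NumPy; the function names here are just illustrative:

import numpy as np

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # 1 minus the cosine similarity of the two vectors
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))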
Ways to find the K nearest neighbours
The brute-force algorithm above has a time complexity of O(nd) per query, which is the same at training and inference time. There are other, more efficient methods to find the K nearest neighbours, such as K-D trees, LSH, and others; in practice, the built-in libraries inspect the data and then pick an efficient algorithm for the K-NN computation.
Covering K-D trees and LSH would be a little out of scope for this blog, because going into them requires an understanding of training and testing, which would then make room for understanding why K-D trees and LSH are needed at all. For now, understand that it is because of time complexity that we go for approaches other than brute force to find the K nearest neighbours.
Failure Cases of K-NN
K-NN will not perform well in every situation.
In the first and the second image, the data point is far away from both classes, and the result will conflict depending on the value of "K" in K-NN.
Curse of Dimensionality
The curse of dimensionality indicates that as the dimensionality increases, the number of points needed to cover the space grows exponentially, so the data becomes sparse.
Taking Euclidean distance as the distance measure in 1D, 2D, 3D, and so on, consider the relative contrast (dist-max − dist-min) / dist-min between the farthest and the nearest neighbour of a query point. As the dimensionality increases, this quantity tends to "0", which makes dist-max and dist-min effectively equal, so K-NN is of no use on such data. A simple solution to this problem "can" be cosine distance, but it is preferable to be careful with K-NN on high-dimensional data.
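A tiny experiment (just a sketch with uniformly random data, not a proof) shows this shrinking contrast: the ratio (dist-max − dist-min) / dist-min from a query point to 1,000 random points drops towards 0 as the dimensionality d grows.

import numpy as np

rng = np.random.default_rng(0)
for d in [1, 10, 100, 1000]:
    X = rng.random((1000, d))                  # 1,000 random points in d dimensions
    q = rng.random(d)                          # a random query point
    dists = np.linalg.norm(X - q, axis=1)      # Euclidean distances to all points
    print(d, (dists.max() - dists.min()) / dists.min())   # relative contrast shrinks as d grows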
KNN for Classification and Regression
Evidently, K-NN can be used for both classification and regression. The methodology above shows how it works as a classification algorithm; for a regression problem, after finding the K nearest neighbours of a point, take the mean or median of their target values instead of the majority label. This easily converts its purpose from classification to regression; in practice, many models can be used for both classification and regression with a small tweak.
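In scikit-learn this tweak is just a change of estimator; a minimal sketch on synthetic data (the dataset and the choice K = 7 are only for illustration):

from sklearn.datasets import make_classification, make_regression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

Xc, yc = make_classification(n_samples=200, random_state=0)
clf = KNeighborsClassifier(n_neighbors=7).fit(Xc, yc)   # majority vote over 7 neighbours
print(clf.predict(Xc[:3]))

Xr, yr = make_regression(n_samples=200, random_state=0)
reg = KNeighborsRegressor(n_neighbors=7).fit(Xr, yr)    # mean of the 7 neighbours' targets
print(reg.predict(Xr[:3]))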
Naive Bayes Model
The Naive Bayes classifier is an approach to solving an ML problem using the Naive Bayes algorithm: it forms a simple optimization (arg-max) problem from Bayes' theorem, evaluating a score for each label in the data and picking the best one.
For the above definition to make sense, let's first understand the Bayes' theorem approach.
According to conditional probability,
P(A | B) = P(A ∩ B) / P(B) ... (1)
For Naive Bayes, we also use
P(B | A) = P(A ∩ B) / P(A) ... (2)
From (1) and (2),
P(A | B) = P(B | A) * P(A) / P(B)
Converting this to an ML-related problem, with Ck a class label and x a data point,
P(Ck | x) = P(x | Ck) * P(Ck) / P(x)
x is n-dimensional; hence, expanding P(Ck, x1, ..., xn) recursively with the chain rule (note the last factor is P(Ck), not P(Ck | x)) and applying the independence assumption,
P(Ck | x1, ..., xn) ∝ P(Ck) * P(x1 | Ck) * P(x2 | Ck) * ... * P(xn | Ck)
The main assumption a Naive Bayes model makes is conditional independence between the features xi, where "i" goes from 1 to n. In practice this is generally not true, so the assumption costs the NB model some predictive power; even so, the Naive Bayes classifier is the baseline algorithm for the task of text classification.
Naive Bayes for Text Classification
Laplace Smoothing
At inference time, the Naive Bayes classifier uses the training data corpus to get the likelihood probabilities. But if a new data point contains some xi that is not present in the training data, the whole prediction product becomes 0. To avoid this, we add a small value "alpha" to each count in the numerator and "alpha * (number of distinct values of xi)" to the denominator; this is called Laplace smoothing.
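A toy illustration of the smoothing (all the counts below are made up): a word that never appears in a class keeps a small but non-zero likelihood, so the product of likelihoods never collapses to zero.

# Laplace (additive) smoothing for one word's likelihood inside one class
alpha = 1.0
count_word_in_class = 0          # the word never occurs in this class's training text
total_words_in_class = 500       # total word count for the class
vocab_size = 10000               # number of distinct words in the training corpus

likelihood = (count_word_in_class + alpha) / (total_words_in_class + alpha * vocab_size)
print(likelihood)                # ~0.0000952, small but non-zero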
Interesting Thing about distribution of likelihoods of Naive-Bayes Classifier
Likelihoods are the conditional probabilities of xi being present given the label. With Laplace smoothing, if we keep increasing the alpha value, the distribution of these likelihoods starts to behave like a Gaussian distribution.
This brings in another assumption: assume a predefined distribution for the likelihoods and get the probability from that distribution's probability density function. Hence we see many types of Naive Bayes classifiers, e.g. Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, etc.
(Link to Sklearn doc of Naive Bayes)
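A minimal sketch of Naive Bayes for text classification with scikit-learn; the toy sentences, labels, and pipeline choices below are just assumptions for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free prize money", "meeting at noon", "win money now", "lunch with the team"]
labels = [1, 0, 1, 0]                                   # 1 = spam, 0 = not spam

model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))   # alpha is the Laplace smoothing value
model.fit(texts, labels)
print(model.predict(["free money now"]))                # expected to lean towards the spam class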
Logistic Regression
The Logistic Regression model attempts to draw a decision boundary between the points and classifies them based on the decision boundary generated.
The assumption the Logistic Regression model makes is that the data is linearly separable, hence it draws a linear decision boundary on it. Using this decision boundary, every point is classified into a label.
# for the diagram
if W.T * xi is +ve:
    point is "RED"
else:
    point is "BLUE"
Considering this, let's try to form an optimization problem out of it. An optimization problem is the problem of finding the best solution among all feasible solutions, i.e., in a way, a structured approach to solving a problem. Understanding how this optimization problem is formed makes the understanding of Logistic Regression very thorough and logical.
# for yi = +1, W.T * xi is +ve, hence yi * (W.T * xi) is +ve
# for yi = -1, W.T * xi is -ve, hence yi * (W.T * xi) is +ve
So, for all the points in the data, the sum of yi * (W.T * xi) should be as large as possible; this forms our simple optimization problem for Logistic Regression.
The problem with this formulation is that it is prone to outliers: the decision boundary can change very significantly because of a single outlier, especially when the outlier has a huge value. To deal with this, we need a function that keeps small values small and squashes large values into a bounded range, hence we use the sigmoid function.
Along with this, we introduce monotonicity into the optimization problem using log: a monotonic function preserves the ordering of its input, so it does not change where the optimum lies, and log is the classic example.
If x is increasing, log(x) is also increasing, which makes log a monotonic function. Combining all of this, we reach the optimization problem of the logistic regression model: find the W that minimizes the sum over all points of log(1 + exp(-yi * W.T * xi)).
Solving this optimization problem, typically with gradient-based methods such as gradient descent, gives the weight vector "W", through which the decision boundary is made.
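A minimal gradient descent sketch of that objective (NumPy only, assuming labels yi in {-1, +1}; the learning rate and step count are arbitrary):

import numpy as np

def train_logistic_regression(X, y, lr=0.1, steps=1000):
    # minimize (1/n) * sum_i log(1 + exp(-yi * W.T xi)) by plain gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)                            # yi * W.T xi for every point
        sigma = 1 / (1 + np.exp(-margins))               # sigmoid of the margins
        grad = -(X.T @ (y * (1 - sigma))) / len(y)       # gradient of the averaged loss
        w -= lr * grad
    return w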
Significance of the weight vector
The weight vector we get from our optimization problem defines the decision boundary we go for.
Probabilistic Interpretation
W = <W1, W2, ..., Wd> is the weight vector
if Wi is +ve -> Wi * xi is +ve (for positive xi) -> the probability of y'i = 1 is higher
if Wi is -ve -> Wi * xi is -ve (for positive xi) -> the probability of y'i = -1 is higher
So, for a feature Fi in the feature vector, if the corresponding Wi is positive, the probability of y'i being 1 is higher, and vice versa. This is the significance of, and the relation between, the weight vector and the feature vector.
Note: After understanding the optimization problem, it is important to understand "why regularization", "types of regularization", "how to solve an optimization problem" and "hyperparameter search", which answer the question of how to actually get the weight vector. This blog covers the concepts behind each model, hence it does not go into these topics.
Linear Regression
A linear regression model is a model that tries to fit the data using a line or a plane.
Prediction = W.T * xi + b, where W is the weight (normal) vector of the plane and b is the intercept.
Using the same approach as in logistic regression, let's form the optimization problem.
Looking at the diagram above, the regression plane makes errors on three of the red points out of all the data points, and this is the plane that gives the minimum total error for this data. Minimizing this error over the whole training data gives the optimization problem of Linear Regression.
The error we minimize in this problem is the mean squared error, which is directly visible in the optimization problem: W*, b* = argmin (1/n) * Σ (yi − (W.T * xi + b))².
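The same objective is one NumPy call away; a sketch on synthetic data (the true coefficients below are made up just to check the fit):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.01 * rng.standard_normal(100)

X1 = np.hstack([X, np.ones((100, 1))])        # extra column of 1s so the intercept b is learnt too
w, *_ = np.linalg.lstsq(X1, y, rcond=None)    # least squares == minimizing the mean squared error
print(w)                                      # roughly [2, -1, 0.5, 3]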
SVM ( Support Vector Machines)
SVM, or Support Vector Machines, are models that follow the approach of a "large margin classifier", i.e., they maximize the distance between the positive plane and the negative plane around the decision plane.
Using the margin maximization approach , we can actually form the optimization equation for SVM.
The diagram shows the equations of the pi, pi+ and pi− planes. All the points lying on the positive and negative planes have yi * (W.T * xi) = 1, and all the points beyond them have yi * (W.T * xi) > 1. Hence our optimization problem reads: construct a margin such that yi * (W.T * xi) ≥ 1 for every point, while maximizing the gap between the planes.
The optimization problem constructed above is the hard-margin problem; if the data is not linearly separable, it will not be able to build the margin classifier.
Introducing Hinge loss in the optimization problem
The first thing to note in the diagram is that the data is not linearly separable. For whichever points are wrongly classified, the value of the loss keeps increasing; looking at this behaviour, we introduce a "Zeta" variable.
Zeta = 0 , if yi * (W.T * xi) >= 1
Zeta > 0 , if yi * (W.T * xi) < 1
Intuitively, "Zeta" tells us whether a point is on the correct side of the plane and, if not, how wrong it is. For our margin classifier this should be as small as possible, hence we introduce the "hinge loss" into the optimization problem.
Using this optimization problem, the SVM can fit data that is not linearly separable. In the optimization problem, "C" controls the trade-off between maximizing the margin and minimizing the misclassification (Zeta) terms, and acts as a hyperparameter.
Loss-Minimization of Hinge Loss
Let yi * (W.T * xi + b) = zi
[
  zi > 0 : xi is correctly classified
  zi < 0 : xi is incorrectly classified
]
Hinge Loss
{
  zi >= 1 : hinge-loss = 0
  zi < 1  : hinge-loss = 1 - zi
}
Therefore,
hinge-loss = max(0, 1 - zi)
           = max(0, 1 - yi * (W.T * xi + b))
Both the soft-margin problem and the loss-minimization problem describe the same linear problem and become identical once constants are absorbed; the loss-minimization form is preferred when the regularization strength needs to appear explicitly alongside the regularizer. Hence the two optimization problems are equivalent.
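For reference, the hinge loss of the loss-minimization form is a one-liner in NumPy (a sketch, assuming labels in {-1, +1} and some already-chosen w and b):

import numpy as np

def average_hinge_loss(w, b, X, y):
    # zero loss for points beyond the margin, a linear penalty 1 - zi otherwise
    z = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - z))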
Note: There is a dual form of SVM, which has an interesting property called kernelization that allows SVM to fit non-linear data; the dual form is not discussed in this blog.
Decision Trees
A Decision Tree is exactly what its name says: it builds an "if-else" statement tree over the whole data and uses it to do the task.
How does the Decision Tree decide the splits and build the tree from the given data?
To answer this question, we have to understand some more theory related to Decision Trees.
Entropy
Entropy controls how a Decision Tree decides to split the data; it directly affects how a Decision Tree draws its boundaries.
Intuition on Entropy
For balanced data the entropy is high, whereas for imbalanced data the entropy is low. Intuitively, entropy measures the randomness of the labels: the more evenly the probability mass is spread over the label values, the higher it is. Hence a uniform distribution has the highest entropy; for a throw of a die, every throw has probability 1/6 but a different outcome.
Probability Distribution Function in Entropy
Kullback–Leibler (KL) divergence
Information Gain
It is a synonym for Kullback–Leibler divergence: the amount of information gained about a random variable or signal from observing another random variable. In the context of decision trees, however, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one.
Gini Impurity
Gini Impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset.
Gini impurity behaves similarly to entropy and is cheaper to compute, hence it can be a substitute for entropy.
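Both measures are simple functions of the class proportions; a small sketch (the label array below is made up):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))        # highest when the classes are perfectly balanced

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)             # same ordering as entropy, cheaper to compute

y = np.array([1, 1, 1, 0, 0, 1, 0, 1])
print(entropy(y), gini(y))                # ~0.954 and ~0.469 for this 5-vs-3 split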
Constructing a Decision Tree
The classic example for building a decision tree is given below.
This is the data we have for building the Decision Tree; the algorithm goes as follows:
Load the learning set and create the decision tree root node (rootNode); add the learning set S into the root node as its subset
For rootNode, compute Entropy(rootNode.subset) first
If Entropy(rootNode.subset) == 0 (the subset is homogeneous)
    return a leaf node
If Entropy(rootNode.subset) != 0 (the subset is not homogeneous)
    compute the Information Gain for each attribute left (not yet used for splitting)
    Find the attribute A with Maximum(Gain(S, A))
    Create child nodes for this root node and add them to rootNode in the decision tree
For each child of the rootNode
    Apply ID3(S, A, V)
Continue until a node with Entropy close to 0 or a leaf node is reached
(Note : Understanding of the algorithm)
Following the algorithm, the decision tree for this problem looks like this:
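In practice a library builds such a tree for us; a minimal sketch with scikit-learn on the iris dataset (scikit-learn uses CART rather than ID3, but criterion="entropy" gives information-gain-style splits):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree))      # prints the learnt if-else structure of the tree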
2. Ensembles
Ensembling is the idea of using multiple models to create a more powerful model for the task.
Bagging and Random Forest Classifier
In bagging, we split the data into multiple bootstrap samples (sampling with repetition), train one model on each sample, and in the end aggregate the outputs of all the models to perform the task.
Now, if the data Dn changes a little, only some of the samples, and hence some of the models, are affected, so the overall model does not lose its predictive power; bagging thus reduces variance in the models without impacting the bias.
Random Forest Classifier
Random Forest Classifier = Decision Trees (base learners) +
                           Bagging (row bagging) +
                           Column Sampling (column bagging)
A data sample is generated by selecting examples from the data randomly with replacement, and this sample is used to learn a decision tree. Further, while choosing the next feature for splitting, only a subset of randomly selected features is considered. The ensemble model is produced by using all of the decision trees generated.
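A minimal scikit-learn sketch of exactly those three ingredients (dataset and parameter values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,         # number of decision-tree base learners
    bootstrap=True,           # row bagging: sample rows with replacement
    max_features="sqrt",      # column sampling: a random subset of features at each split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))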
Boosting and Gradient Boosting Classifier
Boosting tends to reduce bias while keeping variance stable for the base learners in the ensemble.
Boosting: base learners with low variance and high bias + additively combine them.
Algorithm for boosting
- Form a large set of simple features
- Initialize weights for the training images
- For T rounds:
    - Normalize the weights
    - For the available features from the set, train a classifier using a single feature and evaluate the training error
    - Choose the classifier with the lowest error
    - Update the weights of the training images: increase if classified wrongly by this classifier, decrease if correctly
- Form the final strong classifier as a linear combination of the T classifiers (coefficient larger if the training error is small)
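The same re-weighting loop is available off the shelf; a minimal sketch with scikit-learn's AdaBoostClassifier, whose default base learner is a depth-1 decision tree (the dataset and round count T = 50 are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
ada = AdaBoostClassifier(n_estimators=50, random_state=0)   # 50 boosting rounds over weak learners
ada.fit(X, y)
print(ada.score(X, y))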
Gradient Boosting Classifier
In gradient boosting, the residual error is simply replaced by the negative gradient of the loss function; so if we have a differentiable loss function and a model that underfits the data (high bias), gradient boosting is an excellent choice.
Residual and Negative Gradient
Algorithm(Source):
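Not the algorithm from the linked source, but a toy sketch of the idea for squared loss, where the negative gradient is exactly the residual: each new small tree is fit to what the current ensemble still gets wrong, and the trees are combined additively.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

learning_rate, trees = 0.1, []
pred = np.zeros_like(y)                     # start from a constant prediction of 0
for _ in range(100):
    residual = y - pred                     # negative gradient of the squared loss
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(t)
    pred += learning_rate * t.predict(X)    # additively combine the weak learners
print(np.mean((y - pred) ** 2))             # training MSE keeps dropping round by round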
Summary
In summary, it is important to understand how these statistical models differ from each other in terms of methodology and assumptions, and what the deciding factor between them is for practical, real-world use.
Hope this blog gave a mathematical and intuitive understanding of these models.
Thank you.
LinkedIn : https://www.linkedin.com/in/raghav-agarwal-127539170