Comparative Analysis of Different Supervised Machine Learning Algorithms for Classification

Abstract

Supervised machine learning classifiers are used to produce a general model from externally provided instances. These models are then used to make predictions on newly provided instances. Supervised classification is one of the tasks most frequently carried out by intelligent systems and expert systems.

Many supervised machine learning algorithms have been introduced in the literature. The aim of this study is to describe some of the supervised machine learning (ML) classification techniques and to compare them in order to determine the most efficient classification algorithm. Five algorithms are compared in this study: k-Nearest Neighbours (KNN), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF) and Support Vector Machine (SVM), over twenty-one datasets. The algorithms are compared on the basis of accuracy and time metrics. The results of the study are discussed in later sections.

Introduction

Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. One of the major applications of machine learning is data mining. Machine Learning is often applied for data analysis and solving problems efficiently.

ML algorithms use different datasets with different numbers of features and instances. If instances are given with known labels (the corresponding correct outputs), the learning is called supervised, in contrast to unsupervised learning, where instances are unlabeled [14]. The figure shows the basic procedure of supervised machine learning classification.

In our study, we have considered the classification problem, in which the output of an instance is nominal or discrete. We have performed a basic test analysis on twenty-one different datasets, taken from the UCI machine learning repository and the KEEL repository. The datasets are analyzed using five of the most frequently used classification algorithms: KNN, Naïve Bayes, Decision Tree, Random Forest and SVM. The Weka tool is used to perform the experiment. We have imported all the datasets into Weka and classified them with the different classifiers to check their accuracy and time.

For every classifier we checked two things. The first is the percentage of correctly classified instances, measured over the complete training dataset and also by splitting the dataset into 70% for training and 30% for testing. The second is the time to build the model and to test the data, again over the complete dataset and with the 70%/30% split. In this work we attempted to explore which of the five algorithms performs well under normal conditions with respect to accuracy and time. Additionally, we have also tried to determine which algorithm most often faces the over-fitting problem.

The rest of the paper is organized as follows: the next section describes the methodology and the datasets used, the subsequent sections review the five classifiers (KNN, Naïve Bayes, Decision Tree, Random Forest and SVM), and the remaining sections present and discuss the results.

Methodology

In this paper, we have considered five supervised machine learning classification algorithms. The algorithms are tested on twenty-one datasets, which are listed in the table below. We have used the Weka tool to explore these datasets. On all twenty-one datasets, the five classifiers are applied one by one. For testing on the complete training set, 10-fold cross-validation is used. A percentage split is also used to test the data, in which 70% of the data is used for training and 30% for testing. In the iteration control, the number of repetitions is set to 3, and the significance value is set to 0.05.

While applying the KNN classifier, k is set to 3 and distance is calculated using the Euclidean distance. For the other algorithms, the parameters are left at their defaults. To compare the performance of these classifiers, percentage correct, elapsed training time, elapsed testing time, F-measure and area under the ROC curve are used.

F1: The F1 score conveys the balance between precision and recall, where

F1 = 2*((precision*recall)/(precision + recall))

Area under ROC: It tells how well the model is capable of distinguishing between classes. The higher the AUC, the better the model is at prediction.
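
The experiments in this study were run in Weka, but the same evaluation protocol can be sketched in a few lines of scikit-learn for readers who prefer Python. The snippet below is only an illustrative equivalent of the setup described above, not the configuration actually used; the built-in digits dataset plays the role of optdigits from the table below.

# Illustrative scikit-learn equivalent of the evaluation protocol above
# (the actual experiments were run in Weka; digits is a stand-in dataset).
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X, y = load_digits(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)      # k = 3, Euclidean distance by default

# 10-fold cross-validation over the complete dataset
cv_acc = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
print("10-fold CV accuracy: %.3f" % cv_acc.mean())

# 70% / 30% percentage split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
print("Split accuracy: %.3f" % accuracy_score(y_te, y_pred))
print("Macro F1:       %.3f" % f1_score(y_te, y_pred, average='macro'))
print("ROC AUC (OvR):  %.3f"
      % roc_auc_score(y_te, clf.predict_proba(X_te), multi_class='ovr'))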

Dataset | No. of Instances | No. of Attributes
Autism Adult (adult-weka.filters) | 704 | 21
abalone | 4174 | 9
Avila | 10437 | 11
banana | 5300 | 3
coil2000 | 9822 | 86
letter | 20000 | 17
magic | 19020 | 11
Marketing | 1876 | 14
optdigits | 5620 | 65
page-blocks | 5472 | 11
Penbased | 10992 | 17
Phoneme | 5404 | 6
ring | 7400 | 21
Satimage | 6435 | 37
Segment | 2310 | 20
Thyroid | 7200 | 22
Titanic | 2201 | 4
Twonorm | 7400 | 21
winequality-white | 4898 | 12
phishing | 11055 | 31
Texture | 5500 | 41

The figure shows a simple framework of our experiment.

KNN

KNN is a classification algorithm first proposed by T. M. Cover and P. E. Hart [1]. It is a widely used instance-based classification algorithm, owing to its simplicity and accuracy. It takes the k closest training examples to classify a test example, which is why it is called k-nearest neighbour. The value of k is selected by the user. The k nearest neighbours are selected on the basis of the distance between examples. Different distance metrics are used to find the distance between instances, or to check their similarity; Table 1 shows different distance metrics. KNN does not generate a model at training time for generalization, and uses the whole training set at run time (testing time) for classification. Since the training examples are needed at run time, they must be kept in memory, which is why it is also called a memory-based classifier.

Because classification is based directly on the training examples, it is also called example-based or case-based classification [2]. As induction is delayed to run time, it is considered a lazy learning technique [2]. It is a non-parametric classifier. A problem with majority-voting classification occurs when instances of a more frequent class tend to dominate the prediction of the new instance, since they tend to be common among the k nearest neighbours due to their large number [3]. This problem can be solved using weighted KNN. The performance of KNN can be improved by choosing an appropriate value of k and a suitable distance metric. An odd value of k helps prevent ties in the vote from biasing the classifier.

KNN also has many real-world applications, such as text categorization and text mining, climate forecasting, estimating soil water parameters, evaluating forest inventories, forecasting the stock market and currency exchange rates, understanding and managing financial risk, trading futures, loan management, bank customer profiling, money laundering analysis, estimating the amount of glucose in the blood of a diabetic person, and identifying risk factors for prostate cancer [4].

Notations

Set of Classes: Ω = {ω1,ω2,…,ωc} is the set of classes.

Set of Training Instances (Training Set):

X = {(X1, c1), (X2, c2), …, (Xn, cn)} is the set of all training instances available. Here, each Xi, for 1 ≤ i ≤ n, is an example and ci is the class to which it belongs, so ci ∈ Ω.

Algorithm

KNN(X, Y)

{Note: Y is the test pattern}

For each Xi in X:

             find di = D(Xi, Y)

Xk = Select the k instances from X with minimum di

Select c, the majority class in Xk

Output c as the class of Y

Mathematical Formulation

Using the Euclidean distance, the distance between a training instance Xi = (xi1, xi2, …, xid) and the test pattern Y = (y1, y2, …, yd) is

D(Xi, Y) = sqrt( Σj (xij − yj)² )
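
As a concrete illustration of the pseudocode above, the following short NumPy sketch implements k-nearest-neighbour classification with the Euclidean distance and majority voting; the function and variable names are illustrative and not taken from any particular library.

# Minimal NumPy sketch of the KNN pseudocode above (illustrative only).
# X_train: (n, d) feature matrix, y_train: (n,) class labels, Y: (d,) test pattern.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, Y, k=3):
    # d_i = D(X_i, Y): Euclidean distance from every training instance to Y
    dists = np.sqrt(((X_train - Y) ** 2).sum(axis=1))
    # X_k: indices of the k instances with minimum distance
    nearest = np.argsort(dists)[:k]
    # c: majority class among the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# usage: two well-separated classes; the query point lies near class 1
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([7.8, 7.9]), k=3))  # -> 1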

Naïve Bayes

The Bayesian classifier is a statistical classifier. It is used to predict class membership probabilities, such as the probability that a given tuple belongs to a particular class [6]. The Bayesian classifier is based on Bayes' theorem, and the simplest form of Bayesian classifier is known as the naïve Bayesian classifier [6]. Naïve Bayes is one of the most useful classification techniques in machine learning and data mining. It is a simple probabilistic classifier. It is a generative-model-based classifier with a fast learning and testing process [5]. It assumes that the presence or absence of a particular feature of a class is unrelated to the presence or absence of any other feature.

In the learning process of this classifier, with a known structure, class probabilities and conditional probabilities are calculated from the training data, and the values of these probabilities are then used to classify new observations [7]. The learning module constructs a probabilistic model of the features and uses that model to predict the classification of a new sample. It is mainly used to solve classification problems. In the case of numerical attributes, some extra processing is needed, either converting the numerical values to categorical values or assuming a distribution for them, because the basic formulation works with categorical attributes. It is called naïve because the formulation makes some naïve independence assumptions. Naïve Bayes is a reasonable classifier in this sense, with minimal storage requirements and fast training; it is applied to time- and storage-critical applications, such as automatically classifying web pages into types and spam filtering [8]. It can also be used in text classification, spam filtering, sentiment analysis and recommendation systems.

Mathematical Formulation

Probability formula for Naïve Bayes (Bayes' theorem):

P(C|X) = (P(X|C) * P(C)) / P(X)

where

  • P(C|X) is the posterior probability of the class (target) given the predictor (attribute).
  • P(C) is the prior probability of the class.
  • P(X|C) is the likelihood, which is the probability of the predictor given the class.
  • P(X) is the prior probability of the predictor.

If the training data contain a continuous attribute x, then the conditional probability is computed using a Gaussian distribution:

P(x|C) = (1 / sqrt(2πσ²)) * exp(−(x − μ)² / (2σ²))

where μ and σ² are the mean and variance of the attribute values observed for class C in the training data.

Algorithm

  1. Create a frequency table for all the features against the different target classes.
  2. Draw the likelihood table for the features against the classes.
  3. Calculate the posterior probabilities for all the classes, based on the formula given above; the class with the highest posterior probability is the prediction.
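
A minimal sketch of this procedure for continuous attributes, handled with the Gaussian formula above, is shown below using scikit-learn's GaussianNB. It is only an illustrative equivalent (the experiments in this paper were run in Weka), and the iris dataset is a stand-in.

# Illustrative sketch: Gaussian Naive Bayes on continuous attributes
# (scikit-learn; the paper's experiments used Weka).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)             # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

nb = GaussianNB()                             # class priors P(C) and per-class mean/variance
nb.fit(X_tr, y_tr)                            # are estimated from the training data
print("Accuracy: %.3f" % nb.score(X_te, y_te))
print("Class priors P(C):", nb.class_prior_)  # learned prior probabilities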

Decision Tree

A decision tree is a supervised machine learning classifier expressed as a recursive partition of the instance space or training set. It consists of nodes that form a rooted tree. Nodes with outgoing edges are called internal or test nodes, while nodes with no outgoing edges are called leaves, terminal nodes or decision nodes. Each leaf is assigned the most appropriate target class. Instances are classified by navigating from the root to a leaf, according to the outcomes of the tests along the path. In general, less complex decision trees are preferred. According to Breiman et al. (1984), tree complexity has a crucial effect on accuracy. Usually the tree complexity is measured by one of the following metrics: the total number of nodes, the total number of leaves, the tree depth and the number of attributes used [9]. Normally, stopping criteria and pruning methods are used to control tree complexity.

The main goal in the construction of a decision tree is to find an optimal tree by minimizing the generalization error. Another target may be to minimize the complexity of the tree. However, finding an optimal decision tree is not an easy task: it has been shown that finding a minimal decision tree consistent with the training set is NP-hard (Hancock et al., 1996) [9]. Decision tree induction follows a top-down approach, and there are various top-down decision tree inducers such as ID3, C4.5 and CART [9].

Decision tree induction is a greedy, recursive algorithm that follows a top-down divide-and-conquer approach: it recursively partitions the training dataset until all the examples in a partition belong to a particular class. Most decision tree classifiers work in two steps: tree growing (or building) and tree pruning. In the tree-building phase, the tree is recursively partitioned until all the data items in a node share the same class label. In the tree-pruning phase, the fully grown tree is cut back, in a bottom-up fashion, to prevent over-fitting and improve the accuracy of the tree [11].

Decision trees have a wide range of applications, such as business, intrusion detection, energy modeling, e-commerce, image processing, medicine, industry, intelligent vehicles, remote sensing and web applications [10].

ID3

The ID3 (Iterative Dichotomiser 3) algorithm is based on the Concept Learning System (CLS) algorithm [11]. In the growth phase, the algorithm recursively partitions the training set until the record sets belong to a single class label. To select the best attribute for splitting, the algorithm uses entropy-based information gain: the attribute with the highest information gain is selected for splitting. The classification results of ID3 are affected by large datasets and by noise in the data, so the data should be preprocessed to remove noise. ID3 only accepts categorical attributes when building a tree model.

C4.5

The C4.5 algorithm is an improved version of ID3; it uses the gain ratio as its splitting criterion in the tree-growth phase, instead of the information gain used in ID3 [11]. It accepts both continuous and discrete attributes. To handle continuous attributes, C4.5 specifies a threshold used to split the value list (discretization). The algorithm also handles missing values.

CART

Unlike ID3 and C4.5, CART (Classification and Regression Trees) generates a binary tree. It is also used for solving regression problems by generating regression trees. The CART decision tree is a binary recursive partitioning procedure capable of processing continuous and nominal attributes both as targets and as predictors [11]. In CART, trees are grown to a maximum size, using the Gini index as the splitting criterion and without a stopping rule, and are then pruned back (essentially split by split) towards the root via cost-complexity pruning [11]. The CART mechanism includes automatic probability tree estimation, dynamic feature construction, cost-sensitive learning, missing value handling and (optional) class balancing.

Splitting Criteria

For a node S with class proportions p1, p2, …, pc, the splitting criteria used by the above inducers are:

Entropy(S) = −Σi pi * log2(pi)

Information Gain (ID3): Gain(S, A) = Entropy(S) − Σv (|Sv|/|S|) * Entropy(Sv), where Sv is the subset of S for which attribute A takes value v.

Gain Ratio (C4.5): GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = −Σv (|Sv|/|S|) * log2(|Sv|/|S|).

Gini Index (CART): Gini(S) = 1 − Σi pi².
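
A small Python sketch of these criteria, computed from the class counts at a node, is given below; the helper names are illustrative.

# Illustrative helpers for the splitting criteria above, computed from
# the class counts at a node.
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()                    # class proportions (ignore empty classes)
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    # weighted average entropy of the child nodes produced by a split
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# usage: a node with a 6/4 class split, candidate split into (4, 0) and (2, 4)
print(entropy([6, 4]), gini([6, 4]))
print(information_gain([6, 4], [[4, 0], [2, 4]]))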

Algorithms

Following are the algorithms for tree building and pruning from [11]:

  Build_Tree (data set S)
      if all records in S belong to the same class,
          return;
      for each attribute Ai
          evaluate splits on attribute Ai;
      use the best split found to partition S into S1 and S2;
      Build_Tree (S1);
      Build_Tree (S2);
  End Build_Tree;

  Prune_Tree (node t)
      if t is a leaf, return C(S) + 1
          /* C(S) is the cost of encoding the classes for the records in set S */
      minCost1 := Prune_Tree (t1);
      minCost2 := Prune_Tree (t2);            /* t1, t2 are t's children */
      minCostt := min{ C(S) + 1, Csplit(t) + 1 + minCost1 + minCost2 };
                                              /* Csplit: cost of encoding a split */
      return minCostt;
  End Prune_Tree;
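
The build-and-prune procedure above can be sketched with scikit-learn's CART-style DecisionTreeClassifier, where cost-complexity pruning is controlled by the ccp_alpha parameter. This is only an illustrative equivalent of the Weka setup used in the experiments, and the breast cancer dataset is a stand-in.

# Illustrative sketch: growing a CART-style tree and pruning it back with
# cost-complexity pruning (scikit-learn's DecisionTreeClassifier).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)    # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Build_Tree: grow the tree until the leaves are pure (no pruning)
full_tree = DecisionTreeClassifier(criterion='gini', random_state=0).fit(X_tr, y_tr)

# Prune_Tree: grow and then cut the tree back via cost-complexity pruning
pruned_tree = DecisionTreeClassifier(criterion='gini', ccp_alpha=0.01,
                                     random_state=0).fit(X_tr, y_tr)

print("Full tree   - leaves: %d, test accuracy: %.3f"
      % (full_tree.get_n_leaves(), full_tree.score(X_te, y_te)))
print("Pruned tree - leaves: %d, test accuracy: %.3f"
      % (pruned_tree.get_n_leaves(), pruned_tree.score(X_te, y_te)))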

Random Forest

Random forests (Breiman, 2001) are a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them [12]. Bagging, or bootstrap aggregation, is a technique that uses multiple classifiers modeled on different data samples. It reduces the high variance in the predicted results by combining the results of multiple classifiers. These techniques are used to improve the predictions of decision trees.

On many problems the performance of random forests is very similar to boosting, and they are simpler to train and tune [12]. Boosting is also a technique based on multiple tree-based classifiers, used to improve tree predictions. In boosting, trees are grown sequentially: each tree is constructed using information from the previous tree.

A random forest is a classifier consisting of a collection of tree-structured classifiers {t(x, Θb), b = 1, …, B}, where the {Θb} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x [13].

Algorithm

  1. For b = 1 to B:
     (a) Draw a bootstrap sample Z of size N from the training data.
     (b) Grow a random-forest tree Tb on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size nmin is reached:
         i. Select m variables at random from the p variables.
         ii. Pick the best variable/split-point among the m.
         iii. Split the node into two daughter nodes.
  2. Output the ensemble of trees {Tb}, b = 1, …, B.
  3. To make a prediction at a new point x (classification), take the class predicted most frequently by the trees {Tb} (the majority vote) and return it as the prediction for x.

N is the size of the training set.

p is the total number of features (variables) in the dataset.

m is the number of features selected at random at each split.

For classification, the default value for m is ⌊√p⌋ and the minimum node size is one. From the training set, about two thirds of the instances are selected for each bootstrap dataset, and the remaining one third are said to be out-of-bag (OOB) instances. These OOB instances are passed through the random forest to check the accuracy of the classifier.

Parameter: The only parameter to be tuned in a random forest is the number of trees.
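
A minimal scikit-learn sketch of this setup is shown below, with B trees, m = √p features per split and OOB accuracy estimation; it is only an illustrative equivalent of the Weka configuration used in the experiments, and the digits dataset is a stand-in.

# Illustrative sketch: Random Forest with B trees and out-of-bag (OOB)
# accuracy estimation (scikit-learn).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)                    # stand-in dataset
rf = RandomForestClassifier(n_estimators=100,          # B, the number of trees
                            max_features='sqrt',       # m = sqrt(p) features per split
                            min_samples_leaf=1,        # minimum node size of one
                            oob_score=True,            # evaluate on out-of-bag instances
                            random_state=0)
rf.fit(X, y)
print("OOB accuracy estimate: %.3f" % rf.oob_score_)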

SVM

A Support Vector Machine (SVM) is a discriminative machine learning classifier formally defined by a separating hyperplane or decision plane. The hyperplane separates the given labeled training data. Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers [15]. The output of an SVM is an optimal hyperplane which categorizes, or classifies, new examples. If the number of input features is 2, the hyperplane is just a line dividing the plane into two parts, with the instances of a single class lying on each side of the line; if the number of input features is 3, the hyperplane becomes a two-dimensional plane. A schematic example is shown below, where the boundary line separates two classes.

The figure above shows a classic example of a linear classifier. It depicts a simple classification problem, but in most cases classification is not that simple, and more complex structures are needed to classify the data correctly. One such situation is depicted in the example below:

The figure above shows an example of a non-linear SVM. To solve a non-linear problem, a set of mathematical functions known as kernel functions is needed. These functions are used to rearrange the objects; the process of rearranging them is called transformation or mapping. One such mapping is depicted below:

The transformation has mapped the non-linear problem to a linear problem, so instead of drawing a complex curve we only need to find an optimal line that separates the two classes.

The problem here is that there are many hyperplanes that might classify the data. The standard choice is the hyperplane that represents the largest separation, or margin, between the support vectors of the two classes. Support vectors are the objects, or data points, that lie closest to the hyperplane.

Tuning Parameters:

Kernel

The kernel function defines how instances are compared, or how they are implicitly mapped into a higher-dimensional space, as discussed above. Commonly used kernels include the linear, polynomial and radial basis function (RBF) kernels.

Regularization

The regularization parameter, often denoted C, controls the trade-off between a large margin and errors on the training data. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of classifying all the training points correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points [16].

Gamma

With a low gamma, points far away from the plausible separation line are considered in the calculation of the separation line, whereas a high gamma means that only the points close to the plausible line are considered [16].

Margin

A margin is a good margin if the separating hyperplane is equidistant from the closest points of both classes and the separation is as large as possible.
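
The three tuning parameters above map directly onto the kernel, C and gamma arguments of scikit-learn's SVC. The sketch below is only an illustrative example (the experiments in this paper were run in Weka), and the breast cancer dataset is a stand-in.

# Illustrative sketch of the SVM tuning parameters discussed above
# (scikit-learn's SVC).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)    # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# kernel: non-linear RBF mapping; C: regularization (margin vs. training error);
# gamma: how far the influence of a single training point reaches
svm = make_pipeline(StandardScaler(),
                    SVC(kernel='rbf', C=1.0, gamma='scale'))
svm.fit(X_tr, y_tr)
print("Test accuracy: %.3f" % svm.score(X_te, y_te))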
