Decision Tree Explained


What is a Decision Tree?
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). 
The paths from root to leaf represent classification rules.

A decision tree builds regression or classification models in the form of a tree structure. It divides a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.

NOTE: Decision trees can handle both categorical and numerical data. 

Decision trees follow a divide-and-conquer method:
1. Split the data into subsets to form the decision rules.
2. Check which subset the new data falls into.

If any subset is a "pure subset" (a subset with only one type of class), then there is no need to split that subset any further. Splitting is required only for those subsets which are not pure.
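As a minimal illustration (a Python sketch; a subset is represented simply as a list of class labels), the "pure subset" check looks like this:

def is_pure(labels):
    # A subset is pure when every record carries the same class label.
    return len(set(labels)) <= 1

print(is_pure(["Yes", "Yes", "Yes"]))   # True  -> no further split needed
print(is_pure(["Yes", "No", "Yes"]))    # False -> this subset must be split again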


Decision Tree Algorithms
1. ID3 Algorithm    (Iterative Dichotomiser 3) 
2. CART Algorithm (Classification and Regression trees)

A.  ID3 Algorithm
Pseudocode:
a. split node:
 1. Let A be the best feature for splitting the data.
 2. Feature A has a number of possible values, so we split on A accordingly.
 3. For each possible value of A, create a new child node.
 4. Distribute the data among the child nodes accordingly.
 5. In each child node:
   i. if the subset is pure: STOP splitting
   ii. else: SPLIT (recurse on the child node)
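Below is a rough Python sketch of this pseudocode (an illustration only, assuming each record is a dict of categorical attribute values plus a hypothetical "label" key; it reuses is_pure from above and choose_best_feature, which is sketched under STEP 4 below):

def build_tree(records, attributes):
    # Recursively split records until every subset is pure (ID3 pseudocode above).
    labels = [r["label"] for r in records]
    if is_pure(labels):                        # i. pure subset: STOP splitting
        return labels[0]                       # leaf node carrying the class label
    if not attributes:                         # no attributes left: majority vote
        return max(set(labels), key=labels.count)

    best = choose_best_feature(records, attributes)         # 1. best feature A
    children = {}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in records):             # 3. one child per value of A
        subset = [r for r in records if r[best] == value]   # 4. split the data accordingly
        children[value] = build_tree(subset, remaining)     # ii. else: SPLIT (child node)
    return {best: children}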

Which attribute/feature to split on?
Now comes the first part of the algorithm: which split of the data is best?

For this, ID3 uses Entropy and Information Gain to construct the decision tree. 

Entropy:
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous/pure).
The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, its entropy is zero; if the sample is equally divided between the classes, it has an entropy of one.
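In formula form, Entropy(S) = -Σ pi * log2(pi), where pi is the proportion of class i in the sample. A minimal sketch of that calculation:

from math import log2

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    total = len(labels)
    proportions = [labels.count(c) / total for c in set(labels)]
    return sum(-p * log2(p) for p in proportions)

print(entropy(["Yes"] * 10))              # 0.0 -> completely homogeneous sample
print(entropy(["Yes"] * 5 + ["No"] * 5))  # 1.0 -> equally divided sample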

We need to calculate two types of entropy using frequency tables, as follows:
1. Entropy using the frequency table of the Target Variable alone.
2. Entropy using the frequency table of two attributes: the Target variable against each predictor individually.

Information Gain:
The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
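Reusing the entropy function above, these two quantities can be sketched as follows (the same hypothetical record/label representation as in the build_tree sketch):

def split_entropy(records, attribute):
    # Weighted entropy of the subsets produced by splitting on one attribute.
    total = len(records)
    weighted = 0.0
    for value in set(r[attribute] for r in records):
        subset = [r["label"] for r in records if r[attribute] == value]
        weighted += len(subset) / total * entropy(subset)
    return weighted

def information_gain(records, attribute):
    # Gain = entropy of the target variable minus the weighted entropy after the split.
    return entropy([r["label"] for r in records]) - split_entropy(records, attribute)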

Let's use the following dataset to build our Decision Tree:
STEP 1: Calculating Entropy of the Target Variable:
STEP 2: Calculating Entropy for each predictor individually with the Target Variable:
Similarly, calculate entropy for all the predictors. 

STEP 3: Calculating Information Gain. 
Subtract the entropy of the split on each predictor (computed in STEP 2) from the entropy of the Target variable. The result is the Information Gain, or decrease in entropy.

STEP 4: Choosing the best split 
Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.
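Putting the pieces together, the selection step used by the build_tree sketch above simply maximises the information gain. The records below are hypothetical and only illustrate the mechanics, not the dataset from the figures:

def choose_best_feature(records, attributes):
    # Pick the attribute with the largest information gain as the decision node.
    return max(attributes, key=lambda a: information_gain(records, a))

records = [
    {"Outlook": "Sunny",    "Windy": "False", "label": "No"},
    {"Outlook": "Sunny",    "Windy": "True",  "label": "No"},
    {"Outlook": "Overcast", "Windy": "False", "label": "Yes"},
    {"Outlook": "Rainy",    "Windy": "True",  "label": "No"},
    {"Outlook": "Rainy",    "Windy": "False", "label": "Yes"},
]
print(choose_best_feature(records, ["Outlook", "Windy"]))   # "Outlook" wins here
print(build_tree(records, ["Outlook", "Windy"]))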

NOTE: 
1. A branch with entropy of 0 is a leaf node.
2. A branch with entropy more than 0 needs further splitting.
3. The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.


B.  CART Algorithm
The CART algorithm is structured as a sequence of questions, the answers to which determine what the next question, if any, should be. The result of these questions is a tree-like structure where the ends are terminal nodes, at which point there are no more questions.
The decision tree built by CART is always a binary decision tree, i.e., each internal node has exactly two child nodes.
A simple example of a decision tree is as follows:


The impurity/purity measure used in building decision trees with CART is the Gini index.
GINI INDEX:

The Gini index is used in classification trees and provides an indication of how “pure” the resulting nodes are:

Gini(S) = 1 - Σ (pj)^2

where pj is the proportion of observations with target class j.
If the target variable is binary, the maximum Gini index is 0.5, when the classes are perfectly mixed, and the minimum value is 0, when we have a homogeneous/pure set.
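A minimal sketch of that calculation (same list-of-labels representation as the entropy snippet above):

def gini(labels):
    # Gini(S) = 1 - sum(p_j^2) over the class proportions p_j.
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini(["Yes"] * 5 + ["No"] * 5))   # 0.5 -> perfectly mixed binary classes
print(gini(["Yes"] * 10))               # 0.0 -> homogeneous/pure set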

We need to calculate 2 types of Gini:

1. Gini of the target variable.
2. Gini of the attribute/feature after split.


STEP 1: Calculating Gini of the target variable:


Gini = 1 - 0.75^2 - 0.25^2  =  0.375





STEP 2: Calculating Gini of the attribute/feature after the split.


Gini(TL) = 1 - (2/7)^2 - (5/7)^2 = 0.4082

Gini(TR) = 1 - (3/13)^2 - (10/13)^2 = 0.3550
Similarly, calculate Gini(S,T) for all attributes/features.
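The figures for this step are not shown here, but the standard CART bookkeeping is to weight each child's Gini by its share of the records and compare the result with the parent's Gini. A sketch, assuming the 20-record example implied by the counts above (7 records in TL, 13 in TR):

def gini_split(parent_labels, child_label_subsets):
    # Weighted Gini after the split, plus the resulting Gini gain (impurity reduction).
    total = len(parent_labels)
    weighted = sum(len(child) / total * gini(child) for child in child_label_subsets)
    return weighted, gini(parent_labels) - weighted

# With the split above: weighted Gini = (7/20) * 0.4082 + (13/20) * 0.3550 ≈ 0.3736,
# so the Gini gain is 0.375 - 0.3736 ≈ 0.0014.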


STEP 3: Choosing the best split.
The attribute whose split gives the largest Gini gain (equivalently, the lowest weighted Gini after the split) is favoured for the decision node.

STEP 4: Repeat the process until all the samples are classified.


Overfitting and Pruning:
Overfitting happens when the learning algorithm continues to develop hypotheses that reduce the training error at the cost of an increased test error, i.e., the tree fails to generalize to the test data.
Solution to Overfitting in decision tree ---> "Pruning"
1. Pre-pruning : stop growing the tree early, before it perfectly classifies the training data.

2. Post-pruning: allow the tree to perfectly classify the training set, and then prune it back.
Practically, post-pruning is more successful than pre-pruning because it is not easy to precisely estimate when to stop growing the tree.
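As a practical illustration (a sketch with scikit-learn, which the post itself does not cover), pre-pruning corresponds to capping the tree while it grows, while post-pruning corresponds to growing the tree fully and then pruning it back, e.g. via cost-complexity pruning:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via depth / leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune back with cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))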



