Lesson 1.13: Decision Tree Learning


A tree-like model that illustrates a series of events leading to certain decisions. Each node represents a test on an attribute, and each branch is an outcome of that test.

  • We use labeled data to obtain a suitable decision tree for future predictions.
  • We want a decision tree that works well on unseen data, while asking as few questions as possible.
  • Basic step: choose an attribute and, based on its values, split the data into smaller sets.
  • Recursively repeat this step until the label can be decided with certainty.

Pseudocode for making a decision tree
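
The notes reference pseudocode here; below is a minimal Python-style sketch of the recursive procedure described in the list that follows. All names (`build_tree`, the `score` argument, the `(features_dict, label)` sample format) are illustrative; `score` stands in for an attribute-scoring function such as the information gain defined later in this lesson.

```python
from collections import Counter

def build_tree(samples, attributes, score):
    """Recursively grow a decision tree.

    samples    -- list of (features_dict, label) pairs
    attributes -- attribute names still available for splitting
    score      -- function (samples, attribute) -> number; higher means a better split
    """
    labels = [label for _, label in samples]

    # Leaf cases: pure node, or no attributes left to split on.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority class

    # Choose the attribute with the best score (e.g. information gain).
    best = max(attributes, key=lambda a: score(samples, a))
    remaining = [a for a in attributes if a != best]

    # Split on each observed value of the chosen attribute and recurse.
    tree = {"split_on": best, "branches": {}}
    for value in {features[best] for features, _ in samples}:
        subset = [(feats, lab) for feats, lab in samples if feats[best] == value]
        tree["branches"][value] = build_tree(subset, remaining, score)
    return tree
```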

Decision Tree Learning Process

  • Input: Labeled training data (features + target class)
  • Goal: Build a tree that:
    • Generalizes well to unseen data
    • Makes decisions with minimal questions (shallow depth)
  • Recursive Splitting:
    • Select best attribute at each node
    • Split data into subsets based on attribute values
    • Repeat until stopping criteria met

When to Create a Leaf Node?

A node becomes a leaf when:

  • Pure node: All samples belong to one class (e.g., 100% "Yes" or 100% "No")
  • No remaining features to split on
  • Pre-pruning: Early stopping if split doesn't improve purity significantly

Entropy

Entropy measures the uncertainty or randomness in a dataset. It quantifies how "mixed" the class labels are in a given subset of data.
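
For a set $S$ whose samples fall into classes with proportions $p_1, \dots, p_k$, the entropy is $H(S) = -\sum_{i=1}^{k} p_i \log_2(p_i)$; it is 0 for a pure set and maximal (1 bit for two classes) when the classes are evenly mixed. A small Python helper (the name `entropy` is just illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)), where p_i is the proportion of class i in `labels`."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

# A perfectly mixed set of two classes has the maximum entropy of 1 bit.
print(entropy(["Yes"] * 4 + ["No"] * 4))  # 1.0
```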

Information Gain

Information Gain (IG) measures how much an attribute reduces uncertainty (entropy) about the class when splitting a dataset. It helps select the best attribute for decision tree nodes.

  • For attribute $A$ and dataset $S$:
    $\text{IG}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$
  • Where:
    • $H(S)$: Entropy of the original set $S$
    • $S_v$: Subset of $S$ where attribute $A$ has value $v$
    • $\frac{|S_v|}{|S|}$: Proportion of samples in subset $S_v$
  • Goal: Maximize IG → Choose the attribute that most reduces entropy after splitting.
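
As a rough sketch, the same computation in Python, reusing the `entropy` helper sketched in the Entropy section above (the function name and the `(features_dict, label)` sample format are assumptions of these sketches, not a fixed API):

```python
def information_gain(samples, attribute):
    """IG(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v).

    samples   -- list of (features_dict, label) pairs
    attribute -- the attribute A to evaluate
    """
    total = len(samples)
    labels = [label for _, label in samples]

    # Weighted entropy of the subsets produced by splitting on `attribute`.
    remainder = 0.0
    for value in {features[attribute] for features, _ in samples}:
        subset_labels = [lab for feats, lab in samples if feats[attribute] == value]
        remainder += (len(subset_labels) / total) * entropy(subset_labels)

    return entropy(labels) - remainder
```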

Step-by-Step Example

Dataset: Weather Conditions → Play Tennis?

Outlook     Humidity   Wind     Play Tennis?
Sunny       High       Weak     No
Sunny       High       Strong   No
Overcast    High       Weak     Yes
Rainy       High       Weak     Yes
Rainy       Normal     Weak     Yes
Rainy       Normal     Strong   No
Overcast    Normal     Strong   Yes
Sunny       High       Weak     No

Target: Predict "Play Tennis?" (Yes/No)
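
For the computation check further below, the same table as a Python structure, mirroring the `(features_dict, label)` format used in the sketches above:

```python
# Each sample is (features, label), one tuple per row of the table above.
tennis_data = [
    ({"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rainy",    "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rainy",    "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rainy",    "Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Weak"},   "No"),
]
```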

Information Gain Calculation in Decision Trees

Step 1: Calculate the Original Entropy $H(S)$

Dataset Statistics:

  • Total samples: 8
  • Class distribution:
    • "Yes" (Play Tennis): 4
    • "No" (Don't Play): 4

Entropy Formula: $H(S) = -\left( p_{\text{Yes}} \log_2(p_{\text{Yes}}) + p_{\text{No}} \log_2(p_{\text{No}}) \right)$

Calculation: $H(S) = -\left( \frac{4}{8} \log_2\left(\frac{4}{8}\right) + \frac{4}{8} \log_2\left(\frac{4}{8}\right) \right) = 1 \text{ bit}$

Step 2: Calculate Entropy After Splitting on "Outlook"

Attribute Values:

  • Sunny (3 samples)
  • Overcast (2 samples)
  • Rainy (3 samples)

a) Sunny Subset ($S_{\text{Sunny}}$)

  • Samples: 3 "No", 0 "Yes" → $H(S_{\text{Sunny}}) = 0 \text{ bits}$

b) Overcast Subset ($S_{\text{Overcast}}$)

  • Samples: 2 "Yes", 0 "No" → $H(S_{\text{Overcast}}) = 0 \text{ bits}$

c) Rainy Subset ($S_{\text{Rainy}}$)

  • Samples: 2 "Yes", 1 "No" → $H(S_{\text{Rainy}}) = -\left( \frac{2}{3} \log_2\left(\frac{2}{3}\right) + \frac{1}{3} \log_2\left(\frac{1}{3}\right) \right) \approx 0.918 \text{ bits}$

Weighted Average Entropy: $H(S|\text{Outlook}) = \left( \frac{3}{8} \times 0 + \frac{2}{8} \times 0 + \frac{3}{8} \times 0.918 \right) \approx 0.344 \text{ bits}$

Step 3: Compute Information Gain for "Outlook"

Information Gain Formula: $\text{IG}(S, \text{Outlook}) = H(S) - H(S|\text{Outlook})$

Calculation: $\text{IG}(S, \text{Outlook}) = 1 - 0.344 = 0.656 \text{ bits}$

Comparison with Other Attributes

Attribute    Information Gain (bits)
Outlook      0.656
Humidity     0.049
Wind         0.049
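
As a sanity check, running the `entropy` and `information_gain` sketches from earlier on the `tennis_data` structure reproduces these numbers:

```python
print(round(entropy([label for _, label in tennis_data]), 3))  # 1.0  -> H(S)

for attribute in ["Outlook", "Humidity", "Wind"]:
    print(attribute, round(information_gain(tennis_data, attribute), 3))
# Outlook 0.656
# Humidity 0.049
# Wind 0.049
```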

Key Takeaways

  1. Higher IG means better attribute for splitting
  2. "Outlook" provides the most information about the target variable
  3. The calculation follows:
    • Compute original entropy
    • Calculate weighted entropy after split
    • Subtract to get information gain

Pruning

Handling Numerical Attributes

Handling Missing Attributes

Random Forests (Ensemble Learning With Decision Trees)

  • Instead of building a single decision tree and using it to make predictions, build many slightly different trees and combine their predictions.
  • We have a single data set, so how do we obtain slightly different trees?
    1. Bagging (Bootstrap Aggregating):
      • Take random samples (with replacement) from the training set to create N new training sets.
      • Fit a decision tree on each subset.
    2. Random Subspace Method (also known as Feature Bagging):
      • Fit N different decision trees by constraining each one to operate on a random subset of features.
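
A minimal sketch of both ideas, assuming the `build_tree` helper sketched earlier in this lesson (helper names such as `bootstrap_sample` are illustrative):

```python
import random

def bootstrap_sample(samples):
    """Bagging: draw len(samples) samples with replacement from the training set."""
    return [random.choice(samples) for _ in samples]

def random_forest(samples, attributes, score, n_trees=10, n_features=2):
    """Grow n_trees slightly different trees; predictions would be combined by majority vote."""
    forest = []
    for _ in range(n_trees):
        data = bootstrap_sample(samples)               # 1. bagging
        feats = random.sample(attributes, n_features)  # 2. random subspace method
        forest.append(build_tree(data, feats, score))
    return forest
```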

Ensemble Learning

  • Ensemble Learning:
    • A method that combines multiple learning algorithms to obtain better performance than its individual components.
  • Random Forests are one of the most common examples of ensemble learning.
  • Other commonly-used ensemble methods:
    • Bagging: multiple models on random subsets of data samples.
    • Random Subspace Method: multiple models on random subsets of features.
    • Boosting: train models iteratively, while making the current model focus on the mistakes of the previous ones by increasing the weight of misclassified samples.

Random Forests are an ensemble learning method that employs decision tree learning to build multiple trees through bagging and the random subspace method. They mitigate the overfitting problem of individual decision trees!
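
In practice, a library implementation is typically used; for example, scikit-learn's RandomForestClassifier combines bagging with per-split random feature selection. The toy numeric encoding below (Sunny=0, Overcast=1, Rainy=2; High=0, Normal=1; Weak=0, Strong=1) is just an illustration of the tennis table, not part of the original notes.

```python
from sklearn.ensemble import RandomForestClassifier

# Rows are [outlook, humidity, wind] encoded as integers; labels mirror the tennis table.
X = [[0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0], [2, 1, 0], [2, 1, 1], [1, 1, 1], [0, 0, 0]]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]

# 100 trees, each fit on a bootstrap sample; at every split only a random
# subset of features (here, sqrt of the feature count) is considered.
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X, y)
print(model.predict([[1, 1, 1]]))  # predicted label for an Overcast/Normal/Strong day
```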
