What should we eat tonight? You decide.

Decision trees are fundamental to machine learning. They are the building blocks for ensemble models and introduce key concepts like information gain and entropy.

https://youtu.be/v6VJ2RO66Ag?si=pfufzqXeOOqfaROM

https://youtu.be/_L39rN6gz7Y?si=zG_kOhdZWKIMN8QN


Your client asks you for…

A Decision Tree!

Decision Trees

A non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple if/else rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
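A minimal sketch of what this looks like in practice, using scikit-learn and its bundled iris dataset (neither is mentioned in these notes; they are just convenient for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset purely for illustration
X, y = load_iris(return_X_y=True)

# Keep the tree shallow so the learned if/else rules stay readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# The fitted model is literally a nest of if/else rules over the features
print(export_text(clf, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))

# Predict the class of a new sample
print(clf.predict([[5.0, 3.4, 1.5, 0.2]]))
```

Printing the rules with export_text is the quickest way to see the piecewise-constant structure: every path from root to leaf ends in one constant prediction.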


↳ What is a Decision Tree's most important feature?

☞ Their explainability and transparency. They provide not only a decision but also the decision-making process behind it.


What is the hierarchy of a Decision tree?
         root
       /      \
   branch    branch
   /    \    /    \
leaf  leaf  leaf leaf

Branches are also called internal nodes



What is a Decision tree with only 1 root and 2 leaves called?

A decision stump; the most basic form of a decision tree.



  1. Random Forests: This is an ensemble learning technique that constructs a multitude of decision trees at training time. For classification tasks, the output of the random forest is the mode of the classes (the most frequent class) output by individual trees. In regression tasks, it’s the mean prediction of the individual trees. Random forests correct for decision trees’ habit of overfitting to their training set.

  2. Gradient Boosting Machines (GBM): An ensemble of decision trees built in a stage-wise fashion, like other boosting methods, but generalized by allowing optimization of an arbitrary differentiable loss function.

  3. XGBoost (Extreme Gradient Boosting): This is an efficient and scalable implementation of gradient boosting. It has gained popularity in machine learning competitions due to its speed and performance. XGBoost includes regularization (L1 and L2), which improves model generalization capabilities.

  4. LightGBM: A gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages: faster training speed and higher efficiency, lower memory usage, and better accuracy. It can handle large-scale data and is used for ranking, classification, and many other machine learning tasks.

  5. CatBoost: An algorithm for gradient boosting on decision trees. Developed by Yandex researchers and engineers, it is especially powerful for handling categorical features without much need for data preprocessing (like one-hot encoding).

  6. AdaBoost (Adaptive Boosting): This model works by combining multiple weak learners into a single strong learner. In the context of decision trees, weak learners are usually shallow trees. AdaBoost changes the weights of the training data based on the previous model’s success and continues this process iteratively, improving its predictions.

  7. Decision Stump: A decision tree with only a single split. It’s often used as a weak learner in the context of boosting algorithms.
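All of these are available as libraries (scikit-learn covers Random Forests, GBM, AdaBoost and stumps; XGBoost, LightGBM and CatBoost ship their own packages). A minimal sketch of three of them with scikit-learn, on synthetic data that is purely an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic dataset, just to have something to fit
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    # Bagging of many trees; classification output is the majority vote
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Trees added stage-wise to minimise a differentiable loss
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    # AdaBoost's default weak learner is a depth-1 tree, i.e. a decision stump
    "adaboost": AdaBoostClassifier(random_state=0),
}

for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```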


When should you use a decision tree?
  • Explainability
What are the steps in making a decision tree?
  • Using a pen and paper, fill in the following to compute a decision tree by hand
What other models are made up of Decision Trees?
  • information gain and entropy

  • x2.

  • works for non-linear data! - the true power of decision trees

  • A binary tree which recursively splits the dataset into pure leaf nodes: subsets that contain only one class (the target class)

  • isn’t this just if else?

    • how is it ML?
      • Information theory & how this simple tree is actually CONSTRUCTED (cos how on earth would you know)
        • Entropy
Cool, you know how to apply a decision tree. But how do they actually work?
So how on earth is the structure of the tree computed? How do you know the best way to split each node?

In simple terms, the best split is the one that makes the resulting subsets of data as homogeneous as possible (all of the same class). Recall that entropy is lowest when all datapoints are of the same class (minimum uncertainty), and highest when there is an even distribution of classes (maximum uncertainty).
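A minimal sketch of that arithmetic, with a made-up binary label array (the data is an assumption, only there to show how entropy and information gain are computed):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy: 0 bits for a pure node, 1 bit for a 50/50 binary mix
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    # Parent entropy minus the size-weighted entropy of the two children
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # even mix  -> entropy 1.0
left   = np.array([1, 1, 1, 1])              # pure node -> entropy 0.0
right  = np.array([0, 0, 0, 0])              # pure node -> entropy 0.0

print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0, the best split possible here
```

Building the tree is then a greedy loop: try every candidate split on every feature, keep the one with the highest information gain (equivalently, the lowest weighted child impurity), and recurse on the children until the leaves are pure or a stopping rule kicks in.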

Cross entropy (reminder)

Q3.1) List three different methods for quantifying the ‘impurity’ of a leaf node in a decision tree
  • Gini Impurity
  • Entropy
  • Information Gain (strictly, the reduction in impurity from a parent node to its children after a split)
Q3.2) Follow up: step me through calculating the Gini Impurity of this leaf node
Likes chocolate: 18
Does not like chocolate: 3
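A minimal worked answer, assuming the counts above really are 18 vs 3 (the original table was reconstructed, so treat those numbers as an assumption):

```python
def gini_impurity(counts):
    # Gini impurity = 1 - sum of squared class probabilities
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Assumed leaf counts: 18 like chocolate, 3 do not
print(gini_impurity([18, 3]))  # 1 - (18/21)**2 - (3/21)**2 ≈ 0.245
```

The same function with [21, 0] returns 0 (a pure leaf), and with a 50/50 mix it returns 0.5, the worst case for two classes.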