
Ultimate Guide: Classification and Regression Tree (CART)

The decision tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, it can be used to solve both regression and classification problems. The goal of using a decision tree is to create a model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior (training) data.


Important Terminology related to Decision Trees

  • Root node: Represents the entire population or sample, which further gets divided into two or more homogeneous sets.
  • Decision node: A sub-node that is further divided into sub-nodes.
  • Pruning: The process of removing sub-nodes from a decision node.
  • Splitting: The process of dividing a node into two or more sub-nodes.
  • Terminal or leaf node: A node that does not split any further.
  • Parent and child node: A node that is divided into sub-nodes is called the parent node; its sub-nodes are called child nodes.
  • Branch or sub-tree: A subsection of the entire tree.

Classification And Regression Tree (CART)

CART (Classification And Regression Tree) is a variation of the decision tree algorithm that can handle both classification and regression tasks. Scikit-Learn uses the CART algorithm to train decision trees (also called “growing” trees). CART was introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in 1984. It also serves as the basis of ensemble methods such as bagged decision trees, boosted decision trees, and random forests.

Data preparation for CART: No special data preparation is required for the CART algorithm.
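Since Scikit-Learn's decision trees are grown with CART, a minimal sketch of training one looks like this; the dataset and the max_depth value are only illustrative choices.

```python
# Minimal sketch: growing a CART classification tree with Scikit-Learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" is the default splitting measure; max_depth is illustrative.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```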

There are marked differences between regression trees and classification trees.

  • Regression trees: Predict continuous values based on prior (training) data. For instance, to predict the price of an item, historical price data is analyzed (see the regression sketch after this list).
  • Classification trees: Predict a discrete outcome, usually a yes-or-no answer to whether an event occurred. This type of decision tree is often used in real-world decision-making.
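For the regression case, Scikit-Learn offers DecisionTreeRegressor, which is grown with the same CART procedure. The tiny pricing dataset below is made up purely for illustration.

```python
# Minimal sketch: a CART regression tree on a small, made-up pricing dataset.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features per item: [age_in_years, condition_score]
X = np.array([[1, 9], [2, 8], [3, 6], [5, 5], [7, 3], [10, 2]])
y = np.array([900, 820, 650, 500, 300, 150])  # made-up prices

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict([[4, 6]]))  # predicted price for an unseen item
```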

Gini index

Decision trees are widely used when implementing machine learning algorithms. The hierarchical structure of a decision tree leads us to the final outcome by traversing the nodes of the tree. Each node tests an attribute or feature and is further split into more nodes as we move down the tree. But how do we decide:

  • Which attribute/feature should be placed at the root node?
  • Which features will act as internal nodes or leaf nodes?

To decide this, and how to split the tree, we use splitting measures like Gini Index, Information Gain, etc.

The Gini Index, or Gini impurity, measures the probability of a randomly chosen element being classified incorrectly if it were labeled randomly according to the class distribution at that node.

But what is actually meant by ‘impurity’?

If all the elements belong to a single class, the node can be called pure. The Gini Index varies between 0 and 1 (a short computation sketch follows the list below):

  1. ‘0’ denotes that all elements belong to a single class, i.e. the node is pure.
  2. Values approaching ‘1’ indicate that the elements are spread randomly across many classes (impure).
  3. A Gini Index of ‘0.5’ denotes elements equally distributed across two classes.
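To make these values concrete, here is a minimal sketch of the usual Gini impurity formula, Gini = 1 − Σ pᵢ², in plain Python; the class-count lists are made up for illustration.

```python
# Minimal sketch: Gini impurity of a node, Gini = 1 - sum(p_i ** 2),
# where p_i is the fraction of samples belonging to class i at that node.
def gini_impurity(class_counts):
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([10, 0]))   # 0.0 -> pure node, single class
print(gini_impurity([5, 5]))    # 0.5 -> two classes, evenly mixed
print(gini_impurity([1] * 10))  # 0.9 -> many classes, approaching 1
```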

Advantages of CART

  • Results are simple to interpret.
  • Classification and regression trees are nonparametric and can capture nonlinear relationships.
  • Classification and regression trees implicitly perform feature selection.
  • Outliers have no meaningful effect on CART.
  • It requires minimal supervision and produces easy-to-understand models.

Limitations of CART

  • Overfitting: a fully grown tree can fit noise in the training data (a pruning sketch follows this list).
  • High variance: small changes in the training data can produce a very different tree.
  • Low bias: a fully grown tree fits the training data very closely, which goes hand in hand with the high variance above.
  • The tree structure may be unstable.
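Overfitting and variance are usually reduced by limiting tree growth or pruning. The sketch below shows two common Scikit-Learn options, a depth cap and cost-complexity pruning via ccp_alpha; the specific parameter values are only illustrative.

```python
# Minimal sketch: two common ways to rein in an overgrown CART tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree for comparison (tends to overfit).
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# 1) Pre-pruning: stop growth early by capping depth and leaf size.
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

# 2) Post-pruning: cost-complexity pruning controlled by ccp_alpha.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print("full depth:", full.get_depth(),
      "| shallow depth:", shallow.get_depth(),
      "| pruned depth:", pruned.get_depth())
```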

Applications of the CART algorithm

  • For quick data insights.
  • In blood donor classification.
  • For environmental and ecological data.
  • In the financial sector.

Conclusion

This article has covered the basic theory behind CART. Its classification and regression implementations will be published in future articles.
