Into the Machine Learning Woods: The Random Forest

Ayoub_Ali
Jul 14, 2023

Random forests inherit the benefits of a decision tree model whilst improving upon the performance by reducing the variance. — Jeremy Jordan

Random Forest is a popular and powerful ensemble learning algorithm that combines multiple decision trees to generate accurate and stable predictions. In this blog post, we will delve into the workings of Random Forest, its advantages, and when to consider using it. We will also explore key hyperparameters and demonstrate how to tune them using GridSearchCV to create the best model possible. Let’s dive in!

What is Random Forest?

At a high level, Random Forest merges independent decision trees to improve prediction accuracy and stability. It falls under the umbrella of ensemble methods, which combine multiple models to outperform any single one; for Random Forest, averaging many decorrelated trees primarily reduces variance, as the opening quote notes. Unlike some complex algorithms, Random Forest is relatively easy to comprehend: it grows N independent decision trees that each focus on different aspects of the data, and combining them yields better predictions than any individual tree.

Example: the Titanic dataset

# Import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Titanic dataset
titanic_data = pd.read_csv("titanic.csv")
# Data preprocessing
# Drop unnecessary columns or handle missing values as per your requirements
titanic_data = titanic_data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
titanic_data["Age"].fillna(titanic_data["Age"].mean(), inplace=True)
titanic_data["Embarked"].fillna(titanic_data["Embarked"].mode()[0], inplace=True)
titanic_data = pd.get_dummies(titanic_data, columns=["Sex", "Embarked"])

# Split the dataset into features (X) and target variable (y)
X = titanic_data.drop("Survived", axis=1)
y = titanic_data["Survived"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

Understanding Random Forest Training:

During the training phase, Random Forest draws N bootstrap samples from the training data (sampling rows with replacement, one sample per tree). On top of that, each time a tree considers a split, only a random subset of the features is evaluated (max_features in scikit-learn). Each tree therefore sees its own combination of rows and columns. The decision trees are built independently on these samples, each trying to fit its own sample as accurately as it can; no tree is aware of what the other trees are doing. This independence lets each tree capture different relationships within the data, which is what makes the combined prediction so strong.
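
To make this concrete, here is a minimal sketch of the training loop (not scikit-learn's actual internals), reusing X_train and y_train from the Titanic example above. Each tree gets its own bootstrap sample of rows, and max_features="sqrt" makes every split consider only a random subset of the columns.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
n_trees = 5
trees = []
for i in range(n_trees):
    # Bootstrap: draw row indices with replacement, same size as the training set
    indices = rng.choice(len(X_train), size=len(X_train), replace=True)
    # Each split in this tree will consider a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train.iloc[indices], y_train.iloc[indices])
    trees.append(tree)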

Making Predictions with Random Forest:

When making predictions with Random Forest, each example from the test set traverses all N trees. At each tree, the example’s features guide its path down the branches until that tree produces a prediction, giving N predictions for a single example. The algorithm then aggregates these predictions: majority voting for classification (scikit-learn, in fact, averages the trees’ predicted class probabilities) and averaging for regression. This approach leverages the diverse insights captured by the independent trees, improving the overall accuracy of the model.

# Make predictions on the test set
y_pred = rf_model.predict(X_test)
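
Since accuracy_score was already imported at the top, we can evaluate these predictions directly. We can also peek at the voting itself: scikit-learn exposes the fitted trees as rf_model.estimators_. A caveat for this sketch: RandomForestClassifier averages the trees’ predicted class probabilities rather than taking a hard vote, so the manual vote below can disagree with rf_model.predict on borderline examples.

import numpy as np

# Evaluate the aggregated predictions
print("Test accuracy:", accuracy_score(y_test, y_pred))

# One prediction per tree, then a hard majority vote (labels are 0/1)
X_test_array = X_test.to_numpy(dtype=float)  # individual trees expect plain arrays
tree_preds = np.array([tree.predict(X_test_array) for tree in rf_model.estimators_])
hard_vote = (tree_preds.mean(axis=0) >= 0.5).astype(int)
print("Agreement with rf_model.predict:", (hard_vote == y_pred).mean())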

When to Use Random Forest:

Random Forest handles both categorical and continuous target variables (classification and regression), making it a versatile algorithm. It serves as an excellent benchmark model, delivering good performance while being relatively fast to train. It is also robust to messy data: outliers and skewed distributions have little effect on it, and it needs minimal feature scaling (depending on your scikit-learn version, you may still need to impute missing values, as we did above). However, if you need to squeeze out the last bit of predictive performance, or if you need a model whose individual decisions are easy to interpret, Random Forest might not be the ideal choice.

Understanding Hyperparameters and Tuning:

Random Forest offers various hyperparameters to fine-tune the model. Two crucial hyperparameters to focus on are:

  • n_estimators: This parameter sets the number of decision trees to build. More estimators generally improve performance up to a point, at the cost of longer training and prediction times.
  • max_depth: This hyperparameter caps the depth of each individual tree, restricting model complexity and helping prevent overfitting. Finding the right balance is key; see the sketch after this list.
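
A quick way to get a feel for these trade-offs is to compare cross-validated scores for a handful of settings by hand (a sketch reusing X_train and y_train from the Titanic example above). Doing this manually gets tedious fast, which is exactly what GridSearchCV automates.

from sklearn.model_selection import cross_val_score

for n_estimators in [50, 100, 200]:
    for max_depth in [3, 5, 10]:
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=42)
        scores = cross_val_score(model, X_train, y_train, cv=5)
        print(f"n_estimators={n_estimators}, max_depth={max_depth}: "
              f"mean CV accuracy {scores.mean():.3f}")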

Hyperparameter Tuning with GridSearchCV:

To optimize the Random Forest model, GridSearchCV can be utilized. It allows us to search through different hyperparameter combinations using cross-validation. By specifying a range of values for n_estimators and max_depth, we can systematically evaluate various settings and identify the best-performing combination. The chosen configuration can yield higher accuracy and better generalization.

# Perform GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Candidate values to search over (illustrative ranges)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]}
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
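
As a last step, it is worth scoring the tuned model on the held-out test set, continuing the example above:

# Evaluate the tuned model on data it has never seen
y_pred_best = best_model.predict(X_test)
print("Best parameters:", best_params)
print("Test accuracy:", accuracy_score(y_test, y_pred_best))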
