Random Forests
Random forests are an ensemble method (a form of bagging) that constructs multiple decision trees on random samples of the training set drawn with replacement. It is a supervised learning algorithm and can be used for both classification and regression problems. It collects a prediction from each tree and outputs the mode of the classes for classification or the mean prediction for regression.
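To make the idea concrete, here is a minimal sketch of bagging by hand: fit one decision tree per bootstrap sample and take a majority vote across the trees. This is illustrative only, not how you would use the algorithm in practice; the helper name is mine, and it assumes NumPy arrays with integer class labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
def bagged_trees_predict(X_train, y_train, X_test, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_trees):
        # draw a bootstrap sample: same size, sampled with replacement
        idx = rng.integers(0, len(X_train), size=len(X_train))
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.stack(votes)  # shape: (n_trees, n_test_points)
    # the mode of the classes: majority vote per test point
    return np.array([np.bincount(col).argmax() for col in votes.T])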
The random forest algorithm is not just the bagging algorithm applied to trees. It introduces extra randomness when growing trees: at each candidate split, it selects the best feature among a random subset of the features (called feature bagging). This approach leads to greater tree diversity and lower variance. Another benefit of random forests is that they can be used to measure the importance of features. We will look at Python code that prints the importance scores later.
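In scikit-learn, this feature bagging is controlled by the max_features parameter of RandomForestClassifier; a rough sketch (the model is not fit here):
from sklearn.ensemble import RandomForestClassifier
# max_features sets the size of the random feature subset examined at
# each split; "sqrt" means sqrt(n_features). Smaller subsets give more
# randomness, more diverse trees, and lower variance.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")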
Extra Trees
Extra Trees (short for Extremely Randomized Trees) are a variant of random forests. The main differences are: 1) each tree is trained on the whole training set rather than a bootstrap sample, and 2) when a node splits, random thresholds are drawn for each feature rather than searching for the best threshold per feature. Of all the randomly generated splits, the one with the highest score is selected to split the node.
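In scikit-learn this is available as ExtraTreesClassifier, which is a drop-in replacement for RandomForestClassifier; a minimal sketch:
from sklearn.ensemble import ExtraTreesClassifier
# By default each tree sees the whole training set (bootstrap=False),
# and split thresholds are drawn at random rather than optimized.
etc = ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16)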
Python Code
Let’s take a look at how random forests make predictions on a famous dataset, the iris dataset, for classification. I will skip the details of the dataset; you can easily find well-written information about it online.
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'sepal width': iris.data[:, 1],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target
})
data.head()
This loads the iris dataset and stores it in a pandas DataFrame. The first five rows look like this:
   sepal length  sepal width  petal length  petal width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0
The “species” column is the target variable; we predict its labels (0: setosa, 1: versicolor, 2: virginica). Now, let’s split the dataset into an 80% training set and a 20% test set.
from sklearn.model_selection import train_test_split
X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The following code trains a random forest classifier with 100 trees, each limited to a maximum of 16 leaf nodes:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
The accuracy comes out to around 93% (the exact number varies from run to run, since both the train/test split and the forest are randomized):
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.9333333333
Let’s look at the importance score of each feature.
for name, score in zip(iris["feature_names"], clf.feature_importances_):
    print(name, score)
The result shows that petal width and petal length are more important features than sepal width and sepal length.
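For a tidier view, the scores can be wrapped in a pandas Series and sorted; this reuses clf, iris, and pd from the code above:
importances = pd.Series(clf.feature_importances_, index=iris["feature_names"])
print(importances.sort_values(ascending=False))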