Random Forest

With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. Random forests are close second, followed by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets.

— Rich Caruana, Alexandru Niculescu-Mizil

link: https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf

Definition

A Decision Tree is a schematic, tree-shaped diagram used to determine a course of action or to show a statistical probability. Each branch of the tree represents a possible decision, occurrence, or reaction. The tree is structured to show how and why one choice may lead to the next, with the branches indicating that the options are mutually exclusive.

Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.

Algorithm

Splitting Criteria

  • Information Gain
  • Entropy
  • Variance Reduction (for regression)
  • Gini Index (used by CART, which produces binary splits)
  • Chi-Square
  • etc.

Example of Good Split and Bad Split

(Figure: Gini coefficient illustration of a good split vs. a bad split)
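As a rough sketch of how these criteria score candidate splits, the snippet below computes the Gini index and entropy for a "good" split that separates the classes cleanly and a "bad" split that leaves them mixed; the class counts are hypothetical and not taken from any dataset in this post.

import numpy as np

def gini(counts):
    """Gini impurity of a node given its class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy of a node given its class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_impurity(children, impurity):
    """Impurity of a split = size-weighted average over the child nodes."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * impurity(c) for c in children)

# Hypothetical parent node with 10 positives and 10 negatives
good_split = [[9, 1], [1, 9]]   # children are nearly pure
bad_split  = [[5, 5], [5, 5]]   # children are as mixed as the parent

print("Gini    - good: %.3f, bad: %.3f" %
      (split_impurity(good_split, gini), split_impurity(bad_split, gini)))
print("Entropy - good: %.3f, bad: %.3f" %
      (split_impurity(good_split, entropy), split_impurity(bad_split, entropy)))

Lower impurity after the split is better, so the "good" split scores well under both criteria.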

Regression or Classification

1 For b = 1 to B (construct B trees):

1.1 Draw a bootstrap sample D(b) of size N from the training data D.

1.2 Grow a random-forest tree T(b) on D(b) by recursively repeating the following steps for each terminal (leaf) node, until the minimum node size n(min) is reached:

1.2.1 Select m variables at random from the M variables.

1.2.2 Pick the best variable/split-point among the m.

1.2.3 Split the node into two daughter nodes.

2 Output the ensemble of trees {T(b)}, b = 1, …, B. For regression, predict by averaging the B trees; for classification, take the majority vote.
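A minimal from-scratch sketch of this procedure, leaning on scikit-learn's DecisionTreeRegressor for the individual trees. The helper names fit_random_forest and predict_forest are illustrative, X and y are assumed to be NumPy arrays, max_features=m realizes steps 1.2.1-1.2.2 at every split, and min_samples_leaf plays the role of n(min).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, B=100, m=None, min_samples_leaf=5, seed=0):
    """Construct B trees, each on a bootstrap sample, trying m variables per split."""
    rng = np.random.default_rng(seed)
    N = len(X)
    trees = []
    for _ in range(B):
        # 1.1 bootstrap sample of size N (drawn with replacement)
        idx = rng.integers(0, N, size=N)
        # 1.2 grow a tree; max_features=m picks m random variables at each split
        tree = DecisionTreeRegressor(max_features=m,
                                     min_samples_leaf=min_samples_leaf,
                                     random_state=rng.integers(1 << 30))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """2. Ensemble prediction: average the B trees (majority vote for classification)."""
    return np.mean([t.predict(X) for t in trees], axis=0)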

Bagging

Bagging (bootstrap aggregating) trains each tree on its own bootstrap sample of the training data and aggregates their predictions: averaging for regression, majority voting for classification.

Bootstrapping

A bootstrap sample draws N rows with replacement from the N training rows, so on average only about 63.2% of the distinct rows appear in any one sample; the remaining rows are "out-of-bag" for that tree.

Ensemble

The ensemble is the collection of B such trees. Because each tree sees different rows, and different candidate variables at each split, the trees are de-correlated, and combining them reduces variance compared with any single tree.
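A quick sketch of the in-bag / out-of-bag split for a single bootstrap sample; the training-set size below is made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
N = 10_000                                     # hypothetical training-set size
sample = rng.integers(0, N, size=N)            # one bootstrap sample: N draws with replacement
in_bag = np.unique(sample).size                # distinct rows that ended up in the sample
print("in-bag fraction:    ", in_bag / N)      # roughly 0.632
print("out-of-bag fraction:", 1 - in_bag / N)  # roughly 0.368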

Pros vs Cons

Pros

  • Can solve both types of problems, classification and regression
  • Random forests generalize well to new data
  • It is among the most accurate of current general-purpose learning algorithms
  • It runs efficiently on large databases and can handle thousands of input variables without variable deletion
  • It gives estimates of which variables are important in the classification (see the sketch after this list)
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing
  • It computes proximities between pairs of cases that can be used in clustering, locating outliers, or giving interesting views of the data
  • The out-of-bag error estimate removes the need for a set-aside test set
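Several of the points above, notably variable importance and the out-of-bag error estimate, are exposed directly by scikit-learn's RandomForestClassifier; the sketch below uses a synthetic dataset purely for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=3, random_state=0)

# oob_score=True asks for the internal out-of-bag estimate of accuracy,
# so no separate held-out test set is needed for a rough error estimate
clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
clf.fit(X, y)

print("Out-of-bag accuracy:", round(clf.oob_score_, 3))
print("Variable importances:", np.round(clf.feature_importances_, 3))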

Cons

  • The results are less actionable because forests are not easily interpreted: the forest is essentially a black-box model that gives the statistical modeler little control over what it does, similar to a neural network.
  • It does a good job at classification but is weaker for regression, since it cannot produce truly continuous predictions: a regression forest never predicts beyond the range of the training data (as the sketch below illustrates), and it may over-fit data sets that are particularly noisy.
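The no-extrapolation point is easy to see in practice: a forest's regression output is an average of training-set leaf values, so it flattens outside the range seen during training. A small sketch on made-up data:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x in [0, 10] with a simple linear target
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 3 * X_train.ravel() + rng.normal(scale=0.5, size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Ask for predictions far outside the training range
X_new = np.array([[5.0], [20.0], [100.0]])
print(rf.predict(X_new))   # the 20 and 100 predictions stay near the training maximum (~30)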

Application

Python

# Import NumPy and the model we are using
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# train_features, train_labels, test_features and test_labels are assumed to
# have been prepared beforehand (e.g. with sklearn.model_selection.train_test_split)

# Instantiate the model with 1000 decision trees
rf = RandomForestRegressor(n_estimators=1000, random_state=42)

# Train the model on the training data
rf.fit(train_features, train_labels)

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')

Mean Absolute Error: 3.83 degrees.

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 93.99 %.

R

# Install (if needed) and load the randomForest package
install.packages("randomForest")
library(randomForest)

# TrainSet is assumed to be a prepared data frame whose factor response is "Condition"

# Create a Random Forest model with default parameters
model1 <- randomForest(Condition ~ ., data = TrainSet, importance = TRUE)
model1

# Fine tuning parameters of Random Forest model
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2

# Predicting on train set
predTrain <- predict(model2, TrainSet, type = "class")

# Checking classification accuracy
table(predTrain, TrainSet$Condition)  


