> With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. Random forests are close second, followed by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets.
>
> — Rich Caruana, Alexandru Niculescu-Mizil

link: https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf

## Definition

**Decision Tree** is a schematic, tree-shaped diagram used to determine a course of action or show a statistical probability. Each branch of the decision tree represents a possible decision, occurrence, or reaction. The tree is structured to show how and why one choice may lead to the next, with the branches indicating that the options at each node are mutually exclusive.

**Random Forests** are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.

## Algorithm

### Splitting Criteria

- Information Gain (reduction in entropy after the split)
- Gini Index (works for binary and multiclass targets)
- Variance Reduction (used for regression trees)
- Chi-Square

**Example of Good Split and Bad Split**
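A good split produces pure children; a bad split leaves them as mixed as the parent. A minimal sketch with toy labels (the arrays below are made up for illustration), scoring splits by weighted Gini impurity:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # impurity 0.5

# A good split separates the classes perfectly...
good = weighted_gini(np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1]))
# ...while a bad split leaves both children as mixed as the parent.
bad = weighted_gini(np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))

print(good)  # 0.0
print(bad)   # 0.5
```

The tree picks the split with the largest impurity decrease, here the "good" one.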

### Regression or Classification

1 For b = 1 to B: (construct B trees)

1.1 Draw a bootstrap sample D(b) of size N from the training data D.

1.2 Grow a random-forest tree T(b) on D(b) by recursively repeating the following steps at each leaf node of the tree, until the minimum node size n(min) is reached:

1.2.1 Select m variables at random from the M variables.

1.2.2 Pick the best variable/split-point among the m.

1.2.3 Split the node into two daughter nodes.

2 Output the ensemble of trees {T(b)}, b = 1, ..., B.
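The loop above can be sketched directly in Python, using scikit-learn's `DecisionTreeClassifier` for the per-tree growing step (the dataset, and the values of B and m, are toy choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
N, M = X.shape
B, m = 25, 3  # number of trees, variables tried per split

forest = []
for b in range(B):
    # 1.1: bootstrap sample D(b) of size N, drawn with replacement
    idx = rng.integers(0, N, size=N)
    # 1.2: grow a tree; max_features=m makes it try m of the M variables
    # at random at every split (steps 1.2.1-1.2.3)
    tree = DecisionTreeClassifier(max_features=m, random_state=b)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# 2: the ensemble predicts by majority vote over the B trees
votes = np.stack([t.predict(X) for t in forest])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print((pred == y).mean())  # training accuracy of the hand-rolled forest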

### Bagging

**Bagging** (bootstrap aggregating) trains each tree on a different bootstrap sample and then averages (regression) or majority-votes (classification) their predictions, which reduces the variance of the combined model.

### Bootstrapping

**Bootstrapping** draws a sample of size N from the N training cases *with replacement*, so each tree sees a slightly different dataset. The cases a given tree never sees are its "out-of-bag" (OOB) cases, roughly 37% of the data on average.
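A bootstrap sample draws N cases with replacement, so some cases repeat and others are left out. A minimal sketch with NumPy (the toy "training set" is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # toy "training set" of N = 10 cases

# A bootstrap sample: N draws *with replacement* from the data
sample = rng.choice(data, size=data.size, replace=True)

# Cases never drawn are "out-of-bag" (OOB); for large N a case is
# left out with probability (1 - 1/N)^N, which approaches 1/e ~ 0.368
oob = np.setdiff1d(data, sample)
print(sample, oob)
```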

### Ensemble

An **ensemble** combines many individually weak but decorrelated learners into one stronger predictor; a random forest is an ensemble of B trees whose errors partially cancel when their outputs are aggregated.

## Pros vs Cons

### Pros

- Can solve both types of problems, classification and regression
- Random forests generalize well to new data
- Its accuracy is competitive with the best current algorithms
- It runs efficiently on large databases and can handle thousands of input variables without variable deletion
- It gives estimates of which variables are important in the classification
- It generates an internal unbiased estimate of the generalization error as the forest building progresses
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
- It computes proximities between pairs of cases that can be used in clustering, locating outliers, or giving interesting views of the data
- The out-of-bag (OOB) error estimate removes the need for a set-aside test set
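Two of these points, the internal OOB error estimate and the variable-importance estimates, are exposed directly in scikit-learn (a toy synthetic dataset is assumed here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# oob_score=True evaluates each case only on the trees whose
# bootstrap sample left that case out -- no separate test set needed
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print(rf.oob_score_)            # internal estimate of generalization accuracy
print(rf.feature_importances_)  # which input variables matter most
```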

### Cons

- The results are less actionable because forests are not easily interpreted; like a neural network, a forest is considered a black-box approach that gives statistical modelers little control over what the model does.
- It performs well at classification but less well at regression, because its predictions are averages of training targets rather than precise continuous values. A forest cannot predict beyond the range of the targets seen in training, and it may overfit data sets that are particularly noisy.
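The extrapolation limitation is easy to demonstrate: since a regression forest averages training-set targets, its predictions are bounded by the target range it has seen. A small sketch on made-up linear data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2.0 * X.ravel()  # simple linear trend; targets range from 0 to 20

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Inside the training range the fit is fine...
print(rf.predict([[5.0]]))    # close to 10
# ...but far outside it, predictions flatten near the training maximum,
# because every leaf only averages targets seen during training
print(rf.predict([[100.0]]))  # near 20, not the true value 200
```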

## Application

### Python

```python
import numpy as np

# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators=1000, random_state=42)

# Train the model on training data
rf.fit(train_features, train_labels)

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (MAE)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
# Mean Absolute Error: 3.83 degrees.

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
# Accuracy: 93.99 %.
```

### R

```r
# install.packages("randomForest")  # run once to install
library(randomForest)

# Create a Random Forest model with default parameters
model1 <- randomForest(Condition ~ ., data = TrainSet, importance = TRUE)
model1

# Fine-tune parameters of the Random Forest model
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2

# Predict on the training set
predTrain <- predict(model2, TrainSet, type = "class")

# Check classification accuracy
table(predTrain, TrainSet$Condition)
```

