Practical Random Forest Implementation


We discussed decision trees and random forests in quite a lot of detail here. This article walks through a practical implementation in which we use historic data to predict future weather. The value we want to predict is continuous, so the problem calls for a regression model rather than a classification model, which predicts discrete categories.

Remember, random forests are supervised machine learning models. They are supervised because we give them the ‘truth’ (the known, labelled outcomes) to learn from.

This post follows a great tutorial. I have tried to comment the code thoroughly to make it easy to follow.

The machine learning process:

  1. Acquire data & wrangle it into the correct format
  2. Clean the data: remove outliers, fill null values, correct anomalies, etc.
  3. Prepare the data to be consumed by the model (in this case, convert it to NumPy arrays)
  4. Calculate a baseline: how accurate would a prediction be if we simply used historic averages, with no machine learning? Our model needs to beat this.
  5. Load the training data, including the labels (the actual results), into the model and train it
  6. Make predictions against test data & compare predictions versus actuals
  7. Compare to baseline to make sure your model is more accurate than simply using averages from historic data
  8. Adjust the model / get more training data / try a different model if performance is not good enough
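
The steps above can be sketched end to end as follows. This is a minimal sketch, assuming scikit-learn and NumPy are installed. The synthetic temperature data and the names temp_yesterday, temp_last_week and historic_avg are made up for illustration and are not the dataset used in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Steps 1-3: synthetic stand-in data (hypothetical), already clean
# and held in NumPy arrays.
rng = np.random.default_rng(42)
n_days = 400
temp_yesterday = rng.normal(60, 10, n_days)
temp_last_week = temp_yesterday + rng.normal(0, 5, n_days)
historic_avg = 60.0 + rng.normal(0, 2, n_days)

# The label (the 'truth') we want the model to learn to predict.
actual_temp = 0.7 * temp_yesterday + 0.3 * temp_last_week + rng.normal(0, 2, n_days)

features = np.column_stack([temp_yesterday, temp_last_week, historic_avg])

# Hold back 25% of the rows so we can test on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    features, actual_temp, test_size=0.25, random_state=42
)

# Step 4: baseline error if we simply predicted the historic average.
baseline_mae = mean_absolute_error(y_test, X_test[:, 2])

# Step 5: train the random forest on the labelled training data.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 6: predict on the test data and compare with the actual values.
predictions = model.predict(X_test)
model_mae = mean_absolute_error(y_test, predictions)

# Step 7: the model is only worth keeping if it beats the baseline.
print(f"Baseline mean absolute error: {baseline_mae:.2f}")
print(f"Model mean absolute error:    {model_mae:.2f}")
```

If the model error were not lower than the baseline error, that would be the cue for step 8: adjust the settings, gather more data, or try a different model.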

Random Forest Use-Cases:

  • Loan approvals (classification)
  • Fraud detection (classification)
  • Disease detection (classification)
  • Stocks, expected profit / loss based on market conditions (regression)
  • Product recommendation (classification)
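
The difference between the two kinds of use-case shows up in the type of label the model predicts. The sketch below is illustrative only, assuming scikit-learn and NumPy are installed. The income, debt, approved and profit values are invented toy data, not real lending or market figures.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical toy features: income and debt, both in thousands.
rng = np.random.default_rng(1)
income = rng.uniform(20, 120, 300)
debt = rng.uniform(0, 50, 300)
X = np.column_stack([income, debt])

# Classification: the label is a discrete category (1 = approved, 0 = declined).
approved = (income - 1.5 * debt > 30).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=1)
clf.fit(X, approved)
class_pred = clf.predict(X[:5])  # each prediction is one of the known classes

# Regression: the label is a continuous number (a made-up profit figure).
profit = 0.1 * income - 0.2 * debt + rng.normal(0, 1, 300)
reg = RandomForestRegressor(n_estimators=50, random_state=1)
reg.fit(X, profit)
value_pred = reg.predict(X[:5])  # each prediction is a floating-point value

print("Predicted classes:", class_pred)
print("Predicted values: ", value_pred)
```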

Machine learning lingo:

  • Hyperparameter tuning: adjusting the model's settings (for example, the number of trees in the forest) to improve performance
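
As a rough illustration of hyperparameter tuning, the sketch below tries a few combinations of settings and keeps the one with the lowest cross-validated error. It assumes scikit-learn and NumPy are installed and uses its GridSearchCV helper. The synthetic data and the particular values in param_grid are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data: 200 rows, 3 features, continuous label.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, 200)

# The settings (hyperparameters) to try; every combination is evaluated.
param_grid = {
    "n_estimators": [10, 50],  # number of trees in the forest
    "max_depth": [3, None],    # None lets each tree grow until its leaves are pure
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                               # 3-fold cross-validation
    scoring="neg_mean_absolute_error",  # scikit-learn maximises, hence the negation
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a plain error value

print("Best settings:", best_params)
print(f"Cross-validated mean absolute error: {best_mae:.2f}")
```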

The Code: