Recently I learned about a machine learning method, the Random Forest classifier. It predicts categories by combining the decisions of many smaller models called decision trees.
The model builds many decision trees and uses the features we provide to narrow the decision down to a final classification.
Like any other ML model, we give it enough training data and a test dataset to evaluate the accuracy of the model.
The pipeline
- Create many trees. Each one casts a vote.
- Each tree produces its own answer based on the features it saw in the training data.
- Tally all the votes and take the category with the most votes as the final classification.
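The voting step above can be sketched in a few lines. This is a toy illustration of majority voting, not scikit-learn's internals:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine individual tree predictions into one class by majority vote."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Three hypothetical trees vote on the same sample:
print(majority_vote(["up", "down", "up"]))  # -> up
```

In a real Random Forest each tree also sees a bootstrapped sample of the data and a random subset of features, which is what makes the combined vote more robust than any single tree.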
Let’s build it in Python now. I worked on 5 stocks from the Magnificent Seven. For this project I used 5 years of historical prices and defined multiple features.
I then analyzed the correlations between the features and concluded that Returns, Volatility, and RSI were the least correlated with each other. Using features that carry distinct information avoids feeding the model redundant signals, which helps reduce overfitting.
I then split the data between training and testing and used the commands from the libraries we imported initially.
The model was defined as RandomForestClassifier(n_estimators=100, max_depth=7, random_state=42),
then trained with model.fit(X_train, y_train).
Let’s break down these commands:
- n_estimators = 100: the model will have 100 trees.
- max_depth = 7: controls how complex each tree can get.
- random_state = 42: fixes the randomness so results are reproducible. Any fixed integer works; 42 is just a convention (a pop-culture joke from The Hitchhiker's Guide to the Galaxy), not a statistical requirement.
- model.fit(X_train, y_train): trains the model on the training features and labels.
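Putting the pieces above together, a minimal end-to-end sketch looks like this. The features and labels here are synthetic stand-ins for the real Returns/Volatility/RSI matrix and up/down labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic feature matrix (3 columns standing in for Returns, Volatility, RSI)
# and a binary up/down label with some real signal in the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# 80/20 split; shuffle=False mimics a chronological split for time series.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```

With real market data the split should always be chronological, otherwise the model is evaluated on days that overlap its training window.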
Looking at the accuracy for each stock, all are above 50% except GOOGL. That could hint at an edge, but we must analyze further before drawing conclusions.
To understand if there is any real value here, I analyzed the summary statistics of each ticker individually.
The results were not the best: extremely low R-squared values and high p-values for our predictors.
However, each stock did have a predictor that was more interesting than the others.
For example, the feature AAPL_Return had a p-value of 0.109, still above the conventional 0.05 significance threshold, but low enough to make it one of the more relevant signals in the model.
It was now time to backtest the strategy. The rules:
- Only Random Forest up-probabilities of 58% and above were considered. Every time the model crossed that threshold, we entered a position.
- Start date: 2021-01-01 / End date: 2026-04-10.
- 80% training data / 20% testing data.
- Testing period: March 2025.
- The benchmark: SPY buy and hold.
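The threshold rule above can be sketched as a simple long-or-cash backtest. The `backtest` helper and its inputs are illustrative; in practice `prob_up` would come from `model.predict_proba(X_test)[:, 1]` and `returns` from the realized daily returns over the test period:

```python
import numpy as np

def backtest(prob_up, returns, threshold=0.58):
    """Go long for the day whenever the model's up-probability clears the
    threshold; stay in cash otherwise. Returns the cumulative return."""
    positions = (prob_up >= threshold).astype(float)
    strategy_returns = positions * returns
    return float(np.prod(1 + strategy_returns) - 1)

# Toy inputs: four days of probabilities and realized returns.
prob_up = np.array([0.61, 0.40, 0.59, 0.55])
rets = np.array([0.01, -0.02, 0.005, 0.03])
print(backtest(prob_up, rets))
```

Note this sketch ignores transaction costs and slippage, which a serious backtest would need to model.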
We beat the market; we are geniuses. During these years the market regime stayed broadly the same, so the data the model trained on was a good proxy for what would happen from 2025 onwards. The 5-year window captures a modern regime: post-COVID volatility, high interest rates, and the AI-driven tech concentration.
The model achieved a return of:
- Random Forest return: 23.45%
- Buy and Hold: 19.51%
However, when we extend the backtest to 10 years, the model breaks. There were too many regime changes, and the model trained on data that was not representative of future market movements. From 2015 to 2019 the market was a low-interest-rate playground: low volatility, and buying the dip was a no-brainer.
By feeding the model this market, we were basically training it on a regime that no longer exists.
- Random Forest return: 32.63%
- Buy and Hold: 33.37%
Conclusions, flaws and biases
In statistics I tend to think more data is always better. In trading, relevant data is what matters.
- 5 years: Model wins. Optimized for the past bull market.
- 10 years: Model fails. Using outdated patterns and old regimes.
This is a situation where adding more features to the model will not change or help the results, for the reasons above.
In the future I will try to understand market regimes and how to train a model to accurately predict those regimes. Understanding macro will be crucial.
Enjoy the walk.