We all want a large sample of trades in our backtests, but practical limitations such as data availability often get in the way.
Here I’ll explain why 30 trades is insufficient, and how you can use standard error to quantify the uncertainty arising from a small sample size.
Browsing through the MQL5 Marketplace is a fun way to discover the many types of trading algorithms in existence.
I have seen backtests containing anywhere from 5 to 5000 trades.
So how many trades (or sample size) does your backtest really need?
And if you suspect your backtest has insufficient trades, can you still make use of its results?
The Importance of Sample Size
A large number of trades increases the statistical significance of your backtest results.
In essence, this means you can be confident that your results are a true reflection of your strategy’s performance, and are not due to chance.
Since your backtest is the crucial ‘scorecard’ that accompanies your strategy from inception to live trading, you need to be sure you can trust it.
In addition, targeting a larger sample size often forces you to backtest your strategy over a longer historical period. This lets you evaluate your strategy over different market conditions, and gives you a better idea of its robustness.
Are 30 Trades Enough?
When it comes to statistical significance, the number 30 gets plenty of attention.
When you backtest your strategy, you are attempting to characterize its probability distribution, as statisticians like to say.
30 trades is usually sufficient if you’re trying to verify a distribution you have already characterized.
For example, you have a basket of 30 live trades, and you want to see how these compare to your backtest performance.
You could use a Student’s t-test or a chi-square test to check whether both sets of trades are consistent with the same distribution. I demonstrate the use of these simple statistical tests during the forward testing phase of strategy development.
However, if it’s your first attempt at characterizing the distribution, 30 trades is woefully insufficient.
I’ll illustrate this with a common analogy in probability theory.
The Sock Analogy
Imagine you receive a large barrel of socks with the following label: 50% black, 50% white socks.
You start drawing socks from the barrel, one at a time. After 30 draws, you have 17 black socks and 13 white socks.
That is close enough to an even split, so you conclude that the label is plausible.
Suppose you now receive an unlabelled barrel of socks, and you have absolutely no idea what’s inside.
Would 30 draws allow you to confidently describe the contents of the barrel? Probably not!
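To put a number on that intuition, here is a minimal sketch using the binomial distribution. It assumes the barrel is large enough that each draw is effectively independent, and the function name binomial_pmf is my own illustrative choice. It shows how likely a result of 17 black socks in 30 draws would be for several possible barrels.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k black socks in n independent draws
    when a fraction p of the barrel is black."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Candidate barrels (hypothetical shares of black socks).
for share_black in (0.4, 0.5, 0.6, 0.7):
    likelihood = binomial_pmf(17, 30, share_black)
    print(f"barrel with {share_black:.0%} black: P(17 black in 30 draws) = {likelihood:.3f}")
```

A 17/13 split is about as likely from a 60% black barrel as from an evenly mixed one, so 30 draws alone cannot tell those barrels apart.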
Your strategy’s average profit/loss, win rate, stagnation, etc. are all important metrics that your backtest should reveal.
It’s quite impossible to characterize a whole bunch of metrics with such a small sample size. Just look at the wall of socks below!
So How Many Trades? The Short Answer…
The more the merrier.
Obviously this answer is not particularly useful or actionable. You could face practical limitations regarding sample size for the following reasons:
- It is difficult to find quality historical data before the year 2000
- You trade on the higher timeframes
- You allocate some data for out-of-sample testing
It could be a mistake to discard a promising strategy simply because it has too few trades. After all, successful trading is about making the right trade-offs.
We need a way to quantify the deterioration of backtest results arising from a small sample size.
Fortunately, statisticians have solved this dilemma for us, using a concept called standard error.
I’ll explain standard error below, then demonstrate its application using two very different backtests.
What is Standard Error?
Standard error measures how precisely a statistic computed from a sample estimates the true underlying value. It helps you gauge how reliable your backtest results are.
You can apply the standard error to any statistic, but for our trading purposes, we’ll use the mathematical expectancy (also called the average trade or the mean) of the backtest.
A trader backtesting a strategy is like a statistician sampling a population to determine some underlying parameter. To better understand what standard error means, let’s first discuss how it is used in statistics.
A Short Statistics Excursion
Imagine you want to determine the average height of 30-year-old males in a country. This height would be the parameter of interest. Its true value is unknown, but you hope to get a good estimate of it through the sampling process.
So you go about collecting 5 samples of data throughout the country, each consisting of 10 data points. You get the following plot:
To get the standard error, you do the following:
- Compute the mean height from each of the 5 samples
- Compute the standard deviation of these 5 means. Excel’s STDEV function can help with this.
The standard error is equal to the standard deviation of these sample means.
Standard error measures the sample-to-sample variability of the means, and tells you how far a sample mean is likely to deviate from the true population mean. The smaller the standard error, the more representative the sample will be of the overall population.
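The height example can be simulated in a few lines. This is only a sketch under stated assumptions: the population is taken to be normally distributed with a mean of 175 cm and a standard deviation of 7 cm (numbers I made up for illustration), and the function names are hypothetical. It computes the standard deviation of the sample means for 5 samples of 10, then repeats the exercise with 2000 samples to show what the 5-sample figure is trying to estimate.

```python
import random
import statistics


def draw_samples(num_samples, sample_size, rng, mu=175.0, sigma=7.0):
    """Draw independent samples of heights (cm) from a hypothetical normal population."""
    return [[rng.gauss(mu, sigma) for _ in range(sample_size)]
            for _ in range(num_samples)]


def standard_error_from_samples(samples):
    """Standard deviation of the sample means (the two steps listed above)."""
    sample_means = [statistics.mean(sample) for sample in samples]
    return statistics.stdev(sample_means)


rng = random.Random(42)  # fixed seed so the run is repeatable
five_samples = draw_samples(5, 10, rng)
many_samples = draw_samples(2000, 10, rng)

print("5 samples of 10:   ", round(standard_error_from_samples(five_samples), 2))
print("2000 samples of 10:", round(standard_error_from_samples(many_samples), 2))
print("sigma / sqrt(10):  ", round(7.0 / 10 ** 0.5, 2))
```

With only 5 samples the estimate is itself noisy. With many samples it settles near the population standard deviation divided by the square root of the sample size.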
Coming back to our trading context, a small standard error means our backtest expectancy will be close to the ‘true’ value that we would obtain if we could backtest over infinite data.
How Do We Calculate Standard Error?
In trading, we don’t have the luxury of having multiple samples as shown above. We only have one, and that’s our backtest.
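Fortunately, standard error can be estimated from a single sample using the simple formula: standard error = standard deviation of the individual trade profits ÷ √(number of trades).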
The more trades your backtest has, the smaller the standard error.
To get the standard deviation value, you will need the profit/loss from each individual trade in your backtest. You can easily get this value by exporting your MT4 backtest report to Excel.
- Apply Excel’s data filter function to the Type column. Remove all rows that have empty cells in the Profit column.
- Use Excel’s STDEV function to calculate the standard deviation of the individual trade profits.
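If you prefer a script to a spreadsheet, the same calculation can be sketched in Python. The estimate used here is the sample standard deviation of the individual trade profits divided by the square root of the number of trades. The function name and the list of profits are hypothetical, chosen only for illustration. statistics.stdev computes the sample standard deviation, which matches Excel’s STDEV.

```python
import math
import statistics


def expectancy_and_standard_error(profits):
    """Return (expectancy, standard error) for a list of per-trade profits/losses."""
    num_trades = len(profits)
    if num_trades < 2:
        raise ValueError("need at least two trades to estimate a standard deviation")
    expectancy = statistics.mean(profits)
    std_dev = statistics.stdev(profits)  # sample standard deviation (n - 1 divisor)
    return expectancy, std_dev / math.sqrt(num_trades)


# Made-up trade results in account currency, for illustration only.
example_profits = [120.0, -80.0, 45.0, -60.0, 210.0, -75.0, 30.0, 95.0, -50.0, 15.0]
expectancy, standard_error = expectancy_and_standard_error(example_profits)
print(f"expectancy = {expectancy:.2f}, standard error = {standard_error:.2f}")
```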
How Do We Use Standard Error?
Thanks to the central limit theorem, it is usually safe to assume that your backtest’s expectancy (the mean of many individual trade results) is approximately normally distributed, even when the individual profits/losses are not. In other words, it follows the famous ‘bell curve’ shown below:
In a normal distribution,
- 68.3% of values lie within ±1 standard deviation of the mean
- 95.4% of values lie within ±2 standard deviations of the mean
- 99.7% of values lie within ±3 standard deviations of the mean
If you have a backtest with an expectancy of $100 and a standard error of $20, you can make use of the above information to estimate the following:
- You can be 68.3% confident that your strategy’s true expectancy lies between $80 and $120 (±1 standard error)
- You can be 95.4% confident that your strategy’s true expectancy lies between $60 and $140 (±2 standard errors)
- You can be 99.7% confident that your strategy’s true expectancy lies between $40 and $160 (±3 standard errors)
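The three ranges above can be reproduced with a short sketch. It assumes, as the text does, that the expectancy is normally distributed. The function names are hypothetical. The coverage for ±k standard errors comes from the error function, erf(k/√2).

```python
import math


def expectancy_band(expectancy, standard_error, k):
    """Range covering expectancy ± k standard errors."""
    return expectancy - k * standard_error, expectancy + k * standard_error


def normal_coverage(k):
    """Share of a normal distribution lying within ±k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


# The example from the text: expectancy of $100, standard error of $20.
for k in (1, 2, 3):
    low, high = expectancy_band(100.0, 20.0, k)
    print(f"±{k} standard errors: ${low:.0f} to ${high:.0f} "
          f"({normal_coverage(k):.1%} confidence)")
```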
That’s standard error in a nutshell for you. Some statistical rigour was sacrificed to arrive at the statements above, but trading is a practical moneymaking endeavour, not an academic exercise.
Examples of Standard Error Application
I’ll use two backtests to demonstrate the application of standard error. Both are from trend following strategies trading 0.1 lots throughout the backtest.
The first strategy trades on the 15-minute timeframe.
The second strategy trades on the 4-hour timeframe.
At first glance, the second strategy seems far more promising. Expectancy and profit factor are significantly higher, although the backtest only has 290 trades. Let’s see if that causes problems.
I did the following for each strategy:
- Compute the standard error using the procedure described above
- Compute the expectancy ± 2 standard errors
Fortunately, both backtests still yield a positive expectancy after subtracting 2 standard errors.
But look at how similar their ‘adjusted’ expectancies are. Standard error has exposed the backtest uncertainty arising from a small sample of trades for the H4 strategy.
This is an unfortunate reality of trading on the higher timeframes. Such strategies typically contain larger wins and losses, giving a larger standard deviation of individual trade results. Compound this with the smaller number of trades, and you get a standard error that can really erode your backtest expectancy.
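To see how these two effects combine, here is a rough sketch of the ‘adjusted’ expectancy (expectancy minus 2 standard errors) as a function of the number of trades. The expectancy of $100 and standard deviation of $400 are hypothetical numbers, not taken from either backtest, and the function name is my own.

```python
import math


def conservative_expectancy(expectancy, std_dev, num_trades, k=2.0):
    """Expectancy minus k standard errors, where
    standard error = std_dev / sqrt(num_trades)."""
    if num_trades < 1:
        raise ValueError("num_trades must be at least 1")
    return expectancy - k * std_dev / math.sqrt(num_trades)


# Hypothetical strategy: $100 expectancy, $400 standard deviation per trade.
for num_trades in (30, 100, 300, 1000, 3000):
    adjusted = conservative_expectancy(100.0, 400.0, num_trades)
    print(f"{num_trades:>5} trades: adjusted expectancy = {adjusted:8.2f}")
```

Holding the expectancy and standard deviation fixed, quadrupling the number of trades halves the standard error, so the adjusted figure climbs steadily toward the raw expectancy as the sample grows.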
Suppose you’re trading a daily trend following strategy that was developed on a small sample of trades, and it starts underperforming. Market conditions could have changed, or perhaps your backtest uncertainty is playing out in real time.
A large sample of trades minimizes the effects of luck, and helps ensure you’re discovering the true performance profile of your strategy.
Rather than establish a minimum acceptable number of trades, consider using the standard error to quantify your backtest’s uncertainty arising from a small sample of trades.
If your backtest’s expectancy is still positive after subtracting twice the standard error, it’s likely your strategy will be profitable over the long term.
Unfortunately for part-time retail traders, backtests of higher-timeframe strategies often contain large standard errors. Trading on the lower timeframes could alleviate this. (Hint: Algorithmic trading will come in handy.)
How many trades do you like to see in your backtest? Let me know in the comments!