We all want a large sample of trades in our backtests, but practical limitations such as data availability often get in the way.
Here I’ll explain why 30 trades is insufficient, and how you can use standard error to quantify the uncertainty arising from a small sample size.
Browsing through the MQL5 Marketplace is a fun way to discover the many types of trading algorithms in existence.
I have seen backtests containing anywhere from 5 to 5000 trades.
So how many trades (or sample size) does your backtest really need?
And if you suspect your backtest has insufficient trades, can you still make use of its results?
The Importance of Sample Size
A larger number of trades makes your backtest results more statistically reliable.
In essence, this means you can be more confident that your results reflect your strategy’s true performance, and are not due to chance.
Since your backtest is the crucial ‘scorecard’ that accompanies your strategy from inception to live trading, you need to be sure you can trust it.
In addition, targeting a larger sample size often forces you to backtest your strategy over a longer historical period. This lets you evaluate your strategy over different market conditions, and gives you a better idea of its robustness.
Are 30 Trades Enough?
When it comes to statistical significance, the number 30 gets plenty of attention.
When you backtest your strategy, you are attempting to characterize its probability distribution, as statisticians like to say.
30 trades is usually sufficient if you’re trying to verify a distribution you have already characterized.
For example, you have a basket of 30 live trades, and you want to see how these compare to your backtest performance.
You could use a Student’s t-test or a chi-square test to check whether both sets of trades are consistent with the same distribution. I demonstrate the use of these simple statistical tests during the forward testing phase of strategy development.
However, if it’s your first attempt at characterizing the distribution, 30 trades is woefully insufficient.
I’ll illustrate this with a common analogy in probability theory.
The Sock Analogy
Imagine you receive a large barrel of socks with the following label: 50% black, 50% white socks.
You start drawing socks from the barrel, one at a time. After 30 draws, you have 17 black socks and 13 white socks.
You conclude that the label is correct.
Suppose you now receive an unlabelled barrel of socks, and you have absolutely no idea what’s inside.
Would 30 draws allow you to confidently describe the contents of the barrel? Probably not!
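To put rough numbers on the sock analogy, here is a minimal Python sketch. It assumes the barrel is large enough that each draw is effectively independent, so the count of black socks follows a binomial distribution. The function name `prob_black_count` and the candidate barrel compositions are illustrative choices, not part of the original analogy.

```python
from math import comb


def prob_black_count(black_drawn: int, draws: int, black_fraction: float) -> float:
    """Binomial probability of drawing exactly `black_drawn` black socks."""
    white_drawn = draws - black_drawn
    return (
        comb(draws, black_drawn)
        * black_fraction ** black_drawn
        * (1.0 - black_fraction) ** white_drawn
    )


# How likely is the observed 17-of-30 result under different barrel contents?
# The candidate fractions below are assumptions chosen for illustration.
likelihoods = {
    fraction: prob_black_count(17, 30, fraction)
    for fraction in (0.4, 0.5, 0.6, 0.7)
}

for fraction, likelihood in likelihoods.items():
    print(f"{fraction:.0%} black barrel -> P(17 of 30) = {likelihood:.3f}")
```

Under these assumptions, a 60% black barrel produces the 17-of-30 result at least as readily as a 50% black barrel does. That is why 30 draws can support a label you already have, but cannot pin down the contents of an unlabelled barrel.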
Your strategy’s average profit/loss, win rate, stagnation and so on are all important metrics that your backtest should report.
It’s quite impossible to characterize a whole bunch of metrics with such a small sample size. Just look at the wall of socks below!
So How Many Trades? The Short Answer…
The more the merrier.
Obviously this answer is not particularly useful or actionable. You could face practical limitations regarding sample size for the following reasons:
- It is difficult to find quality historical data before the year 2000
- You trade on the higher timeframes
- You allocate some data for out-of-sample testing
It could be a mistake to discard a promising strategy simply because it has too few trades. After all, successful trading is about making the right trade-offs.
We need a way to quantify the deterioration of backtest results arising from a small sample size.
Fortunately, statisticians have solved this dilemma for us, using a concept called standard error.
I’ll explain standard error below, then demonstrate its application using two very different backtests.
What is Standard Error?
Standard error measures the accuracy of your sampling process. It helps you gauge how reliable your backtest results are.
You can apply the standard error to any statistic, but for our trading purposes, we’ll use the mathematical expectancy (also called the average trade or the mean) of the backtest.
A trader backtesting a strategy is like a statistician sampling a population to determine some underlying parameter. To better understand what standard error means, let’s first discuss how it is used in statistics.
A Short Statistics Excursion
Imagine you want to determine the average height of 30-year-old males in a country. This height would be the parameter of interest. Its true value is unknown, but you hope to get a good estimate of it through the sampling process.
So you go about collecting 5 samples of data throughout the country, each consisting of 10 data points. You get the following plot:
To get the standard error, you do the following:
- Compute the mean height from each of the 5 samples
- Compute the standard deviation of these 5 means. Excel’s STDEV function can help with this.
The standard error is equal to the standard deviation of these sample means.
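The two steps above can be sketched in Python as follows. The population figures (a mean of 175 cm and a standard deviation of 7 cm) and the helper name `standard_error_from_samples` are assumptions made purely for illustration. `statistics.stdev` computes the sample standard deviation, which is the same quantity Excel’s STDEV returns.

```python
import random
from statistics import mean, stdev


def standard_error_from_samples(samples):
    """Standard deviation of the per-sample means (needs at least 2 samples)."""
    sample_means = [mean(sample) for sample in samples]
    return stdev(sample_means)


# Hypothetical population: heights ~ Normal(175 cm, 7 cm). These numbers are
# illustrative assumptions, not figures from the article.
rng = random.Random(42)
height_samples = [[rng.gauss(175.0, 7.0) for _ in range(10)] for _ in range(5)]

print(f"standard error ~ {standard_error_from_samples(height_samples):.2f} cm")
```

With only 5 samples the estimate itself is noisy, so treat the printed value as a rough indication rather than a precise figure.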
Standard error measures the sample-to-sample variability of the means, and tells you how far a sample mean is likely to deviate from the true population mean. The smaller the standard error, the more representative the sample will be of the overall population.
Coming back to our trading context, a small standard error means our backtest expectancy will be close to the ‘true’ value that we would obtain if we could backtest over infinite data.
How Do We Calculate Standard Error?
In trading, we don’t have the luxury of having multiple samples as shown above. We only have one, and that’s our backtest.
Fortunately, standard error can be estimated using a simple formula:
Standard error = standard deviation of individual trade profits / √(number of trades)
The more trades your backtest has, the smaller the standard error.
To get the standard deviation value, you will need the profit/loss from each individual trade in your backtest. You can easily get this value by exporting your MT4 backtest report to Excel.
- Apply Excel’s data filter function to the Type column. Remove all rows that have empty cells in the Profit column.
- Use Excel’s STDEV function to calculate the standard deviation of the individual trade profits.
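If you prefer a script to a spreadsheet, the same calculation can be sketched in Python. The standard error is the standard deviation of the individual trade profits divided by the square root of the number of trades. The function name `standard_error` and the trade values below are hypothetical. `statistics.stdev` is the sample standard deviation, matching Excel’s STDEV.

```python
from math import sqrt
from statistics import stdev


def standard_error(trade_profits):
    """Sample standard deviation of trade P/L divided by sqrt(trade count)."""
    if len(trade_profits) < 2:
        raise ValueError("need at least two trades")
    return stdev(trade_profits) / sqrt(len(trade_profits))


# Hypothetical per-trade profits in account currency (illustrative only).
example_profits = [120.0, -80.0, 45.0, -30.0, 210.0, -95.0, 60.0, 15.0]

expectancy = sum(example_profits) / len(example_profits)
print(f"expectancy     = {expectancy:.2f}")
print(f"standard error = {standard_error(example_profits):.2f}")
```

In practice you would replace `example_profits` with the Profit column exported from your own backtest report.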
How Do We Use Standard Error?
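Thanks to the central limit theorem, it is usually safe to assume that your backtest’s expectancy (the mean of many trades) is approximately normally distributed around the true expectancy, even when the individual profits/losses are not. In other words, it follows the famous ‘bell curve’ shown below: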
In a normal distribution,
- 68.3% of values lie within ±1 standard deviation of the mean
- 95.4% of values lie within ±2 standard deviations of the mean
- 99.7% of values lie within ±3 standard deviations of the mean
If you have a backtest with an expectancy of $100 and a standard error of $20, you can make use of the above information to estimate the following:
- You can be 68.3% confident that your strategy’s true expectancy lies between $80 and $120 (±1 standard error)
- You can be 95.4% confident that your strategy’s true expectancy lies between $60 and $140 (±2 standard errors)
- You can be 99.7% confident that your strategy’s true expectancy lies between $40 and $160 (±3 standard errors)
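The three statements above follow the same pattern, which a short Python sketch can reproduce. The helper name `confidence_bands` is an illustrative choice, and the normal approximation is assumed to hold.

```python
def confidence_bands(expectancy, standard_error):
    """Map confidence level (%) to (lower, upper) bounds at 1, 2 and 3 SE."""
    levels = {68.3: 1, 95.4: 2, 99.7: 3}
    return {
        level: (expectancy - k * standard_error, expectancy + k * standard_error)
        for level, k in levels.items()
    }


# The $100 expectancy and $20 standard error come from the example above.
for level, (lower, upper) in confidence_bands(100.0, 20.0).items():
    print(f"{level}% band: ${lower:.0f} to ${upper:.0f}")
```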
That’s standard error in a nutshell for you. Some statistical rigour was sacrificed to arrive at the statements above, but trading is a practical moneymaking endeavour, not an academic exercise.
Examples of Standard Error Application
I’ll use two backtests to demonstrate the application of standard error. Both are from trend following strategies trading 0.1 lots throughout the backtest.
The first strategy trades on the 15-minute timeframe.
The second strategy trades on the 4-hour timeframe.
At first glance, the second strategy seems far more promising. Expectancy and profit factor are significantly higher, although the backtest only has 290 trades. Let’s see if that causes problems.
I did the following for each strategy:
- Compute the standard error using the procedure described above
- Compute the expectancy ± 2*standard errors
Fortunately, both backtests still yield a positive expectancy after subtracting 2 standard errors.
But look at how similar their ‘adjusted’ expectancies are. Standard error has exposed the backtest uncertainty arising from a small sample of trades for the H4 strategy.
This is an unfortunate reality of trading on the higher timeframes. Such strategies typically contain larger wins and losses, giving a larger standard deviation of individual trade results. Compound this with the smaller number of trades, and you get a standard error that can really erode your backtest expectancy.
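To see how a larger standard deviation and a smaller trade count combine, here is a rough Python sketch of the ‘adjusted’ expectancy (expectancy minus 2 standard errors). All figures below are hypothetical and are NOT the results of the two backtests discussed above. The function name `adjusted_expectancy` is likewise an illustrative choice.

```python
from math import sqrt


def adjusted_expectancy(expectancy, trade_std_dev, n_trades, n_errors=2):
    """Expectancy minus `n_errors` standard errors (SE = std dev / sqrt(n))."""
    return expectancy - n_errors * trade_std_dev / sqrt(n_trades)


# Hypothetical figures for illustration only; not the article's backtests.
lower_tf = adjusted_expectancy(expectancy=5.0, trade_std_dev=50.0, n_trades=2500)
higher_tf = adjusted_expectancy(expectancy=30.0, trade_std_dev=200.0, n_trades=290)

print(f"lower timeframe:  adjusted expectancy = {lower_tf:.2f}")
print(f"higher timeframe: adjusted expectancy = {higher_tf:.2f}")
```

In this made-up case the higher-timeframe strategy starts with six times the expectancy, yet most of that advantage disappears once 2 standard errors are subtracted.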
Suppose you’re trading a daily trend following strategy that was developed on a small sample of trades, and it starts underperforming. Market conditions could have changed, or perhaps your backtest uncertainty is playing out in real time.
Wrapping Up
A large sample of trades minimizes the effects of luck, and helps ensure you’re discovering the true performance profile of your strategy.
Rather than establish a minimum acceptable number of trades, consider using the standard error to quantify your backtest’s uncertainty arising from a small sample of trades.
If your backtest’s expectancy is still positive after subtracting twice the standard error, it’s likely your strategy will be profitable over the long term.
Unfortunately for part-time retail traders, backtests of higher-timeframe strategies often contain large standard errors. Trading on the lower timeframes could alleviate this. (Hint: Algorithmic trading will come in handy.)
How many trades do you like to see in your backtest? Let me know in the comments!
Hello,
Great article and work! I am trying to figure out if I followed correctly your instructions to calculate your application of standard error on my backtest and live test.
BACKTEST (10 years): 893 trades, standard error (standard deviation 4.057/893) = 0.004, payoff ratio 1.53
Expectancy -3*SE: 1.53 – 0.012 = 1.38
LIVE TEST (3 months): 30 trades, standard error (standard deviation 3.15/30) = 0.105, payoff ratio 1.50
Expectancy -3*SE: 1.50 – 0.315 = 1.18
Did I do this right? Does it look like a good system?
Thanks a lot for your valuable feedback.
Hi John! Glad you enjoyed the article. Here’s my 2 cents:
1) To find the standard error, you need to divide the standard deviation by the square root of the number of trades (not simply the number of trades). For your backtest example above, you should divide by square root of 893, giving you a standard error of 0.14.
2) I recommend applying standard error to your backtest results, as a way to determine how reliable your backtest statistics are. If you do this for your live results, you will usually get a very large standard error because of the small number of trades. If you want to verify whether your live results match your backtest, you can apply simple statistical tests like the t-test or chi-square test. I detail the steps in my Incubation article: https://tradingtact.com/forward-testing/
3) Is it a good system? Using your backtest results, Expectancy – 3*SE = 1.53 – 0.41 = 1.12. Looks decent to me. If your live results match your backtest results, you should be doing well.
On a side note, I don’t think payoff ratio = expectancy. Payoff ratio is the average winner/average loser.
Hope that helped!
What if you do not know the expected value of your strategy and that is what you are trying to find out? Is it just what you think you should be getting per trade?
Hello Joseph,
The expected value (Net profit/# trades) is a very common metric and I believe every backtest engine should compute it for you. In MT4, it is called the expected payoff.
I wouldn’t rely on gut feeling because expectations often differ from reality in trading.
Hey Wayne, Awesome work. Thanks a lot. It was really helpful in solving the dilemma. I wanted to know if I have only 25 trades in backtest as I trade on weekly charts, is this standard error metric still helpful.
Thanks
Hi Rajiv, standard error would still be helpful. Such a small sample size would produce a large standard error, which illustrates the high level of uncertainty in your backtest.
Thanks a lot Wayne for not only giving a good content but also responding so promptly…I would like to take your help in future for my system developments and also will be happy to spread a good word 🙂
I tested almost 290 markets on 20-30 years of data, yet most of them produce only 20-80 trades, either because the system parameters are designed for longer holding times or for other reasons, such as data only being available from 2021, as in the case of Robinhood.
These are my results on the 2H timeframe. However, I am sticking to your logic of staying positive after Expectancy - 2*SE.
Example: Trades 35/PF 3.2/GtPR 2.2/Expectancy 2.17%/SE 0.91%
2.17%- 1.82%( 0.91 SE*2)=0.35%
Pls advise
Thanks
I think it would be more appropriate to use $ for expectancy. If it’s still positive after subtracting 2SE, you should be good to go.
Thanks a lot for your help, Wayne.🙏
Hello Wayne,
Thank you for all the useful content and value that you are providing.
I would like to ask:
– do you include commission + swap when calculating expectancy and STDEV of P/L? And what is your reasoning behind it?
Thanks for the answer
Hi David,
I always include spread, slippage and commissions when backtesting. These are unavoidable costs of trading, so it’s best to include them always.
Strategyquant does not have a swap input, so I don’t include that for now. I am quite conservative with my spread and slippage (2 pip spread and 1 pip slippage for liquid ECN markets), so that usually compensates for the lack of swap.
this is a tricky topic to cover. good attempt. i find that higher trade numbers as you say come with intraday strategies and so particulary in fx dont always mean its robust, but having said this id look for around 1000. then i like to see them also perform on other markets, which really gives you an indication of robustness as i have some strategies giving 5000 trades when looking at a basket of assets.
thanks for your input!