Forward testing, or incubation, is the process of demo trading your strategy. It is essentially a real-time out-of-sample test, and often provides a good estimate of your strategy’s live trading performance.
This article will help you decide whether your strategy needs to be forward tested, and discuss a few simple methods of evaluating your test results.
Should You Forward Test Your Strategy?
With a promising strategy at your fingertips, you probably have the urge to immediately start trading it with real money. After all, don’t you want to reap the monetary rewards of all your development time and effort?
Forward testing is a matter of personal preference. It certainly is the prudent thing to do, but it typically adds months to your development time. To help you decide, I recommend forward testing if your strategy has any of the following characteristics:
1. Contains Complex Rules
Complex strategies are difficult to program correctly. Even with the help of visual backtests, it is easy to overlook certain trading scenarios when debugging complex strategies. Sometimes you need to monitor your strategy in real-time before you can be confident it is bug-free.
Complex strategies are also more likely to be overfit to historical data. If your development workflow contains few robustness tests, or none at all, some real-time performance evaluation will help.
2. Uses Limit Orders or Exotic Bars
Strategies containing limit orders, or multiple orders which are only a few pips apart, are particularly sensitive to data feed and backtest engine limitations. Likewise, if your strategy uses exotic bar types like Heikin-Ashi, Renko or Kase bars, your backtest fills may be unreliable. Unless you have a good understanding of the inner workings of your backtest engine, it may be better to use these bar types for discretionary trading only.
If your strategy is particularly sensitive to incoming prices, it may be best to forward test it using a small live account instead. Demo trading may not be able to accurately simulate the liquidity of real markets.
3. Might Not Suit Your Trading Style
Forward testing gives you a chance to see the strategy in action. Historical backtests emphasize the outcome of trading a certain strategy, not the process. Without the benefit of hindsight, you may realize that trading your strategy in real time is emotionally unappealing.
For example, if you are testing a long-term trend strategy, you may realize you are uncomfortable with the low win rate and long holding periods. If you trade a short-term countertrend strategy, you may be constantly fearing that your next trade will be a large loser that erases a chunk of profits.
Emotional control is a vital component of successful trading, even if you use an algorithm. When developing your strategy, you have to keep cognitive biases at bay, and when trading your algorithm live, you need to resist the urge to interfere with its operation. Trading a strategy that is aligned with your preferences will help control your emotions. Forward testing can help determine whether your strategy suits you.
Evaluating Forward Testing Results
Your incubation results should help determine whether your backtest statistics are reliable. Comparing trading results is never a straightforward task; even two identical strategies can produce very different results when traded over different periods of time.
We will discuss three methods to estimate whether your incubation and backtest results are similar. To illustrate the use of these methods, one of my GBPJPY trend following strategies will be used. This strategy was developed using data from 2003-2019, and incubation was carried out from January-May 2020.
The figures below show the MT4 backtest equity curve and subsequent forward testing results monitored using Myfxbook.
First impressions are certainly not good, but are the test results bad enough to make me ditch the strategy? Let’s find out!
Equity Curve Comparison
For this method, the forward test equity curve is simply appended to the end of the backtest equity curve. If it is immediately obvious where the forward test period began, it is likely that your results are radically different.
For the GBPJPY strategy in question, both equity curves were recreated in Excel and combined, as shown below. All position sizes were normalized to 0.01 lots for a fair comparison.
The backtest equity curve is blue, while the forward test equity curve is orange. This portion of the equity curve appears tiny because there were only 33 trades over the 5-month forward test period, whereas there were 1233 trades over the 16-year backtest.
The last 100 trades of the combined equity curve have been magnified to highlight the strategy’s recent performance.
Are the results similar? If you were to compare the forward test results to segment 1 of the backtest equity curve, which was a very profitable series of trades, you’ll probably conclude that the strategy is broken.
But if you were to use segment 2 for comparison instead, the forward test results seem in line with expectations. It can thus be difficult to reach a meaningful conclusion through these subjective equity curve comparisons.
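The article builds the combined curve in Excel. As a rough equivalent, the appending step can be sketched in Python as below. The function name and the per-trade numbers are made up for illustration, and the inputs are assumed to be already normalized to the same position size.

```python
from itertools import accumulate

def combined_equity_curve(backtest_pnl, forward_pnl):
    """Append forward-test trades to backtest trades and return the running total.

    Both inputs are per-trade results, assumed to be normalized to the same
    position size beforehand. Returns (curve, split_index), where split_index
    is the position in `curve` of the first forward-test trade.
    """
    all_trades = list(backtest_pnl) + list(forward_pnl)
    curve = list(accumulate(all_trades))
    return curve, len(backtest_pnl)

# Made-up per-trade results, for illustration only.
backtest_pnl = [12.0, -5.0, 8.0, -3.0, 20.0]
forward_pnl = [-4.0, 6.0, -7.0]

curve, split = combined_equity_curve(backtest_pnl, forward_pnl)
print(curve[:split])  # backtest portion of the curve
print(curve[split:])  # forward-test portion of the curve
```

Keeping the split index makes it easy to colour the two portions differently when plotting, or to slice off the last 100 trades for a closer look.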
It is evident that we need a more objective and actionable method of evaluating our forward test results. The following sections will cover two statistical tests which may help us do this.
Some statistical rigour will be sacrificed in exchange for speed and simplicity. For our trading purposes, I believe this is a good compromise.
T-Test
The t-test can be used to determine whether two data samples have the same mean. In our context, this means determining whether your backtest and forward test results have a similar average trade. I prefer to measure the average trade in pips, rather than absolute dollars, since the pip value of each currency changes with time.
The input data for the t-test will be the pips gained/lost for each trade in your backtest and forward test history. By default, the MT4 backtest report shows each trade’s profit/loss in dollars; you can open the MT4 report in QuantAnalyzer to obtain the trade outcome in pips instead. The forward test results can be downloaded from Myfxbook in .csv format, and the file already contains a ‘pips’ column by default.
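If you would rather prepare the data outside Excel, the pips column can be pulled from a CSV export with a few lines of Python. This is only a sketch. The sample rows and the other column names below are hypothetical, and the only assumption taken from the article is that a ‘pips’ column exists, so check the header of your own export.

```python
import csv
import io

def load_pips(csv_text, column="pips"):
    """Return the per-trade pip results from CSV text.

    The column name is matched case-insensitively, and rows with a blank
    value in that column are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Map lower-cased header names back to the names used in the file.
    headers = {name.strip().lower(): name for name in (reader.fieldnames or [])}
    key = headers.get(column.lower())
    if key is None:
        raise KeyError(f"no '{column}' column found in {reader.fieldnames}")
    return [float(row[key]) for row in reader if (row[key] or "").strip()]

# Hypothetical export, for illustration only.
sample = """Symbol,Action,Pips
GBPJPY,Buy,35.2
GBPJPY,Sell,-12.5
GBPJPY,Buy,-8.0
"""

forward_pips = load_pips(sample)
print(forward_pips)
```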
How many trades do you need in your forward test sample?
Sample size is a hotly debated topic in statistics, but a common rule of thumb is that around 30 trades is the minimum for a usable estimate. A larger sample will always give a more reliable result.
The 30-trade guideline applies only when you are comparing trading results. If you’re backtesting a strategy for the first time, you’ll want as many trades as possible.
Before running the test, let’s introduce some essential statistical terms:
Null Hypothesis
This is an initial assumption which states that both sets of trading results have the same average trade. Depending on the test result, we will decide whether or not to reject this hypothesis.
Significance Level (Alpha Level)
This is the probability of incorrectly rejecting the null hypothesis. In our context, this means incorrectly deciding that the backtest and incubation results are different. We will use a 0.05 (5%) probability.
P-Value
This is the probability of obtaining a difference at least as large as the one seen in your forward test sample, assuming that the null hypothesis is true. A smaller p-value is stronger evidence that the two sets of results are different.
Fortunately, the t-test is part of Excel’s Analysis ToolPak. You should select ‘t-Test: Two-Sample Assuming Unequal Variances’. Once you paste in your trading results and run the test as instructed in this video, your output should resemble that below.
The hypothesized mean difference is 0 because the null hypothesis states that both test samples have the same average trade.
Since the p-value of 0.19 is larger than the significance level of 0.05, we fail to reject the null hypothesis.
In other words, there is no statistically significant evidence that the two sets of results have different average trades, and the differences observed are consistent with chance. This does not guarantee that the strategy will indeed perform as indicated in the backtest, but given the forward test results currently available, we do not have sufficient grounds to reject the strategy.
Chi-Square Test
The chi-square test can be used to determine whether your backtest and forward test results have a similar win rate. Unfortunately, Excel does not include this test by default, so some manual work is required.
For the chi-square test, the null hypothesis states that there is no significant difference between the observed and expected win rates during forward testing. The extent to which your observed and expected win rates differ can be quantified using the chi-square statistic, χ².
Let’s compute this statistic for our trend strategy above, then use it to determine whether to accept or reject the null hypothesis.
Step 1: Compute the expected number of wins from forward testing
The strategy’s historical tick backtest shows a 37% win rate. Applying this to our forward test basket of 33 trades, we can expect 12.21 wins and 20.79 losses. These are intermediate calculations; ignore the decimal values for now.
Step 2: Compute the chi-square test statistic
Obtain the number of wins observed during forward testing. For this case, we had 12 wins and 21 losses during the 5-month test.
Then compute the test statistic using the equation below. ‘O’ indicates the observed result, while ‘E’ indicates the expected result.
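Given the definitions of ‘O’ and ‘E’, the statistic is presumably the standard Pearson chi-square form, with the sum taken over every possible outcome i:

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```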
Every possible test outcome needs to be taken into account when computing the statistic. In our case, there are only 2 possible trade outcomes: winning and losing.
The values corresponding to these outcomes are then summed to give us 0.006. A small value means there is a strong similarity between your backtest and incubation win rates. We expected 12.21 wins over the 33 incubation trades, and we actually observed 12, which is almost exactly as expected. Of course, such a strong similarity is the exception, rather than the norm.
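Steps 1 and 2 can be checked with a short Python sketch. It assumes the standard Pearson form of the statistic (the sum of (O - E)² / E over all outcomes) and uses the figures quoted above. The function and variable names are my own.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over every possible outcome."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n_trades = 33
backtest_win_rate = 0.37

# Step 1: expected wins and losses, kept as decimals (about 12.21 and 20.79).
expected = [n_trades * backtest_win_rate, n_trades * (1 - backtest_win_rate)]

# Step 2: observed wins and losses from the forward test.
observed = [12, 21]

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 3))  # 0.006
```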
Step 3: Obtain the chi-square critical value for our test
This critical value is compared to the test statistic above to determine whether the null hypothesis should be accepted or rejected.
To get this value, we need to refer to the chi-square distribution table. Similar to the t-test above, we will use a significance level of 0.05.
We also need to determine the degrees of freedom for this test, which is simply equal to the number of possible outcomes minus 1. With 1 degree of freedom and a 0.05 significance level, we get a critical value of 3.84.
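If no table is at hand, the 1-degree-of-freedom critical value can be reproduced with Python's standard library. The sketch relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so it only covers this two-outcome case. The function name is my own.

```python
from statistics import NormalDist

def chi_square_critical_1df(alpha=0.05):
    """Critical value of the chi-square distribution with 1 degree of freedom."""
    # With 1 degree of freedom, the critical value is the square of the
    # two-sided standard normal quantile at the same significance level.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z

print(round(chi_square_critical_1df(0.05), 2))  # 3.84
```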
Step 4: Determine whether to accept or reject the null hypothesis
Since our chi-square test statistic of 0.006 is smaller than the critical value of 3.84, we fail to reject the null hypothesis that the backtest and forward test results have the same win rate. Any variability in win rate is consistent with chance alone.
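Comparing the statistic with the critical value is equivalent to comparing a p-value with the significance level. The sketch below does the latter for the 1-degree-of-freedom case only, using the identity P(χ² ≥ x) = erfc(√(x/2)). The function names are my own.

```python
import math

def chi_square_p_value_1df(statistic):
    """P(X >= statistic) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))

def win_rates_differ(statistic, alpha=0.05):
    """True if the null hypothesis of equal win rates is rejected at `alpha`."""
    return chi_square_p_value_1df(statistic) < alpha

p_value = chi_square_p_value_1df(0.006)
print(round(p_value, 2), win_rates_differ(0.006))  # 0.94 False
```

A p-value this close to 1 simply restates how small the gap is between the observed and expected win counts.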
The chi-square computations above assume that the expected frequency for each trade outcome is at least 5. Since we expected roughly 12 wins and 21 losses from the forward test sample, any inaccuracies should be minimal. If this assumption does not hold, Yates's correction should be applied.
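As a rough illustration of the correction, the sketch below subtracts 0.5 from each absolute difference before squaring. Flooring the adjusted difference at zero is a common safeguard that I have added, not something stated in the article, and the 10-trade sample is hypothetical.

```python
def expected_counts_ok(expected, threshold=5):
    """True if every expected frequency meets the usual minimum of 5."""
    return all(e >= threshold for e in expected)

def chi_square_yates(observed, expected):
    """Chi-square statistic with Yates's continuity correction.

    Each |O - E| is reduced by 0.5 (but not below zero) before squaring.
    """
    return sum(max(abs(o - e) - 0.5, 0.0) ** 2 / e
               for o, e in zip(observed, expected))

# Hypothetical small sample: 10 trades at a 37% expected win rate.
expected = [3.7, 6.3]
observed = [6, 4]

if not expected_counts_ok(expected):
    print(round(chi_square_yates(observed, expected), 2))  # 1.39
```

The uncorrected statistic for the same sample would be about 2.27, so the correction makes the test more conservative when the sample is small.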
What if Forward Testing Fails?
Even the best strategies have unprofitable periods. Depending on your trading timeframe, this could range from days to even years. It is entirely possible that your forward test occurred under unfavourable market conditions, and poor results do not necessarily indicate that your strategy is fundamentally broken.
Nonetheless, if the strategy still fails with a sufficiently large forward test sample, my personal preference is to discard it and focus on developing others. I like to maintain a farm of decent strategies, rather than focus on a small handful of great strategies, since there is no way of knowing which strategies will underperform or outperform in the near future.
Alternatively, you can continue forward testing the strategy to see if its performance improves.
Conclusion
Forward testing your strategy is a matter of personal preference, but is especially useful if your strategy is complex or highly sensitive to incoming prices. It is the most accurate way to test your strategy without having real money on the line.
Test results can be evaluated visually by doing an equity curve comparison, or mathematically through t-tests and chi-square tests. These can easily be done using common software such as Excel. Using all three methods concurrently will likely yield the most reliable results.
If your strategy passes forward testing, it is probably fit for live trading, ideally as part of a diverse portfolio. Portfolio composition will be discussed next.