What Is Backtesting? And Why Most Backtests Are Broken

What Backtesting Is

A backtest takes a set of encoded trading rules — when to enter, when to exit, how much to size — and simulates those rules against historical market data. The output is a hypothetical performance record: equity curve, win rate, drawdown, Sharpe ratio, and a ledger of every trade the strategy would have taken.
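As a rough illustration of these mechanics, the sketch below runs a toy long-only rule over a list of closing prices and produces a trade ledger, an equity curve, a win rate, and a maximum drawdown. The moving-average rule, the function names `backtest_sma` and `summarize`, and the one-unit position size are assumptions made for the example, and costs are deliberately left out.

```python
from statistics import mean


def backtest_sma(prices, window=3):
    """Toy rule: hold one unit while the prior close is above its trailing average."""
    trades = []      # ledger of (entry_index, exit_index, pnl)
    equity = []      # realized plus open P&L after each bar
    realized = 0.0
    entry = None     # (index, price) of the open position, if any
    for i in range(window, len(prices)):
        # The signal reads only closes up to bar i-1; the fill uses bar i's price.
        sma = mean(prices[i - window:i])
        want_long = prices[i - 1] > sma
        if want_long and entry is None:
            entry = (i, prices[i])
        elif not want_long and entry is not None:
            pnl = prices[i] - entry[1]
            trades.append((entry[0], i, pnl))
            realized += pnl
            entry = None
        open_pnl = prices[i] - entry[1] if entry is not None else 0.0
        equity.append(realized + open_pnl)
    return trades, equity


def summarize(trades, equity):
    """Win rate over closed trades and maximum peak-to-trough drop in equity."""
    wins = sum(1 for _, _, pnl in trades if pnl > 0)
    win_rate = wins / len(trades) if trades else float("nan")
    peak, max_drawdown = 0.0, 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, peak - value)
    return {"trades": len(trades), "win_rate": win_rate, "max_drawdown": max_drawdown}
```

Everything this returns describes the price list it was given and nothing else, which is the narrow sense in which a backtest result should be read.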

The purpose is narrow. A backtest tells you whether a rule would have worked on the specific dataset it was tested against. Treated as such, it's the first credible filter: strategies with no historical edge get thrown out before a single real dollar is committed. Treated as a forecast, it lies.

The U.S. Commodity Futures Trading Commission recognizes this distinction directly. CFTC Rule 4.41, the Commission's rule on advertising by commodity pool operators and commodity trading advisors, requires that any presentation of hypothetical trading results, backtests included, carry an explicit disclosure that hypothetical results do not represent actual trading and are inherently limited.

How Backtests Lie

Four failure modes account for the overwhelming majority of backtest-to-live disappointment. Each has been documented in the academic record and each is avoidable in principle.

1. Overfitting (data-snooping bias)

Overfitting occurs when a strategy's parameters are tuned until the strategy fits the noise in the sample rather than any persistent pattern. Bailey, Borwein, López de Prado, and Zhu formalized this as the probability of backtest overfitting, and showed that even a modest parameter search (trying, say, a dozen different moving-average lengths) can make a strategy with a true expected return of zero look profitable with high probability.

The warning sign is a backtest that looks too clean: smooth equity curve, high Sharpe, no ugly drawdowns. Markets are noisy. Backtests that are not noisy have typically had their noise optimized away.
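The selection effect behind overfitting can be seen with pure noise. The sketch below is not the probability-of-backtest-overfitting estimator from the paper cited above; it is a simpler illustration that generates a dozen zero-edge return series, standing in for a dozen parameter settings, and reports the best annualized Sharpe ratio among them. The function names, the 1% daily volatility, and the 252-day sample are assumptions.

```python
import random
from statistics import mean, stdev


def annualized_sharpe(returns, periods_per_year=252):
    """Mean over standard deviation of per-period returns, scaled to a year."""
    return mean(returns) / stdev(returns) * periods_per_year ** 0.5


def best_of_n_noise(n_variants=12, n_days=252, seed=0):
    """Sharpe ratios of n_variants series whose true expected return is zero."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_variants):
        # Each "variant" is just Gaussian noise with zero mean: no edge exists.
        daily = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(daily))
    return max(sharpes), sharpes
```

Reporting only the maximum is what a parameter search does implicitly. The more variants are tried, the higher the best result tends to be, even though none of them has any edge.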

2. Look-ahead bias

Look-ahead bias occurs when a strategy uses information at time t that would not have been known until time t + k. The classic example is using a session's closing value to make a decision that supposedly happened during the session. Another is using adjusted close prices, which are revised in response to later splits and dividends, as if they were the prices available at the original timestamp.

Look-ahead bias silently inflates performance. A backtest with a subtle look-ahead can show a Sharpe ratio above 3 on data that, traded cleanly, produces nothing. Detection requires a careful audit of every data point the strategy reads and of the time at which that data point would actually have been available.
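A minimal sketch of the effect, assuming a driftless random walk and a toy "long after an up bar" rule: with `lag=0` the signal peeks at the close of the very bar whose return it collects, and with `lag=1` it reads only the previous bar. The names `random_walk` and `strategy_pnl` are hypothetical.

```python
import random


def random_walk(n_steps, seed=1):
    """Closing prices with no drift, so no rule has a true edge on them."""
    rng = random.Random(seed)
    closes = [100.0]
    for _ in range(n_steps):
        closes.append(closes[-1] + rng.gauss(0.0, 1.0))
    return closes


def strategy_pnl(closes, lag):
    """Sum of bar-to-bar changes collected while the signal says long.

    lag=0 decides using the current bar's own close (look-ahead);
    lag=1 decides using only the bar that had already closed.
    """
    total = 0.0
    for t in range(2, len(closes)):
        bar_change = closes[t] - closes[t - 1]
        s = t - lag
        signal_long = closes[s] > closes[s - 1]
        if signal_long:
            total += bar_change
    return total
```

With `lag=0` the rule collects only the up bars, so its result is positive by construction and can never be lower than the `lag=1` result. The one-bar shift is the entire difference between the two runs.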

3. Transaction-cost omission

A backtest that ignores commissions, exchange fees, and slippage is measuring a different system than the one that will trade live. For a retail futures strategy with a round-trip cost of $4.50 per contract plus one tick of slippage, high-turnover systems can lose most or all of their gross edge to cost.

A reasonable rule of thumb: if a system's gross backtest result barely beats costs, it has no real edge. If the system's profitability is entirely a function of ignoring costs, it will lose money live.
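The arithmetic is simple enough to sketch. The example below uses the $4.50 round-trip cost from the text and assumes, for illustration only, a tick value of $12.50 and one tick of slippage per round trip; the function names and the gross figures in the note that follows are likewise hypothetical.

```python
def cost_per_round_trip(commission_rt=4.50, tick_value=12.50, slippage_ticks=1.0):
    """Dollar cost of one round trip per contract: commissions plus slippage."""
    return commission_rt + slippage_ticks * tick_value


def net_edge_per_trade(gross_per_trade, **cost_kwargs):
    """Average gross profit per trade minus the cost of trading it."""
    return gross_per_trade - cost_per_round_trip(**cost_kwargs)


def cost_share_of_gross(gross_per_trade, **cost_kwargs):
    """Fraction of the gross edge consumed by costs (above 1.0 means a net loss)."""
    return cost_per_round_trip(**cost_kwargs) / gross_per_trade
```

Under these assumptions each round trip costs $17.00, so a system averaging $20 gross per trade keeps $3 and gives up 85% of its edge, while one averaging $15 gross loses money on every trade.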

4. Insufficient sample

A strategy tested on a single market regime, a single instrument, or fewer than roughly 100 trades is statistically underpowered. The question is whether the observed performance is distinguishable from luck, and with a small sample, it usually isn't. A 15-trade backtest with a 65% win rate tells you essentially nothing about the system's true expected win rate.
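One way to put a number on this is a confidence interval for the win rate. The sketch below uses the Wilson score interval, which is one common choice among several, and assumes trades are independent, which real trades often are not. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(win_rate, n_trades, confidence=0.95):
    """Wilson score interval for a true win rate, given an observed one."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    denom = 1.0 + z * z / n_trades
    center = (win_rate + z * z / (2.0 * n_trades)) / denom
    half_width = z * sqrt(
        win_rate * (1.0 - win_rate) / n_trades + z * z / (4.0 * n_trades ** 2)
    ) / denom
    return center - half_width, center + half_width
```

For 15 trades at a 65% win rate, `wilson_interval(0.65, 15)` spans roughly 40% to 84%, a range that includes coin-flip performance. The same observed win rate over several hundred trades gives a far narrower band.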

What Good Backtesting Looks Like

The standard corrections are well documented in the literature and easy to describe, if not always easy to execute.

Out-of-sample testing. A portion of the historical data — typically the most recent 20–30% — is held out and not used during parameter selection. The strategy is tested against that held-out data as the final validation. A large drop in performance between the in-sample and out-of-sample segments is strong evidence of overfitting.
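A hold-out split of this kind might be sketched as follows, assuming the bars are already in chronological order. The function name and the 25% default are illustrative choices within the range given above.

```python
def chronological_split(bars, holdout_fraction=0.25):
    """Split time-ordered bars into in-sample and most-recent out-of-sample parts."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    cut = int(len(bars) * (1.0 - holdout_fraction))
    # No shuffling: the held-out segment must come strictly after the rest.
    return bars[:cut], bars[cut:]
```

The split is by position rather than at random because shuffling time-ordered data would let later bars leak into parameter selection. The held-out segment only works as validation if it is examined once, after the parameters are fixed.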

Walk-forward analysis. Instead of a single train/test split, the data is walked forward: the strategy is optimized on a rolling window, tested on the subsequent period, then re-optimized on a window that includes that period, and so on. This simulates the act of re-calibrating a live strategy over time and is far more informative than a one-shot split.
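The window bookkeeping for a rolling walk-forward can be sketched as a generator of index ranges. The name `walk_forward_windows` and the choice to advance by exactly one test period are assumptions; an anchored (expanding) training window is a common variant not shown here.

```python
def walk_forward_windows(n_bars, train_size, test_size):
    """Yield ((train_start, train_end), (test_start, test_end)) half-open ranges."""
    start = 0
    while start + train_size + test_size <= n_bars:
        train = (start, start + train_size)
        test = (start + train_size, start + train_size + test_size)
        yield train, test
        # Slide forward so the next training window absorbs the period just tested.
        start += test_size
```

For example, `list(walk_forward_windows(10, 4, 2))` yields three folds, with test ranges (4, 6), (6, 8), and (8, 10). Each test range sits immediately after its training range, so no test bar is ever used to choose the parameters that are evaluated on it.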

Realistic costs. Commissions, fees, and slippage are built into the simulation at conservative levels. For futures, a retail-realistic slippage model is typically one tick worse than midpoint for market orders.
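A fill model along these lines might look like the sketch below, which prices a simulated market order one tick worse than the quote midpoint. The function name, the `side` strings, and the example quote are assumptions; a real simulator would also need to handle order size and thin books.

```python
def market_fill_price(bid, ask, side, tick_size, slippage_ticks=1):
    """Simulated market-order fill: midpoint moved against the trader by N ticks."""
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    mid = (bid + ask) / 2.0
    adjustment = slippage_ticks * tick_size
    # Buys fill above the midpoint, sells below it.
    return mid + adjustment if side == "buy" else mid - adjustment
```

With a quote of 100.00 bid and 100.25 ask and a tick size of 0.25, a buy fills at 100.375 and a sell at 99.875, so a round trip gives up two ticks relative to the midpoint before commissions.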

Sufficient sample. The standard quantitative finance heuristic is that a strategy should have at least several hundred trades across multiple market regimes before any stable inference about expected performance is possible.

What a Backtest Is Not

A backtest is not proof of future performance. It is not evidence that the rules "work." It is not a substitute for live trading. At its best, a backtest is a filter — a way of rejecting ideas that clearly have no historical edge — and a sanity check on the mechanics of execution. It is a necessary step in strategy development. It is not, by itself, a sufficient one.

The distinction matters in a regulatory sense as well: CFTC Rule 4.41 exists because hypothetical results have historically misrepresented what actually trading a strategy would have produced. The rule is not pro forma. It is a response to a documented pattern of retail investors treating backtests as forecasts.

Conclusion

Backtesting is a powerful tool and a common failure point. The difference between a backtest that tells you something true and a backtest that tells you what you want to hear is almost entirely a matter of methodology: out-of-sample discipline, realistic costs, adequate sample size, and skepticism about results that look too clean to be real.

The research literature on this is clear and has been for at least two decades. Most of the strategies that fail live did not fail because the market changed. They failed because the backtest never measured what their developer thought it was measuring.

Disclaimer: FalcoAlgo is a software product of Falco Systems LLC and is not a registered investment adviser. This article is for educational purposes only and does not constitute investment, trading, tax, or legal advice. Futures trading involves substantial risk of loss. Hypothetical performance results have inherent limitations and are not indicative of future results.

Falco Insights Editorial

Falco Insights publishes research-grade, non-promotional educational content on algorithmic trading, quantitative strategies, futures market structure, risk management, and systematic execution. Every article is source-verified against academic, regulatory, and exchange primary sources.
