The Problem with Backtests
Most backtests overestimate future performance. Not because the underlying ideas are wrong, but because the simulation contains implicit assumptions that do not hold in live trading. The gap between backtest Sharpe and live Sharpe is one of the most reliable features of systematic trading — understanding it is the difference between a framework that generates real alpha and one that generates spreadsheet confidence.
The failures cluster into three categories: data problems, execution problems, and optimization problems.
Data Problems
Look-Ahead Bias
Look-ahead bias means using information that wasn’t available at the decision time. It is the most common and most damaging error.
Examples:
- Using end-of-day prices to determine signals computed “during” the day
- Adjusting historical prices for splits retroactively without tracking which adjustments were known at which time
- Including a company’s annual report data as of the fiscal period end rather than the date it was publicly released
Fix: every data point must be tagged with its availability timestamp. Signals can only use data available strictly before the decision timestamp. This requires careful data pipeline design — most data vendor feeds do not include availability timestamps by default.
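As a minimal sketch of this discipline (the names here are illustrative, not from any particular framework), each data point carries both its event time and its availability time, and signal code only sees points whose availability time strictly precedes the decision timestamp:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DataPoint:
    value: float
    event_time: datetime       # when the underlying event occurred (e.g. period end)
    available_time: datetime   # when the data became publicly knowable

def visible_points(points, decision_time):
    """Return only data points that were knowable strictly before decision_time."""
    return [p for p in points if p.available_time < decision_time]
```

The key design choice is that signal code never receives the raw feed, only the output of a filter like `visible_points`, so look-ahead becomes structurally impossible rather than a matter of reviewer vigilance.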
Survivorship Bias
Backtests on “the current Nifty 50 constituents” include only stocks that survived to today. Stocks that went bankrupt, were delisted, or were dropped from the index are excluded — but they were part of the investable universe at the time.
Fix: use point-in-time index membership data. The backtest must use the actual historical composition, not the current one. For equity backtests, this requires a database that tracks historical index additions and deletions.
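A point-in-time universe lookup can be sketched as follows; the membership records and dates below are illustrative placeholders, not a real constituent history:

```python
from datetime import date

# Hypothetical membership records: (symbol, date added, date removed or None).
memberships = [
    ("RELIANCE", date(1996, 4, 1), None),
    ("SATYAM",   date(2000, 1, 1), date(2009, 1, 12)),  # illustrative delisting
]

def universe_on(records, as_of):
    """Index constituents as of a historical date, including names later dropped."""
    return {sym for sym, added, removed in records
            if added <= as_of and (removed is None or as_of < removed)}
```

A backtest that calls `universe_on` at each rebalance date will include the stocks that later failed, which is exactly what the survivorship fix requires.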
Data Snooping
If you tested 100 parameter combinations and selected the best-performing one, your reported performance reflects the best outcome of 100 experiments — not an unbiased estimate of the strategy’s true performance. With enough parameters to tune, any strategy can appear profitable in-sample.
Fix: define the full parameter space before running a single backtest. Reserve a true holdout period that you look at exactly once at the end, after all parameter selection is final. If you look at the holdout and then tune more, it is no longer a holdout.
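One way to enforce this discipline in code is to pre-register the full grid and carve off the holdout before any backtest runs; the grid values below are placeholders:

```python
from itertools import product

# Pre-registered BEFORE the first backtest. Adding values later is data snooping.
PARAM_GRID = {"lookback": [10, 20, 50], "threshold": [0.5, 1.0, 2.0]}

def all_combinations(grid):
    """Expand the full pre-registered parameter space."""
    keys = sorted(grid)
    return [dict(zip(keys, vals)) for vals in product(*(grid[k] for k in keys))]

def split_holdout(data, holdout_fraction=0.2):
    """Reserve the final slice of history; tune only on the first piece."""
    cut = int(len(data) * (1 - holdout_fraction))
    return data[:cut], data[cut:]
```

The holdout slice returned by `split_holdout` should be evaluated exactly once, after every parameter in `PARAM_GRID` has been selected.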
Execution Problems
Slippage
Orders do not fill at the mid-price. In practice:
- Market orders fill at the offer (buy) or bid (sell) — paying the spread
- Large orders move the market against you (market impact)
- During volatile periods, the spread widens and fills are worse
A backtest that assumes mid-price fills overstates returns, particularly for strategies that trade frequently or in less liquid instruments.
Fix: model slippage explicitly. For index options: assume you pay half the bid-ask spread on entry and exit. For near-expiry weekly options, the spread widens significantly for OTM strikes — model this separately from ATM. Subtract slippage before evaluating strategy viability — if it doesn’t work with realistic slippage, it doesn’t work.
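A sketch of such a slippage model; the spread fractions and widening factor below are assumptions for illustration, not calibrated values:

```python
def assumed_spread(mid_price, is_atm, days_to_expiry):
    """Illustrative bid-ask spread model: ATM options quote tighter, and
    OTM strikes widen further near expiry. Parameters are assumptions."""
    base = 0.005 if is_atm else 0.02   # spread as a fraction of mid
    if days_to_expiry <= 2 and not is_atm:
        base *= 2.0                     # spreads widen into expiry for OTM
    return mid_price * base

def entry_fill(mid_price, is_atm, days_to_expiry):
    """Pay half the modeled spread on entry (and again on exit)."""
    return mid_price + 0.5 * assumed_spread(mid_price, is_atm, days_to_expiry)
```

Subtracting `entry_fill` minus mid on every leg, before looking at the equity curve, is what "evaluate viability after slippage" means in practice.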
At Mastertrust, the backtesting framework included separate slippage models per instrument: options on Nifty weekly expiries versus monthly, and the difference between liquid ATM strikes and illiquid deep OTM strikes. The framework tracked realized slippage in live trading and fed it back to update the simulation parameters quarterly.
Transaction Costs
Commission costs compound over time. A strategy paying 0.1% commission per trade and making 500 trades per year surrenders 50 percentage points of gross annual return to commissions alone, before any slippage. Many strategies that look profitable before costs are losers after.
Fix: model brokerage, exchange fees, STT, and GST explicitly per trade type (equity vs. F&O have different STT structures in India). Calculate break-even trading frequency — how many trades per year the strategy can support before costs exceed expected alpha.
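The break-even calculation is simple arithmetic; the cost components below are illustrative placeholders, not actual fee schedules:

```python
def all_in_cost(brokerage, exchange_fee, stt, gst_rate=0.18):
    """Per-trade cost as a fraction of notional. GST applies to brokerage and
    exchange fees but not to STT; rates here are illustrative inputs."""
    return (brokerage + exchange_fee) * (1 + gst_rate) + stt

def breakeven_trades_per_year(expected_gross_alpha, cost_per_trade):
    """Maximum trades per year before per-trade costs consume the expected
    gross alpha (both expressed as fractions of capital)."""
    return expected_gross_alpha / cost_per_trade
```

For example, 15% expected gross alpha against an all-in cost of 5 basis points per trade supports at most 300 trades per year before costs alone erase the edge.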
Execution Latency
For medium-frequency strategies (hours to days), latency is usually negligible. For intraday strategies, the assumption that you can execute at bar-close prices at decision time is wrong — by the time a signal fires and an order reaches the exchange, the price has moved.
Fix: simulate execution with a realistic delay. For rule-based strategies, assume next-bar open execution rather than current-bar close.
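A next-bar-open execution rule can be sketched as follows (bar and signal structures are illustrative):

```python
def simulate_next_bar_open(bars, signals):
    """Fill each nonzero signal at the OPEN of the bar after it fires,
    never at the close of the bar that generated it.
    bars: list of dicts with at least an "open" key; signals: +1/0/-1 per bar.
    Returns (fill_bar_index, fill_price, direction) tuples."""
    fills = []
    for i, sig in enumerate(signals):
        if sig != 0 and i + 1 < len(bars):
            fills.append((i + 1, bars[i + 1]["open"], sig))
    return fills
```

Note that a signal on the final bar produces no fill at all, which is also realistic: there was no next bar to trade.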
Optimization Problems
Curve Fitting
A strategy optimized to maximize Sharpe on 10 years of Nifty data will find the set of parameters that happen to work over that specific period — not the parameters that reflect a genuine market inefficiency. Add enough degrees of freedom and you can fit noise perfectly.
Heuristics for detecting curve fitting:
- Performance degrades sharply when parameters are nudged slightly (lack of robustness)
- Performance in the walk-forward window is much lower than in-sample
- The strategy has many more parameters than meaningful market states it is trading
Fix: evaluate parameter robustness before reporting performance. The optimization profile (performance vs. parameter value) should show a broad region of positive performance around the selected parameters, not a narrow spike. If the optimal value sits at a cliff edge, it will not survive live trading.
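One simple robustness test checks for a plateau rather than a spike around the chosen parameter; the neighborhood size and degradation threshold below are arbitrary choices, not recommendations:

```python
def robustness_check(profile, best_idx, neighborhood=2, max_degradation=0.5):
    """profile: performance (e.g. Sharpe) indexed by parameter value.
    Pass only if every nearby parameter value retains at least
    (1 - max_degradation) of the peak performance: a plateau, not a spike."""
    best = profile[best_idx]
    lo = max(0, best_idx - neighborhood)
    hi = min(len(profile), best_idx + neighborhood + 1)
    return min(profile[lo:hi]) >= (1 - max_degradation) * best
```

A broad plateau passes; a single sharp peak surrounded by poor neighbors fails, which is exactly the "cliff edge" the text warns about.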
Walk-Forward Validation
Walk-forward testing is the standard for validating systematic strategies. The procedure:
- Divide the full data history into N windows
- Optimize the strategy on the first M windows (in-sample)
- Test the optimized parameters on window M+1 (out-of-sample)
- Roll forward: optimize on windows 1 through M+1, test on M+2
- Repeat until all windows are exhausted
The out-of-sample performance across all walk-forward windows is your realistic performance estimate. The ratio of walk-forward Sharpe to in-sample Sharpe is the walk-forward efficiency.
Empirical target: walk-forward efficiency ≥ 0.5. If in-sample Sharpe is 2.0 and walk-forward Sharpe is 1.0, the efficiency is 0.5 and you likely have a real strategy. If walk-forward Sharpe is 0.2, the in-sample result was mostly noise.
Two variants: anchored (in-sample always starts from the same date) and rolling (in-sample window moves forward with the same fixed length). Anchored is more conservative because it always uses more data; rolling is more realistic because it captures how the strategy would have been retrained in practice.
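The anchored variant of the loop can be sketched as follows; the `optimize` and `evaluate` callables are left abstract, since they are strategy-specific:

```python
def walk_forward_anchored(data, n_windows, n_insample, optimize, evaluate):
    """Anchored walk-forward: optimize on windows [0, m), test on window m,
    then grow the in-sample set by one window and repeat.
    optimize(train) -> params; evaluate(params, test) -> performance score.
    Returns the list of out-of-sample scores."""
    size = len(data) // n_windows
    windows = [data[i * size:(i + 1) * size] for i in range(n_windows)]
    oos_scores = []
    for m in range(n_insample, n_windows):
        train = [x for w in windows[:m] for x in w]  # anchored: all prior windows
        params = optimize(train)
        oos_scores.append(evaluate(params, windows[m]))
    return oos_scores
```

The rolling variant differs only in the `train` line, slicing `windows[m - n_insample:m]` instead of `windows[:m]`. Walk-forward efficiency is then the ratio of the aggregate out-of-sample score to the in-sample score.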
Monte Carlo Validation
Monte Carlo simulation addresses a different question: given this strategy’s historical trade distribution, what range of outcomes should you expect?
Input: the distribution of individual trade returns (wins, losses, their sizes)
Procedure: simulate 10,000 sequences of trades drawn randomly from this distribution (with replacement)
Outputs:
- Distribution of final equity outcomes
- Distribution of maximum drawdowns
- Risk of ruin (probability equity falls below a stop-trading threshold)
- Distribution of annual returns
The value is not the point estimate — it is the range. A strategy with median 30% annual return and 95th-percentile maximum drawdown of 45% requires different position sizing than one with the same median return and 95th-percentile drawdown of 15%.
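A minimal bootstrap implementation of the procedure above (the ruin threshold is an example input, not a recommendation):

```python
import random

def monte_carlo(trade_returns, n_trades, n_sims=10_000, ruin_level=0.5, seed=0):
    """Resample trade sequences with replacement and summarize outcomes.
    trade_returns: historical per-trade returns as fractions, compounded.
    Returns (final equities, max drawdowns, probability of breaching ruin_level)."""
    rng = random.Random(seed)
    finals, max_dds = [], []
    for _ in range(n_sims):
        equity, peak, max_dd = 1.0, 1.0, 0.0
        for _ in range(n_trades):
            equity *= 1 + rng.choice(trade_returns)
            peak = max(peak, equity)
            max_dd = max(max_dd, 1 - equity / peak)
        finals.append(equity)
        max_dds.append(max_dd)
    risk_of_ruin = sum(f < ruin_level for f in finals) / n_sims
    return finals, max_dds, risk_of_ruin
```

Percentiles of `finals` and `max_dds` give the outcome ranges the text describes; sizing to the 95th-percentile drawdown rather than the historical one is the practical payoff.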
Reporting Metrics That Matter
A complete backtesting report covers:
| Metric | What It Measures | Target |
|---|---|---|
| Total net profit | Absolute return | Positive |
| CAGR | Annualized compound growth | Depends on risk |
| Sharpe ratio | Return per unit of volatility | > 1.5 live, > 2 backtest |
| Calmar ratio | CAGR / Max Drawdown | > 2 |
| Profit factor | Gross wins / gross losses | > 1.5 |
| Tharp expectancy | Average R per trade | Positive |
| Maximum drawdown | Worst peak-to-trough decline | Context-dependent |
| Max drawdown duration | How long the worst drawdown lasted | < 6 months for most styles |
| Win rate | % of trades profitable | Depends on strategy |
| R² of equity curve | Linearity of growth | > 0.9 |
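A few of the table's metrics in straightforward form (pure-Python implementations for illustration; a production framework would use vectorized equivalents):

```python
def profit_factor(trade_pnls):
    """Gross wins divided by gross losses."""
    wins = sum(p for p in trade_pnls if p > 0)
    losses = -sum(p for p in trade_pnls if p < 0)
    return wins / losses if losses else float("inf")

def max_drawdown(equity):
    """Worst peak-to-trough decline as a fraction of the peak."""
    peak, dd = equity[0], 0.0
    for e in equity:
        peak = max(peak, e)
        dd = max(dd, (peak - e) / peak)
    return dd

def equity_r2(equity):
    """R² of a linear fit to the equity curve; closer to 1 means steadier growth."""
    n = len(equity)
    mx, my = (n - 1) / 2, sum(equity) / n
    sxy = sum((x - mx) * (y - my) for x, y in enumerate(equity))
    sxx = sum((x - mx) ** 2 for x in range(n))
    syy = sum((y - my) ** 2 for y in equity)
    return sxy * sxy / (sxx * syy) if sxx and syy else 0.0
```

Calmar then follows directly as CAGR divided by `max_drawdown` of the equity series.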
The equity curve's shape matters as much as the metrics. Extended flat periods with no new equity highs are worse than drawdowns of the same magnitude: they suggest the edge has decayed, not merely that the strategy is in a temporary losing streak. An equity curve that makes new highs steadily with controlled drawdowns is more tradeable than one with higher peak returns but erratic progress.
What the Framework Won’t Solve
Even a rigorous backtesting framework cannot eliminate regime risk: the risk that the market structure generating your historical edge no longer exists. HFT strategies from 2010 are not viable in 2024. Equity momentum strategies from the 2000s faced structural headwinds in the 2010s.
The correct response is not to extend the backtest further back — historical regimes are also not the current regime. The correct response is to understand why the strategy works: what market inefficiency or structural feature it is exploiting, and whether that feature is likely to persist. Understanding the mechanism is the only defense against regime change that no backtest can provide.
A Sharpe of 4 in backtest with a clear, persistent mechanism beats a Sharpe of 10 in backtest with no theoretical basis. The former you can trust when conditions change. The latter will surprise you.
A Note on Complexity
The most common mistake in backtesting framework design is adding complexity to solve problems that disciplined simplicity would solve first. Before building a sophisticated slippage model, verify there is no look-ahead bias. Before implementing Monte Carlo, verify the walk-forward efficiency is acceptable. Before optimizing parameters, verify the strategy has a clear, articulable mechanism.
The framework described here is not a checklist to implement all at once — it is a hierarchy of validity checks. Move to the next level only when the current level passes.