Building a Systematic Options Backtesting Framework

Overview

At Mastertrust, I led the development of the firm’s quantitative research infrastructure from scratch. The core piece was a systematic backtesting framework that could honestly evaluate options strategies — incorporating execution costs, slippage, capital constraints, and regime sensitivity — not just raw P&L on historical fills. The result: a Sharpe ratio of 4 on live index options strategies managing portfolios exceeding ₹100 crore.

The Problem

When I joined, strategy evaluation was informal. A strategy “worked” if the last few months looked good. There was no walk-forward validation, no overfitting score, no slippage model, no capital efficiency metric. The feedback loop between research and live performance was broken — strategies were approved on the basis of in-sample patterns that had no out-of-sample predictive power.

Why It Mattered

In options trading, the cost of an overfitted strategy is immediate and measurable. A strategy that looks great on paper but loses money live doesn’t just cost P&L — it costs confidence in the research process, erodes capital, and creates pressure to keep adjusting until you’ve completely destroyed the original edge. The framework needed to make the overfitting problem visible before capital was at risk.

Data & Inputs

Multi-terabyte options market data: full order book, tick-by-tick option chains for Nifty and BankNifty
Implied volatility surfaces computed from real-time and historical option prices
Open interest data for sentiment and positioning signals
Transaction cost data: brokerage, exchange fees, securities transaction tax (STT), and stamp duty, each modeled at its exact rate
Historical regime data: VIX levels, realized volatility, event calendars

Approach

The framework was built around three core principles:

Walk-forward only. Every strategy was evaluated using expanding-window or rolling-window walk-forward splits, never in-sample on the full history. This made the overfitting problem structural rather than a discipline issue.
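A minimal sketch of how rolling and expanding splits could be generated, not the production code. The function name `walk_forward_splits` and its parameters are illustrative assumptions; the only property it is meant to show is that every training window ends strictly before its test window begins.

```python
from typing import Iterator, Tuple

import numpy as np


def walk_forward_splits(
    n_obs: int,
    train_size: int,
    test_size: int,
    expanding: bool = False,
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (train_idx, test_idx) index arrays in chronological order.

    Rolling mode keeps a fixed-length training window; expanding mode
    always starts the training window at observation 0.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")

    test_start = train_size
    while test_start + test_size <= n_obs:
        train_start = 0 if expanding else test_start - train_size
        train_idx = np.arange(train_start, test_start)
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx
        # Step forward by one test block so test windows never overlap.
        test_start += test_size


if __name__ == "__main__":
    for train_idx, test_idx in walk_forward_splits(10, 4, 2):
        print(train_idx.tolist(), "->", test_idx.tolist())
```

Stepping by `test_size` keeps the out-of-sample blocks disjoint, so their returns can be concatenated into a single out-of-sample track record.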

Parameter stability scoring. Strategies were scored not just on peak Sharpe but on how sensitive that Sharpe was to small parameter perturbations. A strategy that requires precise parameter values is fragile; a strategy that holds up across a range of parameters is far more likely to have a real edge.
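One way such a score could be defined, shown here as an assumption rather than the metric actually used: take the grid of Sharpe ratios over parameter combinations, find the peak, and compare it with the average Sharpe of its immediate neighbours. The name `stability_score` and the `radius` parameter are hypothetical.

```python
import numpy as np


def stability_score(sharpe_grid: np.ndarray, radius: int = 1) -> float:
    """Return mean neighbour Sharpe divided by peak Sharpe, clipped to [0, 1].

    A value near 1 means the peak sits on a plateau; a value near 0 means
    the peak is an isolated spike that vanishes under small perturbations.
    """
    grid = np.asarray(sharpe_grid, dtype=float)
    peak_idx = np.unravel_index(np.argmax(grid), grid.shape)
    peak = grid[peak_idx]
    if peak <= 0:
        return 0.0  # no profitable configuration, nothing to be stable around

    # Slice a window of +/- radius around the peak, truncated at grid edges.
    window = tuple(slice(max(i - radius, 0), i + radius + 1) for i in peak_idx)
    neighbourhood = grid[window]
    if neighbourhood.size == 1:
        return 0.0  # a single-cell grid gives no information on stability

    # Average over the neighbours only, excluding the peak itself.
    neighbour_mean = (neighbourhood.sum() - peak) / (neighbourhood.size - 1)
    return float(np.clip(neighbour_mean / peak, 0.0, 1.0))


if __name__ == "__main__":
    plateau = np.full((3, 3), 2.0)
    spike = np.zeros((3, 3))
    spike[1, 1] = 3.0
    print(stability_score(plateau), stability_score(spike))
```

Ranking candidates by peak Sharpe multiplied by this score would penalize the spike and leave the plateau untouched, which is the behaviour the principle describes.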

Execution-realistic simulation. Fills were simulated with bid-ask spread costs, market impact, latency delays, and position sizing constraints. The difference between theoretical P&L and realistic P&L was tracked explicitly.
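The gap between theoretical and realistic P&L can be illustrated with a deliberately simple cost model. This is a sketch under stated assumptions: `CostModel`, its default rates, and the linear impact term are hypothetical, and latency and position limits are left out for brevity.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CostModel:
    fee_rate: float = 0.0005        # hypothetical fees and taxes per unit of notional
    impact_per_unit: float = 0.01   # hypothetical linear impact, price units per unit traded


def fill_price(bid: float, ask: float, qty: float, side: int, cost: CostModel) -> float:
    """Price paid (side=+1, buy) or received (side=-1, sell) for qty units."""
    if side not in (1, -1):
        raise ValueError("side must be +1 (buy) or -1 (sell)")
    if ask < bid or qty <= 0:
        raise ValueError("need ask >= bid and qty > 0")
    touch = ask if side == 1 else bid  # cross the spread
    return touch + side * cost.impact_per_unit * qty  # impact moves price against us


def round_trip_pnl(
    entry_quote: Tuple[float, float],
    exit_quote: Tuple[float, float],
    qty: float,
    side: int,
    cost: CostModel,
) -> Tuple[float, float]:
    """Return (theoretical, realistic) P&L for one round trip.

    Quotes are (bid, ask). Theoretical P&L trades at mid with no costs.
    """
    entry_mid = (entry_quote[0] + entry_quote[1]) / 2
    exit_mid = (exit_quote[0] + exit_quote[1]) / 2
    theoretical = side * (exit_mid - entry_mid) * qty

    entry_px = fill_price(*entry_quote, qty, side, cost)
    exit_px = fill_price(*exit_quote, qty, -side, cost)
    fees = cost.fee_rate * (entry_px + exit_px) * qty
    realistic = side * (exit_px - entry_px) * qty - fees
    return theoretical, realistic


if __name__ == "__main__":
    theo, real = round_trip_pnl((100.0, 101.0), (110.0, 111.0), 10, 1, CostModel())
    print(f"theoretical={theo:.3f} realistic={real:.3f} gap={theo - real:.3f}")
```

Returning both numbers from the same function keeps the theoretical-versus-realistic gap a first-class output rather than something reconstructed after the fact.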

I deliberately rejected black-box optimization — every signal and parameter had a qualitative reason for existing before it was tested quantitatively.

Engineering & Implementation

The core architecture:

Data layer: PostgreSQL for clean historical OHLCV and options chain data, Redis for hot data during simulation runs
Signal engine: modular signal library — each signal a pure function with documented edge hypothesis
Backtesting engine: vectorized NumPy simulation for speed, with explicit transaction cost application at each fill
Walk-forward engine: rolling and expanding window implementations with configurable train/test ratios
Overfitting score: custom metric measuring Sharpe stability across parameter grid — penalizes parameter sensitivity
Regime tagger: classifies each day into high/low/transitional volatility regime using IV surface features
Risk engine: per-strategy max drawdown limits, correlation-adjusted position sizing, capital allocation across books
Monitoring: Grafana dashboards showing live vs. backtest P&L, drawdown, regime distribution, signal contribution
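To make the regime tagger concrete, here is a minimal rule-based sketch. It assumes a single feature, daily at-the-money implied volatility, and two hypothetical thresholds; the real tagger used richer IV surface features, so treat `tag_regimes` and its cut-offs as placeholders.

```python
import numpy as np

# Index 0, 1, 2 map to the three regimes named in the architecture list.
REGIME_LABELS = np.array(["low", "transitional", "high"])


def tag_regimes(atm_iv, low: float = 0.15, high: float = 0.25) -> np.ndarray:
    """Label each day by comparing ATM implied vol with two thresholds.

    Values strictly below `low` are "low", strictly above `high` are "high",
    and everything in between (inclusive) is "transitional".
    """
    if low >= high:
        raise ValueError("low threshold must be below high threshold")
    iv = np.asarray(atm_iv, dtype=float)
    codes = np.ones(iv.shape, dtype=int)  # default: transitional
    codes[iv < low] = 0
    codes[iv > high] = 2
    return REGIME_LABELS[codes]


if __name__ == "__main__":
    print(tag_regimes([0.12, 0.18, 0.31]).tolist())
```

Because the output is a plain label array aligned with the date index, backtest results can be grouped by regime without the strategy code knowing how the labels were produced.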

The ML components (LSTMs, transformer-style models for IV forecasting) were integrated as signals, not as the strategy itself — the framework could evaluate any signal source.

Results & Impact

Sharpe ratio of 4 in live systematic index options strategies
Managed portfolios exceeding ₹100 crore in AUM
Walk-forward out-of-sample Sharpe consistently within 15% of in-sample estimates — validation that the framework was honest
Regime-adaptive deployment: strategies automatically de-risked during high-volatility regimes
Full team adoption: 3 quantitative researchers using the same framework infrastructure
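The in-sample versus out-of-sample comparison above can be expressed as a small check. This sketch assumes daily returns, 252 trading periods per year, and a zero risk-free rate; the helper names and the way the 15% tolerance is applied are illustrative, not the framework's actual code.

```python
import numpy as np


def annualized_sharpe(returns, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of periodic returns, risk-free rate assumed zero."""
    r = np.asarray(returns, dtype=float)
    if r.size < 2:
        return float("nan")
    sd = r.std(ddof=1)  # sample standard deviation
    if sd == 0:
        return float("nan")
    return float(np.sqrt(periods_per_year) * r.mean() / sd)


def relative_gap(in_sample: float, out_of_sample: float) -> float:
    """Absolute difference between the two Sharpe ratios, relative to in-sample."""
    if in_sample == 0:
        raise ValueError("in-sample Sharpe must be non-zero")
    return abs(in_sample - out_of_sample) / abs(in_sample)


def within_tolerance(in_sample: float, out_of_sample: float, tol: float = 0.15) -> bool:
    """True when out-of-sample Sharpe is within `tol` of the in-sample estimate."""
    return relative_gap(in_sample, out_of_sample) <= tol


if __name__ == "__main__":
    print(within_tolerance(2.0, 1.8), within_tolerance(2.0, 1.0))
```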

Limitations & What I’d Do Differently

The framework handles single-leg and spread strategies well but becomes computationally expensive for multi-leg exotic structures. If building from scratch again, I’d design the fill simulation layer to be parallel from the start — sequential simulation of complex Greeks scenarios is a bottleneck at scale.

The regime detection model is rule-based and works well in practice, but a learned regime classifier with probabilistic outputs would be more robust to novel market conditions.

Stack

Python, NumPy, Pandas, PyTorch (signal models), QuantLib (Greeks and pricing), PostgreSQL, Redis, Grafana, custom backtesting engine
