RAETH AI ◆ TRADING EVAL

No Model Can Generate Alpha

Six frontier AI models. Thirty independent trials (pass@5). A rigorous multi-stage quantitative trading benchmark with hidden out-of-sample data.

6
Frontier Models
30
Total Trials
2.784
Ground Truth Sharpe
Evaluation Pipeline
Methodology

Five cascading stages using real Indian equity index data. Models must prove precise implementation before attempting creative alpha research.

0

Data Alignment

Merge NIFTY 50 + BANKNIFTY minute OHLC (737K bars, 2015-2022)
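A minimal sketch of the Stage 0 alignment task: inner-joining two minute-bar frames on their timestamp index so only bars present in both series survive. The function name and column prefixes are my own, not the benchmark's API.

```python
import pandas as pd

def align_minute_bars(nifty: pd.DataFrame, banknifty: pd.DataFrame) -> pd.DataFrame:
    """Inner-join two minute-bar OHLC frames on their timestamp index,
    keeping only bars present in both series."""
    merged = nifty.add_prefix("nifty_").join(
        banknifty.add_prefix("bank_"), how="inner"
    )
    return merged.sort_index()
```

An inner join is the conservative choice here: a bar missing from either index (exchange halt, data gap) is dropped rather than forward-filled, which avoids manufacturing prices that never traded.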

1a

SMA Crossover

Implement specified moving average strategy with known ground truth
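The shape of Stage 1a is a classic dual-SMA crossover. A sketch under assumed conventions (the benchmark's exact windows and position rules are not given here); note the one-bar shift, which keeps the signal from trading on the same bar that produced it:

```python
import pandas as pd

def sma_crossover_positions(close: pd.Series, fast: int, slow: int) -> pd.Series:
    """+1 when the fast SMA is above the slow SMA, -1 when below,
    0 during warm-up. Shifted one bar so each signal trades on the
    following bar's data (no same-bar lookahead)."""
    fast_ma = close.rolling(fast).mean()
    slow_ma = close.rolling(slow).mean()
    signal = pd.Series(0, index=close.index)
    signal[fast_ma > slow_ma] = 1
    signal[fast_ma < slow_ma] = -1
    signal = signal.where(slow_ma.notna(), 0)  # flat until both SMAs exist
    return signal.shift(1).fillna(0).astype(int)
```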

1b

Z-Score Spread

Pairs trading on the NIFTY-BANKNIFTY spread with cooldown logic
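One plausible reading of the 1b task, sketched with assumed parameters (window, entry threshold, and cooldown length are illustrative, not the benchmark's ground truth): fade the spread when its rolling z-score stretches, exit on mean reversion, and sit out a fixed cooldown after each exit.

```python
import pandas as pd

def zscore_spread_positions(spread: pd.Series, window: int = 60,
                            entry_z: float = 2.0, cooldown: int = 10) -> pd.Series:
    """Mean-reversion on a spread: short when z > entry_z, long when
    z < -entry_z, exit when z crosses back through zero, then wait
    `cooldown` bars before re-entering."""
    mu = spread.rolling(window).mean()
    sd = spread.rolling(window).std()
    z = (spread - mu) / sd
    pos, state, cool = [], 0, 0
    for zi in z:
        if cool > 0:
            cool -= 1
            state = 0
        elif state == 0:
            if zi > entry_z:
                state = -1
            elif zi < -entry_z:
                state = 1
        elif (state == -1 and zi < 0) or (state == 1 and zi > 0):
            state = 0
            cool = cooldown  # exited: start the cooldown clock
        pos.append(state)
    return pd.Series(pos, index=spread.index)
```

The explicit loop is deliberate: the cooldown makes the position path-dependent, so a vectorized formulation is awkward and easy to get subtly wrong.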

1c

Portfolio Combination

Vol-normalized equal-weight combination of strategies
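A sketch of vol-normalized equal weighting, assuming a rolling-window vol estimate (the lookback and vol target here are placeholders): each strategy's return stream is rescaled to a common volatility using only prior bars, then the streams are averaged.

```python
import pandas as pd

def vol_normalized_combo(returns: pd.DataFrame, lookback: int = 390,
                         target_vol: float = 0.01) -> pd.Series:
    """Scale each strategy's returns (one column per strategy) to a
    common rolling volatility, then average equally. The vol estimate
    is shifted one bar so the scaling never uses the bar being scaled."""
    rolling_vol = returns.rolling(lookback).std().shift(1)
    scaled = returns * (target_vol / rolling_vol)
    return scaled.mean(axis=1)
```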

2

Alpha Research

Discover, backtest & combine hedged strategies. Scored on hidden 2023-2026 OOS data

Cascading Gate System

Early-stage failures block all downstream scoring. A Stage 0 failure yields -6.0; calibration errors in Stages 1a/1b yield -5.0 to -4.0, and in Stage 1c -4.0 to -3.0. Only models that pass every gate are scored on Stage 2 alpha.
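The cascade can be summarized as a short-circuiting reward function. This is an illustrative mapping consistent with the stated bands, not the benchmark's exact formulas (in particular, the error-to-reward scaling inside each band is an assumption):

```python
from typing import Optional

def cascade_reward(stage0_ok: bool, stage1_err: Optional[float],
                   gates_ok: bool, oos_sharpe: Optional[float]) -> float:
    """Map the cascading gates to a scalar reward. Early failures
    short-circuit: later stages are never scored."""
    if not stage0_ok:
        return -6.0                          # data alignment wrong
    if stage1_err is not None:
        return max(-5.0, -4.0 - stage1_err)  # continuous band for Stage 1 errors
    if not gates_ok:
        return -3.0                          # lookahead / exposure / docs gate
    return max(-3.0, min(3.0, oos_sharpe))   # Stage 2: OOS Sharpe, clipped
```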

Hidden Out-of-Sample Test Data

Stage 2 strategies are scored on hidden 2023-2026 market data that the model never sees. Runtime lookahead detection via truncation tests prevents data snooping.
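The truncation test admits a simple sketch (the harness's actual implementation is not shown here): run the strategy on the full history and on a prefix, and require the signals over the shared prefix to be identical. A strategy that peeks at future bars will change its past signals when the future is cut off.

```python
import pandas as pd

def truncation_lookahead_check(signal_fn, data: pd.DataFrame, cut: int) -> bool:
    """Compare signals over the shared prefix of full vs. truncated data.
    Any difference means the strategy used bars beyond `cut` (lookahead).
    Returns True when the check passes."""
    full = signal_fn(data).iloc[:cut]
    truncated = signal_fn(data.iloc[:cut])
    return full.reset_index(drop=True).equals(truncated.reset_index(drop=True))
```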

80% Net Exposure Constraint

All positions must satisfy |net_notional| / gross_notional ≤ 0.80 at every bar. This forces genuinely hedged trading and rules out the pure directional bets that models otherwise default to.
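The constraint is cheap to verify per bar. A sketch, assuming `positions` holds signed notional per instrument per bar (the function name and layout are mine):

```python
import pandas as pd

def net_exposure_ok(positions: pd.DataFrame, limit: float = 0.80) -> bool:
    """Check |net notional| / gross notional <= limit at every bar.
    Rows are bars, columns are instruments, values are signed notional."""
    net = positions.sum(axis=1).abs()
    gross = positions.abs().sum(axis=1)
    ratio = (net / gross).fillna(0.0)  # flat bars (gross == 0) pass trivially
    return bool((ratio <= limit).all())
```

A long-only or short-only book has ratio 1.0 and fails; a long/short pair with roughly balanced legs passes comfortably.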

Scoring System
Reward Structure

A cascading gate system where early failures block downstream scoring. Reward ranges from -6.0 (catastrophic) to +3.0 (exceptional alpha).

Reward Scale: gated cascade from -6.0 to +3.0, in segments for Stage 0, Stage 1a/1b, Stage 1c, and Stage 2 (out-of-sample Sharpe); ★ marks the ground-truth +2.784.
Reward          Outcome                 Meaning
-6.0            Stage 0 fail            Data alignment incorrect
-5.0 to -4.0    Stage 1a/1b fail        Strategy implementation errors (continuous)
-4.0 to -3.0    Stage 1c fail           Portfolio combination errors (continuous)
-3.0            Solution gates fail     Lookahead, exposure violation, or missing docs
-3.0 to +3.0    Stage 2: OOS Sharpe     Annualized Sharpe on hidden 2023-2026 test data
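The Stage 2 score is an annualized Sharpe ratio. A minimal sketch of computing it from minute-bar returns; the bar count assumes roughly 375 trading minutes per Indian equity session and 252 sessions per year, which is my assumption rather than a stated benchmark parameter:

```python
import numpy as np

def annualized_sharpe(minute_returns: np.ndarray,
                      bars_per_year: int = 375 * 252) -> float:
    """Annualize the per-bar Sharpe by sqrt of the number of bars per year."""
    mu = minute_returns.mean()
    sd = minute_returns.std(ddof=1)
    if sd == 0:
        return 0.0  # no variation: Sharpe undefined, report 0
    return float(mu / sd * np.sqrt(bars_per_year))
```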
Results
Model Rankings

Ranked by average reward across 5 independent trials (pass@5). Ground truth achieves Sharpe 2.784 on the same hidden test data.

Gemini 3.1 Pro's best trial reached +0.419, the only positive out-of-sample Sharpe across all 30 trials

All models scored negative on average — no model generated consistent positive alpha
#1  Claude Opus 4.6 (Anthropic)
Avg Reward: -2.098
Best Trial: -0.328   Worst Trial: -3.000   Std Dev: 1.218   Trials: 5
All Scores: -0.33, -1.16, -3.0, -3.0, -3.0

#2  Gemini 3.1 Pro (Google)
Avg Reward: -2.724
Best Trial: +0.419   Worst Trial: -4.041   Std Dev: 1.671   Trials: 5
All Scores: +0.42, -3.0, -3.0, -4.0, -4.04

#3  GPT-5.4 (OpenAI)
Avg Reward: -3.000
Best Trial: -3.000   Worst Trial: -3.000   Std Dev: 0.000   Trials: 5
All Scores: -3.0, -3.0, -3.0, -3.0, -3.0
Human Expert Solution
+2.784
Sharpe
Complete Results
Full Leaderboard

All 6 models with complete trial statistics.

All Models — cumulative results across 5 trials
Rank Model Avg Reward Best Worst Std Dev
Analysis
Performance Analysis

Trial-level score distribution and model comparison across the reward scale.

Trial Score Distribution
Individual trial scores plotted across the reward scale (series: Trial Score, Average, Ground Truth)

Average Performance Comparison
Average reward per model with the trial distribution range (series: Trial Score, Range)
Diagnostic
Stage Gate Pass Rates

Pass rate per stage across 5 trials. Cyan = 5/5, orange = 3-4/5, red = 0-2/5.

Evaluation Gates — per model diagnostic