RAETH AI ◆ TRADING EVAL

No Model Can Generate Alpha

Six frontier AI models. Thirty independent trials (pass@5). A rigorous multi-stage quantitative trading benchmark with hidden out-of-sample data.

6
Frontier Models
30
Total Trials
2.784
Ground Truth Sharpe
Evaluation Pipeline
Methodology

Five cascading stages using real Indian equity index data. Models must prove precise implementation before attempting creative alpha research.

0

Data Alignment

Merge NIFTY 50 + BANKNIFTY minute OHLC (737K bars, 2015-2022)
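A minimal sketch of the Stage 0 alignment task: inner-joining two minute-bar frames on their timestamp index so only bars present in both series survive. The function name and column prefixes are my own, not the benchmark's API.

```python
import pandas as pd

def align_minute_bars(nifty: pd.DataFrame, banknifty: pd.DataFrame) -> pd.DataFrame:
    """Inner-join two minute-bar OHLC frames on their timestamp index,
    keeping only bars present in both series."""
    merged = nifty.add_prefix("nifty_").join(
        banknifty.add_prefix("bank_"), how="inner"
    )
    return merged.sort_index()
```

An inner join is the conservative choice here: a bar missing from either index (exchange halt, data gap) is dropped rather than forward-filled, which avoids manufacturing prices that never traded.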

1a

SMA Crossover

Implement specified moving average strategy with known ground truth
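The shape of Stage 1a is a classic dual-SMA crossover. A sketch under assumed conventions (the benchmark's exact windows and position rules are not given here); note the one-bar shift, which keeps the signal from trading on the same bar that produced it:

```python
import pandas as pd

def sma_crossover_positions(close: pd.Series, fast: int, slow: int) -> pd.Series:
    """+1 when the fast SMA is above the slow SMA, -1 when below,
    0 during warm-up. Shifted one bar so each signal trades on the
    following bar's data (no same-bar lookahead)."""
    fast_ma = close.rolling(fast).mean()
    slow_ma = close.rolling(slow).mean()
    signal = pd.Series(0, index=close.index)
    signal[fast_ma > slow_ma] = 1
    signal[fast_ma < slow_ma] = -1
    signal = signal.where(slow_ma.notna(), 0)  # flat until both SMAs exist
    return signal.shift(1).fillna(0).astype(int)
```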

1b

Z-Score Spread

Pairs trading on the NIFTY-BANKNIFTY spread with cooldown logic
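One plausible reading of the 1b task, sketched with assumed parameters (window, entry threshold, and cooldown length are illustrative, not the benchmark's ground truth): fade the spread when its rolling z-score stretches, exit on mean reversion, and sit out a fixed cooldown after each exit.

```python
import pandas as pd

def zscore_spread_positions(spread: pd.Series, window: int = 60,
                            entry_z: float = 2.0, cooldown: int = 10) -> pd.Series:
    """Mean-reversion on a spread: short when z > entry_z, long when
    z < -entry_z, exit when z crosses back through zero, then wait
    `cooldown` bars before re-entering."""
    mu = spread.rolling(window).mean()
    sd = spread.rolling(window).std()
    z = (spread - mu) / sd
    pos, state, cool = [], 0, 0
    for zi in z:
        if cool > 0:
            cool -= 1
            state = 0
        elif state == 0:
            if zi > entry_z:
                state = -1
            elif zi < -entry_z:
                state = 1
        elif (state == -1 and zi < 0) or (state == 1 and zi > 0):
            state = 0
            cool = cooldown  # exited: start the cooldown clock
        pos.append(state)
    return pd.Series(pos, index=spread.index)
```

The explicit loop is deliberate: the cooldown makes the position path-dependent, so a vectorized formulation is awkward and easy to get subtly wrong.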

1c

Portfolio Combination

Vol-normalized equal-weight combination of strategies
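A sketch of vol-normalized equal weighting, assuming a rolling-window vol estimate (the lookback and vol target here are placeholders): each strategy's return stream is rescaled to a common volatility using only prior bars, then the streams are averaged.

```python
import pandas as pd

def vol_normalized_combo(returns: pd.DataFrame, lookback: int = 390,
                         target_vol: float = 0.01) -> pd.Series:
    """Scale each strategy's returns (one column per strategy) to a
    common rolling volatility, then average equally. The vol estimate
    is shifted one bar so the scaling never uses the bar being scaled."""
    rolling_vol = returns.rolling(lookback).std().shift(1)
    scaled = returns * (target_vol / rolling_vol)
    return scaled.mean(axis=1)
```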

2

Alpha Research

Discover, backtest & combine hedged strategies. Scored on hidden 2023-2026 OOS data

Cascading Gate System

Early-stage failures block all downstream scoring. A Stage 0 failure yields -6.0; calibration errors in Stages 1a/1b yield -5.0 to -4.0, and in Stage 1c -4.0 to -3.0. Only models that pass every gate are scored on Stage 2 alpha.
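The cascade can be summarized as a short-circuiting reward function. This is an illustrative mapping consistent with the stated bands, not the benchmark's exact formulas (in particular, the error-to-reward scaling inside each band is an assumption):

```python
from typing import Optional

def cascade_reward(stage0_ok: bool, stage1_err: Optional[float],
                   gates_ok: bool, oos_sharpe: Optional[float]) -> float:
    """Map the cascading gates to a scalar reward. Early failures
    short-circuit: later stages are never scored."""
    if not stage0_ok:
        return -6.0                          # data alignment wrong
    if stage1_err is not None:
        return max(-5.0, -4.0 - stage1_err)  # continuous band for Stage 1 errors
    if not gates_ok:
        return -3.0                          # lookahead / exposure / docs gate
    return max(-3.0, min(3.0, oos_sharpe))   # Stage 2: OOS Sharpe, clipped
```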

Hidden Out-of-Sample Test Data

Stage 2 strategies are scored on hidden 2023-2026 market data that the model never sees. Runtime lookahead detection via truncation tests prevents data snooping.
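The truncation test admits a simple sketch (the harness's actual implementation is not shown here): run the strategy on the full history and on a prefix, and require the signals over the shared prefix to be identical. A strategy that peeks at future bars will change its past signals when the future is cut off.

```python
import pandas as pd

def truncation_lookahead_check(signal_fn, data: pd.DataFrame, cut: int) -> bool:
    """Compare signals over the shared prefix of full vs. truncated data.
    Any difference means the strategy used bars beyond `cut` (lookahead).
    Returns True when the check passes."""
    full = signal_fn(data).iloc[:cut]
    truncated = signal_fn(data.iloc[:cut])
    return full.reset_index(drop=True).equals(truncated.reset_index(drop=True))
```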

80% Net Exposure Constraint

All positions must satisfy |net_notional| / gross_notional ≤ 0.80 at every bar. This forces genuinely hedged trading and rules out the pure directional bets that models otherwise default to.
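The constraint is cheap to verify per bar. A sketch, assuming `positions` holds signed notional per instrument per bar (the function name and layout are mine):

```python
import pandas as pd

def net_exposure_ok(positions: pd.DataFrame, limit: float = 0.80) -> bool:
    """Check |net notional| / gross notional <= limit at every bar.
    Rows are bars, columns are instruments, values are signed notional."""
    net = positions.sum(axis=1).abs()
    gross = positions.abs().sum(axis=1)
    ratio = (net / gross).fillna(0.0)  # flat bars (gross == 0) pass trivially
    return bool((ratio <= limit).all())
```

A long-only or short-only book has ratio 1.0 and fails; a long/short pair with roughly balanced legs passes comfortably.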

Scoring System
Reward Structure

A cascading gate system where early failures block downstream scoring. Reward ranges from -6.0 (catastrophic) to +3.0 (exceptional alpha).

Reward Scale: gated cascade from -6.0 to +3.0, in segments for Stage 0, Stage 1a/1b, Stage 1c, and Stage 2 (out-of-sample Sharpe); ★ marks the ground-truth +2.784.
Reward          Outcome                 Meaning
-6.0            Stage 0 fail            Data alignment incorrect
-5.0 to -4.0    Stage 1a/1b fail        Strategy implementation errors (continuous)
-4.0 to -3.0    Stage 1c fail           Portfolio combination errors (continuous)
-3.0            Solution gates fail     Lookahead, exposure violation, or missing docs
-3.0 to +3.0    Stage 2: OOS Sharpe     Annualized Sharpe on hidden 2023-2026 test data
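The Stage 2 score is an annualized Sharpe ratio. A minimal sketch of computing it from minute-bar returns; the bar count assumes roughly 375 trading minutes per Indian equity session and 252 sessions per year, which is my assumption rather than a stated benchmark parameter:

```python
import numpy as np

def annualized_sharpe(minute_returns: np.ndarray,
                      bars_per_year: int = 375 * 252) -> float:
    """Annualize the per-bar Sharpe by sqrt of the number of bars per year."""
    mu = minute_returns.mean()
    sd = minute_returns.std(ddof=1)
    if sd == 0:
        return 0.0  # no variation: Sharpe undefined, report 0
    return float(mu / sd * np.sqrt(bars_per_year))
```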
Results
Model Rankings

Ranked by average reward across 5 independent trials (pass@5). Ground truth achieves Sharpe 2.784 on the same hidden test data.

Gemini 3.1 Pro's best trial reached +0.419, the only positive out-of-sample Sharpe across all 30 trials

All models scored negative on average — no model generated consistent positive alpha
#1  Claude Opus 4.6 (Anthropic)
Avg Reward: -2.098
Best Trial: -0.328   Worst Trial: -3.000   Std Dev: 1.218   Trials: 5
All Scores: -0.33, -1.16, -3.0, -3.0, -3.0

#2  Gemini 3.1 Pro (Google)
Avg Reward: -2.724
Best Trial: +0.419   Worst Trial: -4.041   Std Dev: 1.671   Trials: 5
All Scores: +0.42, -3.0, -3.0, -4.0, -4.04

#3  GPT-5.4 (OpenAI)
Avg Reward: -3.000
Best Trial: -3.000   Worst Trial: -3.000   Std Dev: 0.000   Trials: 5
All Scores: -3.0, -3.0, -3.0, -3.0, -3.0
Human Expert Solution
+2.784
Sharpe
Complete Results
Full Leaderboard

All 6 models with complete trial statistics.

All Models — cumulative results across 5 trials
Rank Model Avg Reward Best Worst Std Dev
Analysis
Performance Analysis

Trial-level score distribution and model comparison across the reward scale.

Trial Score Distribution
Individual trial scores plotted across the reward scale (series: Trial Score, Average, Ground Truth)

Average Performance Comparison
Average reward per model with the trial distribution range (series: Trial Score, Range)
Diagnostic
Stage Gate Pass Rates

Pass rate per stage across 5 trials. Cyan = 5/5, orange = 3-4/5, red = 0-2/5.

Evaluation Gates — per model diagnostic