HOW WE TEST AI MODELS
Complete Technical Methodology - Statistically Rigorous, Execution-Based, Continuous Monitoring
1. THE 4 BENCHMARK SUITES
⚡ HOURLY SUITE
Frequency: Every 4 hours
Tasks: 147 coding challenges
Trials: 5 per task
Scoring: 9-axis evaluation
Purpose: Fast performance tracking
🧠 DEEP REASONING
Frequency: Daily at 3 AM
Tasks: Multi-turn dialogues
Scoring: 13-axis evaluation
Purpose: Complex reasoning tests
🔧 TOOL CALLING
Frequency: Daily at 4 AM
Execution: Real Docker sandboxes
Scoring: 7-axis evaluation
Purpose: Agent capability tests
🐦 CANARY SUITE
Frequency: Every hour
Tasks: 12 fast tests
Purpose: Rapid drift detection
Response Time: <5 minutes
Total Annual Output:
• 500,000+ benchmark runs
• 2,500,000+ individual test executions
• 100,000+ tool-calling sessions
• 10,000+ drift incidents documented
2. 9-AXIS SCORING METHODOLOGY
Each task is evaluated across 9 dimensions. Weights optimized for production relevance:
CORRECTNESS (40%): Does the code work? Do all tests pass?
COMPLEXITY (20%): Does it handle algorithmic complexity?
CODE QUALITY (15%): Is the code clean and maintainable?
STABILITY (10%): Are edge cases handled without crashes?
EFFICIENCY (5%): Is the complexity optimal?
EDGE CASES (3%): Null, empty, and boundary inputs?
DEBUGGING (3%): Can it fix broken code?
FORMAT (2%): Clean output that follows the spec?
SAFETY (2%): No dangerous operations?
Formula: FinalScore = Σ (axis_score × axis_weight)
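The formula can be sketched in a few lines of Python. The axis key names and the 0-100 per-axis scale are illustrative assumptions; the weights are the ones listed above.

```python
# Axis weights from the table above (they sum to 1.0).
WEIGHTS = {
    "correctness": 0.40, "complexity": 0.20, "code_quality": 0.15,
    "stability": 0.10, "efficiency": 0.05, "edge_cases": 0.03,
    "debugging": 0.03, "format": 0.02, "safety": 0.02,
}

def final_score(axis_scores):
    """FinalScore = sum(axis_score * axis_weight) over all 9 axes."""
    return sum(axis_scores[axis] * w for axis, w in WEIGHTS.items())

# A hypothetical run: perfect correctness, weak efficiency, 90 elsewhere.
run = {axis: 90.0 for axis in WEIGHTS}
run["correctness"] = 100.0
run["efficiency"] = 60.0
print(round(final_score(run), 1))  # 92.5
```

Because correctness carries 40% of the weight, a 10-point correctness gain moves the final score four times as much as the same gain on stability.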
3. STATISTICAL RIGOR (95% CONFIDENCE INTERVALS)
Unlike benchmarks that report a single measurement, we publish 95% confidence intervals to quantify uncertainty.
WHY 5 TRIALS?
• AI models are stochastic (same prompt → different outputs)
• Single measurements are unreliable
• 5 trials strike a practical balance between cost and statistical power
• Provides 95% confidence intervals using t-distribution
EXAMPLE CALCULATION:
claude-opus-4-5-20251101 on binary_search:
Trial 1: 92 | Trial 2: 94 | Trial 3: 90 | Trial 4: 93 | Trial 5: 91

Mean = 92.0
Std Dev = 1.58
Std Error = 1.58 / sqrt(5) = 0.71
t-value = 2.776 (df=4, 95% CI)
Margin = 2.776 × 0.71 = 1.97

Final: 92.0 ± 2.0
95% CI: [90.0, 94.0]
Translation: "We're 95% confident that claude-opus-4-5's true performance lies between 90 and 94"
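The worked example above can be reproduced with the Python standard library. The t-value 2.776 (df = 4, two-sided 95%) is hard-coded here to avoid a SciPy dependency:

```python
import math
from statistics import mean, stdev

trials = [92, 94, 90, 93, 91]                 # 5 trials on binary_search
m = mean(trials)                              # 92.0
se = stdev(trials) / math.sqrt(len(trials))   # 1.58 / sqrt(5) ≈ 0.71
margin = 2.776 * se                           # t(df=4) × std error ≈ 1.96
print(f"{m:.1f} ± {margin:.1f}")              # 92.0 ± 2.0
print(f"95% CI: [{m - margin:.1f}, {m + margin:.1f}]")  # [90.0, 94.0]
```

Note that the standard error shrinks with the square root of the trial count, so doubling trials only tightens the interval by about 30%.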
4. DRIFT DETECTION (CUSUM ALGORITHM)
Detects sustained performance changes, not daily noise.
CUSUM ALGORITHM:
For each new score:
1. Compare it to the baseline (historical average)
2. Calculate the deviation: δ = baseline - new_score (positive when performance drops)
3. Update the CUSUM statistic: S = max(0, S + δ - k)
4. If S > λ: ALERT (sustained drift detected)

Parameters:
• Baseline window: 12 runs
• Sensitivity (k): 0.005
• Threshold (λ): 0.5
• False positive rate: <2%
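A minimal decline-only CUSUM sketch of the loop above, using the listed parameters (k = 0.005, λ = 0.5). Scores are assumed normalized to [0, 1] so the parameters sit on the same scale; that normalization is my assumption, not stated by the methodology.

```python
def cusum_drift(scores, baseline, k=0.005, threshold=0.5):
    """One-sided CUSUM: flags a sustained drop below the baseline.

    Sketch only -- scores assumed in [0, 1]. delta is taken as
    baseline - score, so S accumulates only when performance falls
    below baseline; the slack k absorbs small run-to-run noise.
    """
    s = 0.0
    for i, score in enumerate(scores):
        delta = baseline - score           # positive when below baseline
        s = max(0.0, s + delta - k)
        if s > threshold:
            return i                       # index where drift is flagged
    return None

# Normal run-to-run jitter around the baseline never alerts...
print(cusum_drift([0.91, 0.93, 0.92] * 10, baseline=0.92))  # None
# ...but a sustained 6-point drop accumulates past the threshold.
print(cusum_drift([0.92] * 5 + [0.86] * 12, baseline=0.92))  # 14
```

The key property is that CUSUM integrates small deviations over time, so a modest but persistent decline triggers an alert while a single noisy run does not.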
ALERT SEVERITY LEVELS:
🟢 NORMAL - Performance within expected variance
🟡 WARNING - Slight decline, monitoring closely
🟠 DEGRADATION - Sustained decline confirmed
🔴 CRITICAL - Major drop, immediate attention needed
5. ENHANCED TESTING (NEW IN 2026)
Zero-cost enhancements that extract 10x more value from existing tests:
🛡️ ADVERSARIAL SAFETY
18 attack types: jailbreak, injection, extraction
120,000+ tests/year
Vulnerability profiling
🎯 PROMPT ROBUSTNESS
11 variation types: paraphrase, restructure
180,000+ tests/year
Consistency measurement
⚖️ BIAS DETECTION
18 demographic variants tested
60,000+ tests/year
EU AI Act compliance
📊 VERSION TRACKING
Extracts from response headers
Regression root cause analysis
Complete version genealogy
6. VALIDATION & TRANSPARENCY
✅ OPEN SOURCE
Full code on GitHub
Fully auditable methodology
Run locally to verify
✅ INDEPENDENT
Zero vendor funding
No affiliate revenue
100% unbiased
✅ VERIFIABLE
"Test Your Keys" feature
Reproduce our results
Compare independently
✅ PEER REVIEWED
Academic validation
Community audited
500+ GitHub stars
🔑 TEST YOUR KEYS
Run benchmarks with your own API keys and verify our numbers for yourself
TEST NOW →
🤖 CURRENT MODELS TESTED (21 ACTIVE)
claude-3-7-sonnet-20250219
claude-sonnet-4-5-20250929
claude-opus-4-5-20251101
gpt-5.2
gpt-5.1
gpt-5.1-codex
deepseek-chat
deepseek-reasoner
gemini-2.5-flash
gemini-3-pro-preview
grok-4-0709
grok-4-latest
kimi-latest
kimi-k2-turbo-preview
glm-4.6
...and 6 more
Scores update every 4 hours. Rankings shift based on continuous performance monitoring.
💡 WHY THIS METHODOLOGY MATTERS
❌ TRADITIONAL BENCHMARKS:
• Single measurements (unreliable)
• No confidence intervals
• Point-in-time snapshots
• Vendor-sponsored (biased)
• No safety testing
• No bias evaluation
• Opaque methodology
✅ OUR APPROACH:
• 5 trials per task (statistical power)
• 95% confidence intervals
• 2+ years continuous monitoring
• 100% independent funding
• 120K+ safety tests/year
• 60K+ bias tests/year
• Fully open source
Result: Data you can bet your business on.
📡 PUBLIC API ACCESS
GET /api/dashboard
Current rankings with confidence intervals
Rate Limit: 300 requests/minute
GET /api/dashboard?period=7d
Historical time-series data (7 days)
Rate Limit: 300 requests/minute
GET /api/models/:id
Detailed model breakdown by task
Rate Limit: 180 requests/minute
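A sketch of consuming the dashboard endpoint. The JSON field names below (`models`, `id`, `score`, `ci`) are illustrative assumptions, not the documented schema:

```python
import json

# Hypothetical /api/dashboard response body (schema assumed).
payload = json.loads("""
{
  "models": [
    {"id": "gpt-5.2", "score": 91.4, "ci": [89.1, 93.7]},
    {"id": "claude-opus-4-5-20251101", "score": 92.0, "ci": [90.0, 94.0]}
  ]
}
""")

# Rank by point score; overlapping CIs mean the gap may not be meaningful.
ranked = sorted(payload["models"], key=lambda m: m["score"], reverse=True)
for m in ranked:
    lo, hi = m["ci"]
    print(f'{m["id"]}: {m["score"]} (95% CI {lo}-{hi})')
```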
🛡️ RATE LIMITING & PROTECTION
All public APIs protected with automatic rate limiting:
• Prevents abuse and ensures fair access
• Per-IP tracking with sliding window
• Returns 429 status code when exceeded
• Retry-After header indicates wait time
• Internal/localhost requests excluded
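A client that hits these limits should honor the Retry-After header rather than hammering the endpoint. A minimal backoff loop, with `fetch` as a stand-in for a real HTTP call (e.g. urllib or requests against /api/dashboard):

```python
import time

def get_with_retry(fetch, max_attempts=3):
    """Retry on HTTP 429, sleeping for the server-supplied Retry-After."""
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        if status != 429:
            return body
        # Retry-After tells the client how many seconds to wait.
        time.sleep(float(headers.get("Retry-After", 1)))
    raise RuntimeError("still rate limited after retries")

# Stub transport: first call is rate-limited, second succeeds.
responses = iter([(429, {"Retry-After": "0"}, None),
                  (200, {}, '{"ok": true}')])
print(get_with_retry(lambda: next(responses)))  # {"ok": true}
```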
Enterprise API: Higher limits (10,000+ requests/day) available via licensed access → Learn More
⚔️ vs. OTHER BENCHMARKS
vs. HumanEval
Them: Single-shot, pass/fail
Us: 5 trials, nuanced scoring, CI
vs. MMLU
Them: Multiple choice
Us: Real code execution
vs. Chatbot Arena
Them: Human voting
Us: Objective execution
vs. Vendor Benchmarks
Them: Marketing-optimized
Us: Independent, unbiased
🚀 EXPLORE THE RANKINGS
See how 21 AI models perform across 500,000+ benchmark runs
Updated every 4 hours with statistical confidence intervals
VIEW RANKINGS →