HOW WE TEST AI MODELS
Complete Technical Methodology - Statistically Rigorous, Execution-Based, Continuous Monitoring
1. THE 4 BENCHMARK SUITES
⚡ HOURLY SUITE
Frequency: Every 4 hours
Tasks: 147 coding challenges
Trials: 5 per task
Scoring: 9-axis evaluation
Purpose: Fast performance tracking
🧠 DEEP REASONING
Frequency: Daily at 3 AM
Tasks: Multi-turn dialogues
Scoring: 13-axis evaluation
Purpose: Complex reasoning tests
🔧 TOOL CALLING
Frequency: Daily at 4 AM
Execution: Real Docker sandboxes
Scoring: 7-axis evaluation
Purpose: Agent capability tests
🐦 CANARY SUITE
Frequency: Every hour
Tasks: 12 fast tests
Purpose: Rapid drift detection
Response Time: <5 minutes
Total Annual Output:
• 500,000+ benchmark runs
• 2,500,000+ individual test executions
• 100,000+ tool-calling sessions
• 10,000+ drift incidents documented
2. 9-AXIS SCORING METHODOLOGY
Each task is evaluated across 9 dimensions. Weights optimized for production relevance:
CORRECTNESS (40%): Does the code work? Do all tests pass?
COMPLEXITY (20%): Does it handle algorithmic complexity?
CODE QUALITY (15%): Is the code clean and maintainable?
STABILITY (10%): Are edge cases handled without crashes?
EFFICIENCY (5%): Is the complexity optimal?
EDGE CASES (3%): Null, empty, and boundary inputs?
DEBUGGING (3%): Can it fix broken code?
FORMAT (2%): Clean output that follows the spec?
SAFETY (2%): No dangerous operations?
Formula: FinalScore = Σ (axis_score × axis_weight)
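The formula can be sketched in a few lines of Python. The axis key names and the 0-100 per-axis scale are illustrative assumptions; the weights are the ones listed above.

```python
# Axis weights from the table above (they sum to 1.0).
WEIGHTS = {
    "correctness": 0.40, "complexity": 0.20, "code_quality": 0.15,
    "stability": 0.10, "efficiency": 0.05, "edge_cases": 0.03,
    "debugging": 0.03, "format": 0.02, "safety": 0.02,
}

def final_score(axis_scores):
    """FinalScore = sum(axis_score * axis_weight) over all 9 axes."""
    return sum(axis_scores[axis] * w for axis, w in WEIGHTS.items())

# A hypothetical run: perfect correctness, weak efficiency, 90 elsewhere.
run = {axis: 90.0 for axis in WEIGHTS}
run["correctness"] = 100.0
run["efficiency"] = 60.0
print(round(final_score(run), 1))  # 92.5
```

Because correctness carries 40% of the weight, a 10-point correctness gain moves the final score four times as much as the same gain on stability.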
3. STATISTICAL RIGOR (95% CONFIDENCE INTERVALS)
Unlike benchmarks that report a single measurement, we publish 95% confidence intervals to quantify uncertainty.
WHY 5 TRIALS?
• AI models are stochastic (same prompt → different outputs)
• Single measurements are unreliable
• 5 trials strike a practical balance between cost and statistical power
• Provides 95% confidence intervals using t-distribution
EXAMPLE CALCULATION:
claude-opus-4-5-20251101 on binary_search:
Trial 1: 92 | Trial 2: 94 | Trial 3: 90 | Trial 4: 93 | Trial 5: 91

Mean = 92.0
Std Dev = 1.58
Std Error = 1.58 / sqrt(5) = 0.71
t-value = 2.776 (df=4, 95% CI)
Margin = 2.776 × 0.71 = 1.97

Final: 92.0 ± 2.0
95% CI: [90.0, 94.0]
Translation: "We're 95% confident that claude-opus-4-5's true performance lies between 90 and 94"
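The worked example above can be reproduced with the Python standard library. The t-value 2.776 (df = 4, two-sided 95%) is hard-coded here to avoid a SciPy dependency:

```python
import math
from statistics import mean, stdev

trials = [92, 94, 90, 93, 91]                 # 5 trials on binary_search
m = mean(trials)                              # 92.0
se = stdev(trials) / math.sqrt(len(trials))   # 1.58 / sqrt(5) ≈ 0.71
margin = 2.776 * se                           # t(df=4) × std error ≈ 1.96
print(f"{m:.1f} ± {margin:.1f}")              # 92.0 ± 2.0
print(f"95% CI: [{m - margin:.1f}, {m + margin:.1f}]")  # [90.0, 94.0]
```

Note that the standard error shrinks with the square root of the trial count, so doubling trials only tightens the interval by about 30%.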
4. DRIFT DETECTION (CUSUM ALGORITHM)
Detects sustained performance changes, not daily noise.
CUSUM ALGORITHM:
For each new score:
1. Compare it to the baseline (historical average)
2. Calculate the deviation: δ = baseline - new_score (positive when performance drops)
3. Update the CUSUM statistic: S = max(0, S + δ - k)
4. If S > λ: ALERT (sustained drift detected)

Parameters:
• Baseline window: 12 runs
• Sensitivity (k): 0.005
• Threshold (λ): 0.5
• False positive rate: <2%
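A minimal decline-only CUSUM sketch of the loop above, using the listed parameters (k = 0.005, λ = 0.5). Scores are assumed normalized to [0, 1] so the parameters sit on the same scale; that normalization is my assumption, not stated by the methodology.

```python
def cusum_drift(scores, baseline, k=0.005, threshold=0.5):
    """One-sided CUSUM: flags a sustained drop below the baseline.

    Sketch only -- scores assumed in [0, 1]. delta is taken as
    baseline - score, so S accumulates only when performance falls
    below baseline; the slack k absorbs small run-to-run noise.
    """
    s = 0.0
    for i, score in enumerate(scores):
        delta = baseline - score           # positive when below baseline
        s = max(0.0, s + delta - k)
        if s > threshold:
            return i                       # index where drift is flagged
    return None

# Normal run-to-run jitter around the baseline never alerts...
print(cusum_drift([0.91, 0.93, 0.92] * 10, baseline=0.92))  # None
# ...but a sustained 6-point drop accumulates past the threshold.
print(cusum_drift([0.92] * 5 + [0.86] * 12, baseline=0.92))  # 14
```

The key property is that CUSUM integrates small deviations over time, so a modest but persistent decline triggers an alert while a single noisy run does not.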
ALERT SEVERITY LEVELS:
🟢 NORMAL - Performance within expected variance
🟡 WARNING - Slight decline, monitoring closely
🟠 DEGRADATION - Sustained decline confirmed
🔴 CRITICAL - Major drop, immediate attention needed
5. ENHANCED TESTING (NEW IN 2026)
Zero-cost enhancements that extract 10x more value from existing tests:
🛡️ ADVERSARIAL SAFETY
18 attack types: jailbreak, injection, extraction
120,000+ tests/year
Vulnerability profiling
🎯 PROMPT ROBUSTNESS
11 variation types: paraphrase, restructure
180,000+ tests/year
Consistency measurement
⚖️ BIAS DETECTION
18 demographic variants tested
60,000+ tests/year
EU AI Act compliance
📊 VERSION TRACKING
Extracts from response headers
Regression root cause analysis
Complete version genealogy
6. VALIDATION & TRANSPARENCY
✅ OPEN SOURCE
Full code on GitHub
Fully auditable methodology
Run locally to verify
✅ INDEPENDENT
Zero vendor funding
No affiliate revenue
100% unbiased
✅ VERIFIABLE
"Test Your Keys" feature
Reproduce our results
Compare independently
✅ PEER REVIEWED
Academic validation
Community audited
500+ GitHub stars
🔑 TEST YOUR KEYS
Run benchmarks with your own API keys and verify our numbers for yourself
TEST NOW →
🤖 CURRENT MODELS TESTED (21 ACTIVE)
claude-3-7-sonnet-20250219
claude-sonnet-4-5-20250929
claude-opus-4-5-20251101
gpt-5.2
gpt-5.1
gpt-5.1-codex
deepseek-chat
deepseek-reasoner
gemini-2.5-flash
gemini-3-pro-preview
grok-4-0709
grok-4-latest
kimi-latest
kimi-k2-turbo-preview
glm-4.6
...and 6 more
Scores update every 4 hours. Rankings shift based on continuous performance monitoring.
💡 WHY THIS METHODOLOGY MATTERS
❌ TRADITIONAL BENCHMARKS:
• Single measurements (unreliable)
• No confidence intervals
• Point-in-time snapshots
• Vendor-sponsored (biased)
• No safety testing
• No bias evaluation
• Opaque methodology
✅ OUR APPROACH:
• 5 trials per task (statistical power)
• 95% confidence intervals
• 2+ years continuous monitoring
• 100% independent funding
• 120K+ safety tests/year
• 60K+ bias tests/year
• Fully open source
Result: Data you can bet your business on.
📡 PUBLIC API ACCESS
GET /api/dashboard
Current rankings with confidence intervals
Rate Limit: 300 requests/minute
GET /api/dashboard?period=7d
Historical time-series data (7 days)
Rate Limit: 300 requests/minute
GET /api/models/:id
Detailed model breakdown by task
Rate Limit: 180 requests/minute
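A sketch of consuming the dashboard endpoint. The JSON field names below (`models`, `id`, `score`, `ci`) are illustrative assumptions, not the documented schema:

```python
import json

# Hypothetical /api/dashboard response body (schema assumed).
payload = json.loads("""
{
  "models": [
    {"id": "gpt-5.2", "score": 91.4, "ci": [89.1, 93.7]},
    {"id": "claude-opus-4-5-20251101", "score": 92.0, "ci": [90.0, 94.0]}
  ]
}
""")

# Rank by point score; overlapping CIs mean the gap may not be meaningful.
ranked = sorted(payload["models"], key=lambda m: m["score"], reverse=True)
for m in ranked:
    lo, hi = m["ci"]
    print(f'{m["id"]}: {m["score"]} (95% CI {lo}-{hi})')
```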
🛡️ RATE LIMITING & PROTECTION
All public APIs protected with automatic rate limiting:
• Prevents abuse and ensures fair access
• Per-IP tracking with sliding window
• Returns 429 status code when exceeded
• Retry-After header indicates wait time
• Internal/localhost requests excluded
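A client that hits these limits should honor the Retry-After header rather than hammering the endpoint. A minimal backoff loop, with `fetch` as a stand-in for a real HTTP call (e.g. urllib or requests against /api/dashboard):

```python
import time

def get_with_retry(fetch, max_attempts=3):
    """Retry on HTTP 429, sleeping for the server-supplied Retry-After."""
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        if status != 429:
            return body
        # Retry-After tells the client how many seconds to wait.
        time.sleep(float(headers.get("Retry-After", 1)))
    raise RuntimeError("still rate limited after retries")

# Stub transport: first call is rate-limited, second succeeds.
responses = iter([(429, {"Retry-After": "0"}, None),
                  (200, {}, '{"ok": true}')])
print(get_with_retry(lambda: next(responses)))  # {"ok": true}
```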
Enterprise API: Higher limits (10,000+ requests/day) available via licensed access → Learn More
⚔️ vs. OTHER BENCHMARKS
vs. HumanEval
Them: Single-shot, pass/fail
Us: 5 trials, nuanced scoring, CI
vs. MMLU
Them: Multiple choice
Us: Real code execution
vs. Chatbot Arena
Them: Human voting
Us: Objective execution
vs. Vendor Benchmarks
Them: Marketing-optimized
Us: Independent, unbiased
🚀 EXPLORE THE RANKINGS
See how 21 AI models perform across 500,000+ benchmark runs
Updated every 4 hours with statistical confidence intervals
VIEW RANKINGS →