HOW WE TEST AI MODELS
Complete Technical Methodology — Statistically Rigorous, Execution-Based, Continuous Monitoring
[1] THE 4 BENCHMARK SUITES
HOURLY SUITE
Frequency: Every 4 hours
Tasks: 147 coding challenges
Trials: 5 per task
Scoring: 9-axis evaluation
Purpose: Fast performance tracking
DEEP REASONING
Frequency: Daily at 3 AM
Tasks: Multi-turn dialogues
Scoring: 13-axis evaluation
Purpose: Complex reasoning tests
TOOL CALLING
Frequency: Daily at 4 AM
Execution: Real Docker sandboxes
Scoring: 7-axis evaluation
Purpose: Agent capability tests
CANARY SUITE
Frequency: Every hour
Tasks: 12 fast tests
Purpose: Rapid drift detection
Response Time: <5 minutes
TOTAL ANNUAL OUTPUT
→ 500,000+ benchmark runs
→ 2,500,000+ individual test executions
→ 100,000+ tool-calling sessions
→ 10,000+ drift incidents documented

[2] 9-AXIS SCORING METHODOLOGY
Each task is evaluated across 9 dimensions. Weights optimized for production relevance:
CORRECTNESS (40%): Does code work? All tests pass?
COMPLEXITY (20%): Handles algorithm complexity?
CODE QUALITY (15%): Clean, maintainable code?
STABILITY (10%): Edge cases, no crashes?
EFFICIENCY (5%): Optimal complexity?
EDGE CASES (3%): Null, empty, boundaries?
DEBUGGING (3%): Can fix broken code?
FORMAT (2%): Clean output, follows spec?
SAFETY (2%): No dangerous operations?
Formula: FinalScore = Sum(axis_score x axis_weight)
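As a rough illustration of the formula, the weighted sum can be sketched in Python. The weights are copied from the list above; the axis key names, the `final_score` helper, and the 0-100 per-axis scale are assumptions made for this example, not the project's actual code.

```python
# Weights copied from the 9-axis list; the key names are illustrative.
AXIS_WEIGHTS = {
    "correctness": 0.40,
    "complexity": 0.20,
    "code_quality": 0.15,
    "stability": 0.10,
    "efficiency": 0.05,
    "edge_cases": 0.03,
    "debugging": 0.03,
    "format": 0.02,
    "safety": 0.02,
}


def final_score(axis_scores: dict[str, float]) -> float:
    """Weighted sum of per-axis scores (assumed to be on a 0-100 scale)."""
    missing = AXIS_WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    return sum(axis_scores[axis] * weight for axis, weight in AXIS_WEIGHTS.items())
```

Because the nine weights sum to 100%, a task that scores 100 on every axis gets a final score of 100, and a task that scores 100 on correctness alone gets 40.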

[3] STATISTICAL RIGOR (95% CONFIDENCE INTERVALS)
Unlike benchmarks showing single measurements, we provide confidence intervals to quantify uncertainty.
WHY 5 TRIALS?
→ AI models are stochastic (same prompt, different outputs)
→ Single measurements are unreliable
→ 5 trials = optimal balance of cost vs statistical power
→ Provides 95% confidence intervals using t-distribution
EXAMPLE CALCULATION:
claude-opus-4-5-20251101 on binary_search:
Trial 1: 92 | Trial 2: 94 | Trial 3: 90 | Trial 4: 93 | Trial 5: 91

Mean = 92.0
Std Dev = 1.58
Std Error = 1.58 / sqrt(5) = 0.71
t-value = 2.776 (df=4, 95% CI)
Margin = 2.776 x 0.71 = 1.97

Final: 92.0 +/- 2.0
95% CI: [90.0, 94.0]
Translation: "We're 95% confident claude-opus-4-5's true mean score on this task is between 90 and 94"
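The example calculation above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the `confidence_interval` helper is a name made up for this example, and the t-value is hardcoded from the figure quoted above, so the default is only valid for 5 trials.

```python
import math
import statistics

# Two-sided 95% t-value for df = 4 (i.e. 5 trials), as quoted in the example.
T_CRITICAL_DF4 = 2.776


def confidence_interval(scores, t_critical=T_CRITICAL_DF4):
    """Return (mean, margin, (low, high)) for a list of trial scores."""
    mean = statistics.mean(scores)
    std_dev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    std_error = std_dev / math.sqrt(len(scores))
    margin = t_critical * std_error
    return mean, margin, (mean - margin, mean + margin)


mean, margin, (low, high) = confidence_interval([92, 94, 90, 93, 91])
print(f"{mean:.1f} +/- {margin:.1f}  95% CI: [{low:.1f}, {high:.1f}]")
```

With a different number of trials, the t-value would have to be looked up for the matching degrees of freedom (trials minus one).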

[4] DRIFT DETECTION (CUSUM ALGORITHM)
Detects sustained performance changes, not daily noise.
CUSUM ALGORITHM:
For each new score:
1. Compare to baseline (historical average)
2. Calculate deviation: d = new_score - baseline
3. Update CUSUM: S = max(0, S + d - k)
4. If S > threshold: ALERT (drift detected)

Parameters:
→ Baseline window: 12 runs
→ Sensitivity (k): 0.005
→ Threshold (lambda): 0.5
→ False positive rate: <2%
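A minimal Python sketch of the update rule, under stated assumptions: scores are taken to be normalized to a 0-1 scale (which is what k = 0.005 and a threshold of 0.5 suggest), and the baseline is supplied by the caller, for example as the mean of the last 12 runs. The update as written in step 3 accumulates positive deviations, so this sketch also tracks a mirrored statistic to catch declines. That second statistic, along with the `CusumDetector` name, is an assumption of this example and not something the methodology above states.

```python
from dataclasses import dataclass


@dataclass
class CusumDetector:
    baseline: float          # historical average, e.g. mean of the last 12 runs
    k: float = 0.005         # sensitivity (slack), from the parameter list
    threshold: float = 0.5   # alert threshold, from the parameter list
    s_up: float = 0.0        # accumulates sustained increases
    s_down: float = 0.0      # accumulates sustained decreases (assumed mirror)

    def update(self, new_score: float) -> bool:
        """Feed one score; return True if either statistic exceeds the threshold."""
        d = new_score - self.baseline
        self.s_up = max(0.0, self.s_up + d - self.k)
        self.s_down = max(0.0, self.s_down - d - self.k)
        return self.s_up > self.threshold or self.s_down > self.threshold
```

With a baseline of 0.90, a single score of 0.80 does not raise an alert; six such scores in a row do, which is the "sustained change, not daily noise" behavior described above.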
ALERT SEVERITY LEVELS
NORMAL — Performance within expected variance
WARNING — Slight decline, monitoring closely
DEGRADATION — Sustained decline confirmed
CRITICAL — Major drop, immediate attention needed

[5] ENHANCED TESTING (NEW IN 2026)
Zero-cost enhancements that extract 10x more value from existing tests:
ADVERSARIAL SAFETY
18 attack types: jailbreak, injection, extraction
120,000+ tests/year
Vulnerability profiling
PROMPT ROBUSTNESS
11 variation types: paraphrase, restructure
180,000+ tests/year
Consistency measurement
BIAS DETECTION
18 demographic variants tested
60,000+ tests/year
EU AI Act compliance
VERSION TRACKING
Extracts from response headers
Regression root cause analysis
Complete version genealogy

[6] VALIDATION AND TRANSPARENCY
OPEN SOURCE
Full code on GitHub. Fully auditable methodology. Run locally to verify.
INDEPENDENT
Zero vendor funding. No affiliate revenue. 100% unbiased.
VERIFIABLE
"Test Your Keys" feature. Reproduce our results. Compare independently.
PEER REVIEWED
Academic validation. Community audited. 100+ combined GitHub stars across frontend and backend.
TEST YOUR KEYS
Run benchmarks with your own API keys to verify we're not making up numbers

[→] CURRENT MODELS TESTED (21 ACTIVE)
claude-3-7-sonnet-20250219
claude-sonnet-4-5-20250929
claude-opus-4-5-20251101
gpt-5.2
gpt-5.1
gpt-5.1-codex
deepseek-chat
deepseek-reasoner
gemini-2.5-flash
gemini-3-pro-preview
grok-4-0709
grok-4-latest
kimi-latest
kimi-k2-turbo-preview
glm-4.6
...and 6 more
Scores update every 4 hours. Rankings shift based on continuous performance monitoring.

[→] WHY THIS METHODOLOGY MATTERS
TRADITIONAL BENCHMARKS:
→ Single measurements (unreliable)
→ No confidence intervals
→ Point-in-time snapshots
→ Vendor-sponsored (biased)
→ No safety testing
→ No bias evaluation
→ Opaque methodology
OUR APPROACH:
→ 5 trials per task (statistical power)
→ 95% confidence intervals
→ 2+ years continuous monitoring
→ 100% independent funding
→ 120K+ safety tests/year
→ 60K+ bias tests/year
→ Fully open source
Result: Data you can bet your business on.

[→] PUBLIC API ACCESS
GET /api/dashboard
Current rankings with confidence intervals
Rate Limit: 300 requests/minute
GET /api/dashboard?period=7d
Historical time-series data (7 days)
Rate Limit: 300 requests/minute
GET /api/models/:id
Detailed model breakdown by task
Rate Limit: 180 requests/minute
RATE LIMITING AND PROTECTION
All public APIs protected with automatic rate limiting:
→ Prevents abuse and ensures fair access
→ Per-IP tracking with sliding window
→ Returns 429 status code when exceeded
→ Retry-After header indicates wait time
→ Internal/localhost requests excluded
Enterprise API: Higher limits (10,000+ requests/day) available via licensed access.
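The rate-limiting behavior described above (per-IP sliding window, 429 with Retry-After, localhost excluded) could be sketched roughly as follows. This is an illustrative in-memory version, not the service's actual implementation; the `SlidingWindowLimiter` class, its `check` method, and the injectable `clock` parameter are all hypothetical names chosen for this example.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit                # e.g. 300 for /api/dashboard
        self.window = window_seconds
        self.clock = clock                # injectable so the sketch can be tested
        self.hits = defaultdict(deque)    # ip -> timestamps of accepted requests

    def check(self, ip: str) -> tuple[int, dict[str, str]]:
        """Return (status_code, headers) for one request from `ip`."""
        if ip in ("127.0.0.1", "::1"):    # internal/localhost requests excluded
            return 200, {}
        now = self.clock()
        recent = self.hits[ip]
        # Drop timestamps that have slid out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            # Seconds until the oldest request leaves the window.
            wait = math.ceil(recent[0] + self.window - now)
            return 429, {"Retry-After": str(max(1, wait))}
        recent.append(now)
        return 200, {}
```

A client that receives 429 should wait for the number of seconds given in Retry-After before sending the next request.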

[→] VS. OTHER BENCHMARKS
vs. HumanEval
Them: Single-shot, pass/fail
Us: 5 trials, nuanced scoring, CI
vs. MMLU
Them: Multiple choice
Us: Real code execution
vs. Chatbot Arena
Them: Human voting
Us: Objective execution
vs. Vendor Benchmarks
Them: Marketing-optimized
Us: Independent, unbiased

EXPLORE THE RANKINGS
See how 21 AI models perform across 500,000+ benchmark runs
Updated every 4 hours with statistical confidence intervals
AI Stupid Level • Independent benchmarking since 2024