Frequently Asked Questions
Everything you need to know about AI model benchmarking, performance testing, and our methodology.
General
- Q: What is AI Stupid Level?
- A: AI Stupid Level is an independent benchmarking platform that monitors AI model performance over time. We run real coding tasks across multiple models to measure their capabilities objectively, detecting performance changes ("drift") that might otherwise go unnoticed. Think of us as a watchdog for AI quality.
- Q: Are AI models really getting worse over time?
- A: Sometimes, yes. Our drift detection system has identified multiple instances where AI models showed sustained performance degradation over 28-day periods. This can happen due to fine-tuning, safety updates, or infrastructure changes. However, not all models degrade—some remain stable or even improve. Our platform tracks these changes with statistical rigor so you can make informed decisions.
- Q: How is this different from other AI benchmarks?
- A: Most benchmarks (HumanEval, MMLU) show single measurements without uncertainty quantification. We run multiple trials (n=5) per model, calculate confidence intervals, and use statistical tests to distinguish real changes from noise. We also provide continuous monitoring with drift detection, not just one-time snapshots. Plus, everything is open source and independently verifiable.
- Q: Is AI Stupid Level free to use?
- A: Yes! All benchmark data, historical trends, and analysis are completely free. We also provide a public API for developers who want to integrate our data into their own tools. The platform is supported by community donations and grants, not by AI vendors.
Methodology
- Q: How do you score AI models?
- A: We use a 7-axis scoring system: Correctness (30%), Spec Adherence (20%), Code Quality (15%), Efficiency (10%), Stability (10%), Refusal Rate (10%), and Recovery (5%). Each model runs coding tasks 5 times with different random seeds. We calculate the median score and provide 95% confidence intervals using the t-distribution. Lower scores = better performance (less "stupid").
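As a concrete illustration of how those weights combine, here is a minimal Python sketch. The axis names and weights come from this answer; the function and the example per-axis scores are hypothetical, not the platform's actual code.

```python
# Illustrative sketch of a 7-axis weighted composite; weights are from the FAQ,
# the function and example scores are made up for demonstration.
WEIGHTS = {
    "correctness": 0.30,
    "spec_adherence": 0.20,
    "code_quality": 0.15,
    "efficiency": 0.10,
    "stability": 0.10,
    "refusal_rate": 0.10,
    "recovery": 0.05,
}

def composite_score(axis_scores: dict) -> float:
    """Weighted sum of per-axis scores (0-100 each); lower = better."""
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)

example = {
    "correctness": 20, "spec_adherence": 25, "code_quality": 30,
    "efficiency": 15, "stability": 10, "refusal_rate": 5, "recovery": 40,
}
print(round(composite_score(example), 2))  # -> 20.5 on these illustrative scores
```

Note how Correctness dominates: a model that gets answers right but writes ugly code still scores well, by design.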
- Q: Why do you run 5 trials instead of just 1?
- A: AI models are stochastic (probabilistic), meaning the same prompt can produce different outputs. A single measurement could be a lucky or unlucky result. Running 5 trials lets us: (1) capture natural variance, (2) calculate confidence intervals, (3) use the median to avoid outlier bias, and (4) estimate true performance more accurately. It's a balance between statistical rigor and computational cost.
- Q: What is drift detection and how does it work?
- A: Drift detection identifies sustained performance changes over time. We use the CUSUM (Cumulative Sum) algorithm, which tracks cumulative deviations from a model's baseline. Unlike simple comparisons, CUSUM distinguishes between daily noise and actual trends. Each model has calibrated thresholds based on its historical variance—noisy models get higher thresholds to avoid false alarms.
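The CUSUM idea can be sketched in a few lines of Python. This is an illustrative one-sided version for upward drift (scores rising = degrading, since lower is better); the actual slack and threshold values are calibrated per model from historical variance, so the numbers below are placeholders.

```python
def cusum_drift(scores, baseline, slack=1.0, threshold=5.0):
    """One-sided CUSUM: flag sustained upward drift from a baseline score.

    slack: allowance subtracted each step so daily noise doesn't accumulate.
    threshold: alarm level (calibrated per model in practice; placeholder here).
    Returns the index at which drift is flagged, or None if no drift.
    """
    s = 0.0
    for i, x in enumerate(scores):
        # Accumulate deviations above baseline, minus the slack allowance;
        # clamp at zero so isolated good days reset the statistic.
        s = max(0.0, s + (x - baseline) - slack)
        if s > threshold:
            return i
    return None

print(cusum_drift([25, 26, 24, 25, 26, 24], baseline=25))  # None: just noise
print(cusum_drift([25, 28, 29, 30, 31], baseline=25))      # 3: sustained drift
```

The key property is visible in the two calls: oscillation around the baseline never accumulates, while a sustained shift does.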
- Q: What tasks do you use for benchmarking?
- A: We use real-world coding tasks in Python and TypeScript, covering algorithm implementation, debugging, code refactoring, optimization, and error recovery. Tasks are not publicly disclosed to prevent gaming. They represent practical problems developers face daily, not academic puzzles. You can verify tasks by running benchmarks with your own API keys.
- Q: How accurate are your benchmarks?
- A: We use 5-trial median scoring with 95% confidence intervals calculated using the t-distribution (df=4). Our standard error is typically ±1-3 points on a 100-point scale. For example, a score of "24.8 ± 1.3" means we're 95% confident the true score is between 23.5 and 26.1. This is far more rigorous than single-shot benchmarks that report no uncertainty.
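For reference, a t-based half-width for 5 trials (df=4) can be computed with the Python standard library alone. This is a hedged sketch, not the platform's code: the platform reports medians, while this helper uses the mean-based standard error that the "± 1.3" style of interval implies.

```python
import statistics

T_CRIT_DF4 = 2.776  # two-sided 95% t critical value for df = 4 (5 trials)

def confidence_interval(trials):
    """Return (center, half_width) for a 95% CI over 5 trial scores."""
    assert len(trials) == 5
    mean = statistics.mean(trials)
    sem = statistics.stdev(trials) / len(trials) ** 0.5  # standard error
    return mean, T_CRIT_DF4 * sem

center, half = confidence_interval([24, 25, 26, 24, 25])
print(f"{center:.1f} ± {half:.1f}")  # roughly "24.8 ± 1.0" for these made-up trials
```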
- Q: Why use median instead of mean?
- A: Median is robust to outliers. If one trial produces an anomalous result (model hallucination, API timeout, random brilliance), it won't skew the entire score. The median represents "typical" performance better than the mean when dealing with small sample sizes and potential outliers.
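A tiny example makes the difference concrete; the trial values below are made up:

```python
import statistics

# Five trial scores, one anomalous (e.g. an API timeout inflated the score).
trials = [24, 25, 26, 25, 70]

print(statistics.mean(trials))    # 34.0 -- dragged up by the single outlier
print(statistics.median(trials))  # 25  -- still reflects typical performance
```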
Technical
- Q: Can I verify your results myself?
- A: Absolutely! Use our "Test Your Keys" feature to run the same benchmarks with your own API keys. You'll get the same tasks, same scoring, same methodology—proving we're not making up numbers. Additionally, all our code is open source on GitHub, so you can audit every algorithm and even run the full platform locally.
- Q: Do you have an API?
- A: Yes! Our API provides access to current rankings, historical data, confidence intervals, and drift alerts. Endpoints include: GET /api/dashboard (current scores), GET /api/dashboard?period=7d (historical trends), and GET /api/models/:id (detailed model data). All data is free to access.
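Building those endpoint URLs might look like the following sketch. The paths come from this answer; the base URL and the helper names are hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://example-aistupidlevel-api"  # hypothetical base URL

def dashboard_url(period=None):
    """URL for current scores, or historical trends when a period is given."""
    qs = f"?{urlencode({'period': period})}" if period else ""
    return f"{BASE}/api/dashboard{qs}"

def model_url(model_id):
    """URL for detailed data on a single model."""
    return f"{BASE}/api/models/{model_id}"

print(dashboard_url())      # .../api/dashboard
print(dashboard_url("7d"))  # .../api/dashboard?period=7d
```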
- Q: What are confidence intervals and why do they matter?
- A: Confidence intervals show the range where we're 95% confident the true score lies. For example, "24.8 ± 1.3" means [23.5, 26.1]. This matters because: (1) AI is probabilistic, (2) single measurements are unreliable, (3) you need to know measurement uncertainty to make decisions, and (4) overlapping intervals mean differences might not be statistically significant.
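Point (4) can be checked mechanically. A minimal sketch, with the caveat that overlap is only a conservative screen: non-overlapping 95% CIs imply a significant difference, but overlapping ones merely suggest the difference might not be.

```python
def intervals_overlap(center_a, half_a, center_b, half_b):
    """True if two confidence intervals [c - h, c + h] overlap.

    Conservative screen: non-overlapping 95% CIs imply a significant
    difference; overlapping CIs only mean the difference might not be.
    """
    return abs(center_a - center_b) <= half_a + half_b

print(intervals_overlap(24.8, 1.3, 26.0, 1.1))  # True:  gap 1.2 <= 2.4
print(intervals_overlap(24.8, 1.3, 30.0, 1.1))  # False: gap 5.2 >  2.4
```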
- Q: How often do you update benchmarks?
- A: We run benchmarks continuously, with most models tested multiple times per week. High-priority models (GPT-5, Claude Opus 4, Gemini 2.5 Pro) are tested daily. Historical data is preserved so you can track performance over weeks, months, or years. Drift detection runs automatically on each new benchmark.
- Q: What programming languages do you test?
- A: Currently Python and TypeScript, as these are the most common languages for AI-assisted development. We're expanding to include JavaScript, Java, C++, Go, and Rust in future updates. The core methodology remains the same across languages.
Comparisons
- Q: Which AI model is best for coding?
- A: It depends on your specific needs, but currently top performers include OpenAI's GPT-5 and o3, Anthropic's Claude Opus 4, and Google's Gemini 2.5 Pro. Check our live rankings for current scores with confidence intervals. Remember: "best" varies by task type—some models excel at algorithms while others are better at refactoring.
- Q: How does GPT compare to Claude?
- A: Both GPT-5 and Claude Opus 4 are top-tier models with different strengths. GPT-5 typically scores higher on correctness and algorithmic tasks, while Claude Opus 4 excels at code quality and following specifications. Check our /compare page for detailed head-to-head analysis with statistical significance tests.
- Q: Are smaller/cheaper models worth using?
- A: It depends on your use case. Models like GPT-4o-mini or Claude Sonnet 4 offer 80-90% of flagship performance at a tenth of the cost. For production applications with high volume, they're often the better economic choice. Our benchmarks show which capabilities you sacrifice for the cost savings.
Trust & Independence
- Q: Do AI companies pay you to rank them higher?
- A: No. We have zero financial relationships with OpenAI, Anthropic, Google, xAI, or any AI model provider. We don't accept sponsorships from vendors, we don't earn affiliate commissions, and all benchmarks run on our own infrastructure with our own API keys. Our rankings are purely merit-based.
- Q: How do you fund this platform?
- A: Through community donations, sponsorships from non-vendor companies, and research grants for AI evaluation projects. We explicitly refuse funding from AI model providers to maintain independence. All financial relationships are disclosed publicly.
- Q: How can I trust your methodology?
- A: Trust through verification, not claims: (1) All code is open source on GitHub, (2) Complete methodology documentation is public, (3) "Test Your Keys" lets you reproduce results, (4) Statistical methods are peer-reviewed, (5) Community can audit everything. We want you to verify, not just trust.
Using the Platform
- Q: How do I choose the right AI model for my project?
- A: Consider: (1) Task complexity (simple tasks = smaller models OK), (2) Budget (cost per token varies 10x between models), (3) Latency requirements (some models are faster), (4) Stability needs (check our drift alerts), (5) Specific strengths (see axis breakdowns). Use our comparison tool to evaluate trade-offs.
- Q: What do the different colors/alerts mean?
- A: 🟢 Normal = Performance within expected variance. 🟡 Warning = Slight decline detected, monitoring closely. 🟠 Degradation = Sustained decline confirmed, statistically significant. 🔴 Critical = Major performance drop, immediate attention needed. Alerts are based on CUSUM drift detection calibrated per model.
- Q: Can I get notifications when a model degrades?
- A: Not yet, but it's on our roadmap! We're planning email/webhook notifications for drift alerts, customizable per model. For now, check the dashboard regularly or follow our Twitter/Reddit for major announcements. You can also use our API to build your own monitoring system.
Limitations & Future
- Q: What are the current limitations?
- A: Main limitations: (1) 5 trials may not capture extreme variance, (2) Coding-focused (not general capabilities), (3) English language only, (4) Limited task diversity (expanding), (5) Confidence intervals not yet shown in all UI elements. We're actively working on all of these.
- Q: What features are coming next?
- A: Roadmap includes: (1) Adaptive sampling (more trials for uncertain cases), (2) Expanded task set with more languages, (3) Real-time error bars in charts, (4) Email/webhook drift alerts, (5) Statistical significance indicators between models, (6) Bayesian analysis for better uncertainty quantification, (7) Provider hub pages with vendor analysis.
Still Have Questions?
Can't find what you're looking for? We're here to help.