Q: How do you score AI models?
A: We use a 7-axis scoring system: Correctness (35%), Spec Adherence (15%), Code Quality (15%), Efficiency (10%), Stability (10%), Refusal Rate (10%), and Recovery (5%). Each model runs each coding task 5 times with different random seeds. We calculate the median score and provide 95% confidence intervals using the t-distribution.
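As a minimal sketch of how a 7-axis weighted composite could be computed (the axis keys and scores below are hypothetical; only the weights come from the description above):

```python
# Axis weights from the scoring system (they sum to 1.0)
WEIGHTS = {
    "correctness": 0.35,
    "spec_adherence": 0.15,
    "code_quality": 0.15,
    "efficiency": 0.10,
    "stability": 0.10,
    "refusal_rate": 0.10,
    "recovery": 0.05,
}

def composite_score(axis_scores: dict) -> float:
    # axis_scores: per-axis scores on a 0-100 scale
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)

# Example: a model scoring 80 on every axis gets a composite of 80
example = composite_score({axis: 80.0 for axis in WEIGHTS})
```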
Q: Why do you run 5 trials instead of just 1?
A: AI models are stochastic (probabilistic), meaning the same prompt can produce different outputs. A single measurement could be a lucky or unlucky result. Running 5 trials lets us: (1) capture natural variance, (2) calculate confidence intervals, (3) use the median to avoid outlier bias, and (4) estimate true performance more accurately. It's a balance between statistical rigor and computational cost.
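To see why a single run can mislead, here is a toy illustration with hypothetical per-trial scores:

```python
# Hypothetical per-trial scores for one model (0-100 scale)
trials = [22.1, 26.4, 24.8, 23.9, 25.2]

# A single run could have reported anywhere in this range
spread = max(trials) - min(trials)               # 4.3 points of trial-to-trial variance
median_score = sorted(trials)[len(trials) // 2]  # 24.8, a stable central estimate
```

Any one of these trials in isolation would have over- or under-stated the model by up to a couple of points; the median sits in the middle of the observed distribution.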
Q: What is drift detection and how does it work?
A: Drift detection identifies sustained performance changes over time. We use the CUSUM (Cumulative Sum) algorithm, which tracks cumulative deviations from a model's baseline. Unlike simple comparisons, CUSUM distinguishes between daily noise and actual trends. Each model has calibrated thresholds based on its historical variance — noisy models get higher thresholds to avoid false alarms.
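A rough sketch of a two-sided CUSUM detector, assuming a slack value `k` (to absorb daily noise) and a decision threshold `h` (the per-model calibrated value); the numbers in the example are hypothetical:

```python
def cusum_alarms(scores, baseline, k, h):
    """Return indices where a sustained upward or downward drift is flagged.

    k: slack per observation, absorbs small day-to-day noise
    h: decision threshold, calibrated per model from historical variance
    """
    s_pos = s_neg = 0.0
    alarms = []
    for i, x in enumerate(scores):
        # Accumulate only deviations larger than the slack k
        s_pos = max(0.0, s_pos + (x - baseline) - k)
        s_neg = max(0.0, s_neg + (baseline - x) - k)
        if s_pos > h or s_neg > h:
            alarms.append(i)
            s_pos = s_neg = 0.0  # reset after signalling
    return alarms

# Five stable days around a baseline of 25, then a sustained drop to ~21
daily = [25.0, 24.5, 25.5, 24.8, 25.2, 21.0, 20.5, 21.2, 20.8, 21.5]
alarms = cusum_alarms(daily, baseline=25.0, k=1.0, h=5.0)
```

Single bad days stay below the threshold because the slack `k` drains small deviations; only a run of consistently low scores accumulates enough to fire.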
Q: What tasks do you use for benchmarking?
A: We use real-world coding tasks in Python and TypeScript, covering algorithm implementation, debugging, code refactoring, optimization, and error recovery. Tasks are not publicly disclosed to prevent gaming. They represent practical problems developers face daily, not academic puzzles. You can verify tasks by running benchmarks with your own API keys.
Q: How accurate are your benchmarks?
A: We use 5-trial median scoring with 95% confidence intervals calculated using the t-distribution (df = 4). Our margin of error is typically ±1-3 points on a 100-point scale. For example, a score of "24.8 ± 1.3" means we're 95% confident the true score is between 23.5 and 26.1. This is far more rigorous than single-shot benchmarks that show no uncertainty.
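A minimal sketch of this calculation using only the standard library, assuming a hardcoded two-sided 95% critical value for df = 4 (the trial scores below are hypothetical):

```python
import statistics

# Two-sided 95% critical value of the t-distribution with df = 4
T_CRIT_95_DF4 = 2.776

def score_with_ci(trials):
    """Return (median score, 95% margin of error) for a list of trial scores."""
    median = statistics.median(trials)
    # Standard error of the mean from the sample standard deviation
    sem = statistics.stdev(trials) / len(trials) ** 0.5
    return median, T_CRIT_95_DF4 * sem

score, margin = score_with_ci([23.9, 24.8, 25.6, 24.2, 25.5])
# Reported as e.g. "24.8 ± 0.9"
```

With a real stats dependency the critical value could be computed directly (e.g. `scipy.stats.t.ppf(0.975, df=4)`) instead of hardcoded.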
Q: Why use median instead of mean?
A: The median is robust to outliers. If one trial produces an anomalous result (a model hallucination, an API timeout, random brilliance), it won't skew the entire score. With small sample sizes and potential outliers, the median represents typical performance better than the mean.
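A toy illustration of that robustness, assuming one hypothetical trial was ruined by an API timeout:

```python
import statistics

# Hypothetical five-trial scores; the last trial hit an API timeout
trials = [24.1, 24.8, 25.3, 24.6, 3.0]

mean_score = statistics.mean(trials)      # dragged down to 20.36 by the bad trial
median_score = statistics.median(trials)  # 24.6, unaffected by the outlier
```

One broken run moves the mean by more than four points but leaves the median at a value representative of the four normal trials.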