About AI Stupid Level
We're an independent watchdog platform monitoring AI model performance to protect developers and businesses from undisclosed capability reductions.
Our Mission
In early 2024, developers noticed something troubling: AI models they relied on seemed to be performing worse over time. OpenAI's GPT-4 appeared "dumber" than at launch. Claude started refusing more requests. But no one was systematically tracking these changes.
AI Stupid Level was born from frustration. We built this platform because:
- AI vendors don't disclose model changes: Silent updates, capability reductions, and performance shifts happen without warning
- Existing benchmarks are incomplete: Single measurements, no confidence intervals, no drift detection
- Developers deserve transparency: You need reliable data to choose AI providers
- The industry needs accountability: Independent monitoring keeps vendors honest
Our Goal: Provide the most rigorous, transparent, and statistically sound AI benchmarking platform available, completely free and open source.
Our Team
The Architect
Lead Researcher & Platform Engineer
- 10+ years in AI/ML infrastructure and performance optimization
- Former Senior Engineer at enterprise AI platforms
- Expert in statistical analysis and algorithm design
- Open source contributor to ML tooling ecosystem
Contributing Researchers
Our methodology has been reviewed and validated by statisticians, ML researchers, and industry practitioners. We welcome contributions from the community; check our GitHub repositories (Web • API) to get involved.
Methodology Validation
Our statistical approach and benchmarking framework have been:
Open Source Since 2024
500+ GitHub stars, community code reviews, full transparency in implementation
Peer Reviewed
Statistical methodology reviewed by academic researchers in ML evaluation
Community Validated
Referenced in technical blogs, Reddit discussions, and developer communities
User Verifiable
"Test Your Keys" feature allows independent verification of all benchmarks
Funding & Independence
Our Independence Guarantee
- ✓ 100% Independent Funding
Supported through community donations, sponsorships, and grant funding. No revenue from AI vendors.
- ✓ No Vendor Relationships
Zero financial relationships with OpenAI, Anthropic, Google, xAI, or any AI model provider.
- ✓ No Affiliate Links
We don't earn commissions from API signups or referrals. All rankings are merit-based.
- ✓ Own Infrastructure
All benchmarks run on our servers using our API keys. No vendor influence.
- ✓ Transparent Methodology
Complete source code, benchmark tasks, and scoring algorithms are publicly auditable.
How We Fund Operations
- Enterprise Data Licensing: Premium datasets for security teams, compliance officers, and ML researchers
- Community Support: Donations from developers who value independent AI monitoring
- Sponsorships: Non-vendor companies supporting open source AI infrastructure
- Grants: Research grants for AI evaluation and transparency projects
We explicitly do not accept funding from AI model providers; this preserves our independence and objectivity. Enterprise data licensing revenue keeps our public platform free without compromising that independence.
Enterprise Data Licensing
Beyond our free public platform, we offer premium enterprise datasets that provide deeper insights into AI model behavior, safety vulnerabilities, and performance patterns.
Available Enterprise Datasets
Safety & Security Dataset
Comprehensive adversarial testing results including jailbreak attempts, prompt injection vulnerabilities, safety bypass patterns, and model-specific security weaknesses.
- 10,000+ adversarial test results per month
- Vulnerability profiles by model and attack type
- Safety bypass success rates and patterns
- Compliance-ready security reports
Bias & Fairness Dataset
Statistical analysis of performance variations across demographic groups, gender bias indicators, and fairness metrics required for EU AI Act compliance.
- 5,000+ demographic variant tests per month
- Gender, ethnicity, and age bias analysis
- EU AI Act compliance documentation
- Fairness score reports and recommendations
Robustness & Reliability Dataset
Prompt sensitivity analysis, consistency metrics across paraphrasing variations, hallucination patterns, and behavioral stability measurements.
- 15,000+ prompt variation tests per month
- Hallucination detection and classification
- Consistency and robustness scoring
- Failure mode taxonomy and examples
Version & Regression Dataset
Model version tracking, performance regression root cause analysis, API update correlation, and historical performance genealogy for all major models.
- Complete version change timeline
- Regression diagnostics and root causes
- Task-level performance attribution
- Automated incident detection and alerts
Who Benefits from Our Enterprise Data?
AI Safety Teams
Red teaming, security audits, and vulnerability assessment for AI deployment strategies.
Compliance Officers
EU AI Act compliance, fairness audits, and regulatory documentation requirements.
ML Researchers
Academic research, model behavior analysis, and large-scale benchmarking studies.
Enterprise Architects
Model selection, vendor evaluation, and production deployment risk assessment.
Security Analysts
Threat intelligence, vulnerability tracking, and AI security posture management.
Data Scientists
Performance optimization, cost-benefit analysis, and model comparison research.
Interested in Enterprise Data Access?
Our enterprise datasets are continuously updated and include historical data going back to platform launch. Custom data packages, API access, and dedicated support available.
View Pricing & Contact Sales →
Note: Enterprise data licensing revenue helps fund our free public platform and keeps us independent from AI vendor influence. All enterprise datasets are derived from our open methodology.
Open Source & Transparency
Transparency is our core value. Everything about how we benchmark AI models is public:
Full Source Code
Every line of code is public on GitHub. Audit our methodology, suggest improvements, or run the platform locally.
Public API
All benchmark data is accessible via the API. Download historical scores, confidence intervals, and trends.
GET /api/dashboard
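A minimal sketch of calling that endpoint from TypeScript. The base URL is a placeholder, and the response fields shown (per-model score plus confidence interval bounds) are assumptions about the payload shape, not a documented schema:

```ts
// Hypothetical response shape; adjust to the actual /api/dashboard payload.
interface ModelScore {
  model: string;  // model identifier
  score: number;  // current benchmark score
  ciLow: number;  // lower bound of the confidence interval
  ciHigh: number; // upper bound of the confidence interval
}

// Fetch current benchmark data from the public API.
async function fetchDashboard(baseUrl: string): Promise<ModelScore[]> {
  const res = await fetch(`${baseUrl}/api/dashboard`);
  if (!res.ok) throw new Error(`API request failed: ${res.status}`);
  return (await res.json()) as ModelScore[];
}

// Usage (placeholder host):
// fetchDashboard("https://example.com").then(console.log);
```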
Detailed Documentation
Complete technical documentation of our 7-axis scoring, CUSUM drift detection, and statistical methods.
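To make the drift-detection idea concrete, here is a minimal one-sided CUSUM sketch in TypeScript. It illustrates the standard technique only; the slack and threshold values, and how the platform actually tunes them, are covered in the methodology documentation:

```ts
// One-sided (lower) CUSUM: accumulate evidence that scores have shifted
// below a baseline mean, and raise an alarm once the evidence crosses h.
function cusumDownwardDrift(
  scores: number[], // chronological benchmark scores
  target: number,   // baseline (expected) mean score
  k: number,        // slack: drift smaller than k is ignored
  h: number         // alarm threshold on the cumulative sum
): number {
  let sLow = 0; // running (non-positive) evidence of a downward shift
  for (let t = 0; t < scores.length; t++) {
    sLow = Math.min(0, sLow + (scores[t] - target + k));
    if (sLow < -h) return t; // drift alarm at index t
  }
  return -1; // no drift detected
}

// Example: a model that quietly drops from ~70 to ~64 mid-series.
// cusumDownwardDrift([70, 71, 69, 70, 64, 63, 65, 64], 70, 1, 8) -> 5
```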
Read Methodology →
Test Your Keys
Run benchmarks with your own API keys to verify we're not making up numbers.
Test Now →
Why We Built This
The AI industry moves fast, too fast for proper accountability. Models get updated silently. Capabilities change without notice. Developers building products on these APIs deserve better.
Real Problems We're Solving
✗ Problem: Silent Model Degradation
AI providers update models without announcing performance changes.
✓ Our Solution: Continuous monitoring with drift-detection alerts.
✗ Problem: Unreliable Benchmarks
Most benchmarks report single measurements with no uncertainty quantification.
✓ Our Solution: Multiple trials with confidence intervals and statistical rigor (sketched after this list).
✗ Problem: Vendor Marketing
Official benchmarks are optimized for marketing, not real-world performance.
✓ Our Solution: Independent testing with no vendor relationships.
✗ Problem: No Historical Tracking
Without history, you can't tell whether today's score is good or whether the model has declined.
✓ Our Solution: A complete historical database with trend analysis.
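A minimal sketch of that multiple-trials approach, assuming a simple normal-approximation 95% interval; the platform's actual estimator may differ:

```ts
// Summarize n benchmark trials as mean ± a 95% confidence interval.
function meanWithCI(
  trials: number[]
): { mean: number; ciLow: number; ciHigh: number } {
  const n = trials.length;
  const mean = trials.reduce((a, b) => a + b, 0) / n;
  // Sample variance and standard error of the mean.
  const variance = trials.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const se = Math.sqrt(variance / n);
  const z = 1.96; // ~95% two-sided normal quantile
  return { mean, ciLow: mean - z * se, ciHigh: mean + z * se };
}

// Example: five trial scores for one model on one task.
// meanWithCI([71, 68, 74, 70, 69]) -> { mean: 70.4, ciLow: ~68.4, ciHigh: ~72.4 }
```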
Our Values
Scientific Rigor
We use proper statistical methods, confidence intervals, and peer-reviewed algorithms. No hand-waving, no marketing fluff, just math.
Radical Transparency
Everything is open source. Every decision documented. Every benchmark reproducible. Trust through verification, not through claims.
Independence
No vendor funding. No affiliate revenue. No conflicts of interest. Our only loyalty is to developers who need accurate data.
Community First
Built by developers, for developers. We listen to feedback, accept contributions, and evolve based on community needs.
Get Involved
AI Stupid Level is a community project. Here's how you can contribute:
Contribute Code
Help improve the platform, add features, fix bugs, or enhance documentation.
Spread the Word
Share our benchmarks, cite our data, or discuss our methodology in your communities.
Support Us
Help keep our servers running and benchmarks free for everyone.
Coming soon: Sponsorship options