About AI Stupid Level

We're an independent watchdog platform monitoring AI model performance to protect developers and businesses from undisclosed capability reductions.

Our Mission

In early 2024, developers noticed something troubling: AI models they relied on seemed to be performing worse over time. OpenAI's GPT-4 appeared "dumber" than at launch. Claude started refusing more requests. But no one was systematically tracking these changes.

AI Stupid Level was born from that frustration: no one else was tracking these changes systematically, so we built the platform to do it.

🎯 Our Goal: Provide the most rigorous, transparent, and statistically sound AI benchmarking platform available: completely free and open source.

Our Team


The Architect

Lead Researcher & Platform Engineer

  • 10+ years in AI/ML infrastructure and performance optimization
  • Former Senior Engineer at enterprise AI platforms
  • Expert in statistical analysis and algorithm design
  • Open source contributor to the ML tooling ecosystem

Contributing Researchers

Our methodology has been reviewed and validated by statisticians, ML researchers, and industry practitioners. We welcome contributions from the community: check our GitHub repositories (Web and API) to get involved.

Methodology Validation

Our statistical approach and benchmarking framework have been:

✅

Open Source Since 2024

500+ GitHub stars, community code reviews, full transparency in implementation

✅

Peer Reviewed

Statistical methodology reviewed by academic researchers in ML evaluation

✅

Community Validated

Referenced in technical blogs, Reddit discussions, and developer communities

✅

User Verifiable

"Test Your Keys" feature allows independent verification of all benchmarks

Funding & Independence

Our Independence Guarantee

  • ✓ 100% Independent Funding

    Supported through community donations, sponsorships, and grant funding. No revenue from AI vendors.

  • ✓ No Vendor Relationships

    Zero financial relationships with OpenAI, Anthropic, Google, xAI, or any AI model provider.

  • ✓ No Affiliate Links

    We don't earn commissions from API signups or referrals. All rankings are merit-based.

  • ✓ Own Infrastructure

    All benchmarks run on our servers using our API keys. No vendor influence.

  • ✓ Transparent Methodology

    Complete source code, benchmark tasks, and scoring algorithms are publicly auditable.

How We Fund Operations

To preserve independence and objectivity, we explicitly do not accept funding from AI model providers. Enterprise data licensing revenue keeps our public platform free instead.

Enterprise Data Licensing

Beyond our free public platform, we offer premium enterprise datasets that provide deeper insights into AI model behavior, safety vulnerabilities, and performance patterns.

Available Enterprise Datasets

🛡️

Safety & Security Dataset

Comprehensive adversarial testing results including jailbreak attempts, prompt injection vulnerabilities, safety bypass patterns, and model-specific security weaknesses.

  • 10,000+ adversarial test results per month
  • Vulnerability profiles by model and attack type
  • Safety bypass success rates and patterns
  • Compliance-ready security reports

⚖️

Bias & Fairness Dataset

Statistical analysis of performance variations across demographic groups, gender bias indicators, and fairness metrics required for EU AI Act compliance.

  • 5,000+ demographic variant tests per month
  • Gender, ethnicity, and age bias analysis
  • EU AI Act compliance documentation
  • Fairness score reports and recommendations

🎯

Robustness & Reliability Dataset

Prompt sensitivity analysis, consistency metrics across paraphrasing variations, hallucination patterns, and behavioral stability measurements.

  • 15,000+ prompt variation tests per month
  • Hallucination detection and classification
  • Consistency and robustness scoring
  • Failure mode taxonomy and examples

📊

Version & Regression Dataset

Model version tracking, performance regression root cause analysis, API update correlation, and historical performance genealogy for all major models.

  • Complete version change timeline
  • Regression diagnostics and root causes
  • Task-level performance attribution
  • Automated incident detection and alerts

Who Benefits from Our Enterprise Data?

🏢 AI Safety Teams

Red teaming, security audits, and vulnerability assessment for AI deployment strategies.

📋 Compliance Officers

EU AI Act compliance, fairness audits, and regulatory documentation requirements.

🔬 ML Researchers

Academic research, model behavior analysis, and large-scale benchmarking studies.

💼 Enterprise Architects

Model selection, vendor evaluation, and production deployment risk assessment.

🛡️ Security Analysts

Threat intelligence, vulnerability tracking, and AI security posture management.

📊 Data Scientists

Performance optimization, cost-benefit analysis, and model comparison research.

Interested in Enterprise Data Access?

Our enterprise datasets are continuously updated and include historical data going back to platform launch. Custom data packages, API access, and dedicated support available.

View Pricing & Contact Sales →

Note: Enterprise data licensing revenue helps fund our free public platform and keeps us independent from AI vendor influence. All enterprise datasets are derived from our open methodology.

Open Source & Transparency

Transparency is our core value. Everything about how we benchmark AI models is public:

📂 Full Source Code

Every line of code is public on GitHub. Audit our methodology, suggest improvements, or run locally.

📊 Public API

All benchmark data accessible via API. Download historical scores, confidence intervals, and trends.

GET /api/dashboard
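The endpoint above can be queried with nothing but the standard library. A minimal sketch in Python, assuming the response body is JSON; the base URL here is a placeholder, and the actual response schema should be taken from the API documentation:

```python
import json
from urllib.request import urlopen

BASE_URL = "https://example.com"  # placeholder: substitute the platform's real host


def dashboard_url(base_url: str = BASE_URL) -> str:
    """Build the dashboard endpoint URL from a base host."""
    return f"{base_url.rstrip('/')}/api/dashboard"


def fetch_dashboard(base_url: str = BASE_URL) -> dict:
    """Fetch the current benchmark snapshot and parse it as JSON."""
    with urlopen(dashboard_url(base_url)) as resp:
        return json.load(resp)
```

Because everything arrives as plain JSON over GET, the same call works from curl, a spreadsheet importer, or a CI job without any client library.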

📖 Detailed Documentation

Complete technical documentation of our 7-axis scoring, CUSUM drift detection, and statistical methods.

Read Methodology →
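As a rough illustration of the CUSUM idea behind the drift detection mentioned above: cumulative deviations from an expected score are accumulated, and an alarm fires once they exceed a threshold. This is a generic two-sided CUSUM sketch with illustrative parameters, not the platform's actual configuration:

```python
def cusum_drift(scores, target, slack=0.5, threshold=4.0):
    """Two-sided CUSUM detector.

    Accumulates deviations of `scores` from `target` (minus a `slack`
    allowance that absorbs normal noise) and returns the index of the
    first sample where either cumulative sum exceeds `threshold`.
    Returns None if no drift is detected.
    """
    hi = lo = 0.0
    for i, x in enumerate(scores):
        hi = max(0.0, hi + (x - target - slack))  # sustained upward drift
        lo = max(0.0, lo + (target - x - slack))  # sustained downward drift
        if hi > threshold or lo > threshold:
            return i
    return None
```

The slack parameter makes the detector ignore day-to-day jitter while still catching a sustained shift, which is exactly the "silent degradation" pattern this kind of monitoring is designed to surface.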

🔑 Test Your Keys

Run benchmarks with your own API keys to verify we're not making up numbers.

Test Now →

Why We Built This

The AI industry moves fast: too fast for proper accountability. Models get updated silently. Capabilities change without notice. Developers building products on these APIs deserve better.

Real Problems We're Solving

❌ Problem: Silent Model Degradation

AI providers update models without announcing performance changes

✓ Our Solution: Continuous monitoring with drift detection alerts

❌ Problem: Unreliable Benchmarks

Most benchmarks show single measurements without uncertainty quantification

✓ Our Solution: Multiple trials with confidence intervals and statistical rigor

❌ Problem: Vendor Marketing

Official benchmarks are optimized for marketing, not real-world performance

✓ Our Solution: Independent testing with no vendor relationships

❌ Problem: No Historical Tracking

Can't tell if today's score is good or if the model declined

✓ Our Solution: Complete historical database with trend analysis
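To make the "multiple trials with confidence intervals" point concrete, here is a generic normal-approximation sketch of a 95% interval for a mean score over repeated runs (a textbook formula, not the platform's exact scoring procedure):

```python
import math
import statistics


def mean_ci(samples, z=1.96):
    """Normal-approximation confidence interval for the mean of
    repeated benchmark runs: mean ± z * stdev / sqrt(n).

    Returns (mean, lower, upper); z=1.96 gives roughly 95% coverage.
    """
    n = len(samples)
    m = statistics.mean(samples)
    se = statistics.stdev(samples) / math.sqrt(n)
    return m, m - z * se, m + z * se
```

Reporting the interval rather than a single number is what lets a reader distinguish real regressions from run-to-run noise: a drop that stays inside the interval is not evidence of a change.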

Our Values

🔬 Scientific Rigor

We use proper statistical methods, confidence intervals, and peer-reviewed algorithms. No hand-waving, no marketing fluff, just math.

🌐 Radical Transparency

Everything is open source. Every decision documented. Every benchmark reproducible. Trust through verification, not through claims.

⚖️ Independence

No vendor funding. No affiliate revenue. No conflicts of interest. Our only loyalty is to developers who need accurate data.

🤝 Community First

Built by developers, for developers. We listen to feedback, accept contributions, and evolve based on community needs.

Get Involved

AI Stupid Level is a community project. Here's how you can contribute:

💻 Contribute Code

Help improve the platform, add features, fix bugs, or enhance documentation.

📣 Spread the Word

Share our benchmarks, cite our data, or discuss our methodology in your communities.

🤝 Support Us

Help keep our servers running and benchmarks free for everyone.

Coming soon: Sponsorship options

Contact & Social

Connect With Us

For General Inquiries

For Technical Questions

Ready to Explore?