Q: How do you score AI models?
A: We use a 7-axis scoring system: Correctness (35%), Spec Adherence (15%), Code Quality (15%), Efficiency (10%), Stability (10%), Refusal Rate (10%), and Recovery (5%). Each model runs each coding task 5 times with different random seeds. We calculate the median score and provide 95% confidence intervals using the t-distribution.
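As a minimal sketch of how a 7-axis weighted composite could be computed (the axis keys and scores below are hypothetical; only the weights come from the description above):

```python
# Axis weights from the scoring system (they sum to 1.0)
WEIGHTS = {
    "correctness": 0.35,
    "spec_adherence": 0.15,
    "code_quality": 0.15,
    "efficiency": 0.10,
    "stability": 0.10,
    "refusal_rate": 0.10,
    "recovery": 0.05,
}

def composite_score(axis_scores: dict) -> float:
    # axis_scores: per-axis scores on a 0-100 scale
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)

# Example: a model scoring 80 on every axis gets a composite of 80
example = composite_score({axis: 80.0 for axis in WEIGHTS})
```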
Q: Why do you run 5 trials instead of just 1?
A: AI models are stochastic (probabilistic), meaning the same prompt can produce different outputs. A single measurement could be a lucky or unlucky result. Running 5 trials lets us: (1) capture natural variance, (2) calculate confidence intervals, (3) use the median to avoid outlier bias, and (4) estimate true performance more accurately. It's a balance between statistical rigor and computational cost.
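To see why a single run can mislead, here is a toy illustration with hypothetical per-trial scores:

```python
# Hypothetical per-trial scores for one model (0-100 scale)
trials = [22.1, 26.4, 24.8, 23.9, 25.2]

# A single run could have reported anywhere in this range
spread = max(trials) - min(trials)               # 4.3 points of trial-to-trial variance
median_score = sorted(trials)[len(trials) // 2]  # 24.8, a stable central estimate
```

Any one of these trials in isolation would have over- or under-stated the model by up to a couple of points; the median sits in the middle of the observed distribution.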
Q: What is drift detection and how does it work?
A: Drift detection identifies sustained performance changes over time. We use the CUSUM (Cumulative Sum) algorithm, which tracks cumulative deviations from a model's baseline. Unlike simple comparisons, CUSUM distinguishes between daily noise and actual trends. Each model has calibrated thresholds based on its historical variance — noisy models get higher thresholds to avoid false alarms.
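A rough sketch of a two-sided CUSUM detector, assuming a slack value `k` (to absorb daily noise) and a decision threshold `h` (the per-model calibrated value); the numbers in the example are hypothetical:

```python
def cusum_alarms(scores, baseline, k, h):
    """Return indices where a sustained upward or downward drift is flagged.

    k: slack per observation, absorbs small day-to-day noise
    h: decision threshold, calibrated per model from historical variance
    """
    s_pos = s_neg = 0.0
    alarms = []
    for i, x in enumerate(scores):
        # Accumulate only deviations larger than the slack k
        s_pos = max(0.0, s_pos + (x - baseline) - k)
        s_neg = max(0.0, s_neg + (baseline - x) - k)
        if s_pos > h or s_neg > h:
            alarms.append(i)
            s_pos = s_neg = 0.0  # reset after signalling
    return alarms

# Five stable days around a baseline of 25, then a sustained drop to ~21
daily = [25.0, 24.5, 25.5, 24.8, 25.2, 21.0, 20.5, 21.2, 20.8, 21.5]
alarms = cusum_alarms(daily, baseline=25.0, k=1.0, h=5.0)
```

Single bad days stay below the threshold because the slack `k` drains small deviations; only a run of consistently low scores accumulates enough to fire.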
Q: What tasks do you use for benchmarking?
A: We use real-world coding tasks in Python and TypeScript, covering algorithm implementation, debugging, code refactoring, optimization, and error recovery. Tasks are not publicly disclosed to prevent gaming. They represent practical problems developers face daily, not academic puzzles. You can verify tasks by running benchmarks with your own API keys.
Q: How accurate are your benchmarks?
A: We use 5-trial median scoring with 95% confidence intervals calculated using the t-distribution (df = 4). Our margin of error is typically ±1-3 points on a 100-point scale. For example, a score of "24.8 ± 1.3" means we're 95% confident the true score is between 23.5 and 26.1. This is far more rigorous than single-shot benchmarks that show no uncertainty.
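A minimal sketch of this calculation using only the standard library, assuming a hardcoded two-sided 95% critical value for df = 4 (the trial scores below are hypothetical):

```python
import statistics

# Two-sided 95% critical value of the t-distribution with df = 4
T_CRIT_95_DF4 = 2.776

def score_with_ci(trials):
    """Return (median score, 95% margin of error) for a list of trial scores."""
    median = statistics.median(trials)
    # Standard error of the mean from the sample standard deviation
    sem = statistics.stdev(trials) / len(trials) ** 0.5
    return median, T_CRIT_95_DF4 * sem

score, margin = score_with_ci([23.9, 24.8, 25.6, 24.2, 25.5])
# Reported as e.g. "24.8 ± 0.9"
```

With a real stats dependency the critical value could be computed directly (e.g. `scipy.stats.t.ppf(0.975, df=4)`) instead of hardcoded.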
Q: Why use median instead of mean?
A: The median is robust to outliers. If one trial produces an anomalous result (a model hallucination, an API timeout, random brilliance), it won't skew the entire score. With small sample sizes and potential outliers, the median represents typical performance better than the mean.
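A toy illustration of that robustness, assuming one hypothetical trial was ruined by an API timeout:

```python
import statistics

# Hypothetical five-trial scores; the last trial hit an API timeout
trials = [24.1, 24.8, 25.3, 24.6, 3.0]

mean_score = statistics.mean(trials)      # dragged down to 20.36 by the bad trial
median_score = statistics.median(trials)  # 24.6, unaffected by the outlier
```

One broken run moves the mean by more than four points but leaves the median at a value representative of the four normal trials.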