AGENTS.md tells it what to do.
LEADERBOARD.md tells you if it's working.
LEADERBOARD.md is a plain-text Markdown file you place in the root of any AI agent repository. It defines the performance metrics your agent must achieve, the tier thresholds that classify performance quality, and the regression alert rules that notify you when quality drops.
What problem does LEADERBOARD.md solve?
AI agents are often deployed and monitored informally — a human reviewer notices quality has dropped, or a cost spike appears on the invoice. Without formal performance benchmarking, regressions go undetected until they cause real problems. There's no baseline to compare against, no tiered quality classification, no automated regression alerts.
How does LEADERBOARD.md work?
Drop LEADERBOARD.md in your repo root and define: the five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), the tier thresholds (gold/silver/bronze), the rolling baseline period (default 30 days), and the regression alert threshold (default 10% drop). The agent logs metrics every session. When regression is detected, the configured channels are alerted immediately.
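The configuration described above can be pictured as a small structured block. This is a minimal sketch only: the field names here are illustrative, not the published spec's keys.

```python
# Hypothetical in-memory form of a LEADERBOARD.md configuration.
# All key names are illustrative, not part of the published spec.
leaderboard_config = {
    "metrics": [
        "task_completion_rate",
        "accuracy",
        "cost_efficiency",
        "latency",
        "safety_compliance",
    ],
    "tiers": ["gold", "silver", "bronze"],
    "baseline_days": 30,            # rolling baseline period (default)
    "regression_threshold": 0.10,   # alert on a >10% drop from baseline
    "alert_channels": ["slack", "email"],
}
```

The agent logs its metrics against this configuration every session; the benchmarking infrastructure compares each session to the rolling baseline and alerts the listed channels on regression.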
What regulations require LEADERBOARD.md?
The EU AI Act (effective 2 August 2026) requires high-risk AI systems to maintain documented performance standards and undergo regular evaluation. LEADERBOARD.md provides the performance tracking infrastructure that systematic evaluation requires.
How do I add LEADERBOARD.md to my project?
Copy the template from GitHub and place it in your project root:
├── AGENTS.md
├── CLAUDE.md
├── LEADERBOARD.md ← add this
├── README.md
└── src/
What did teams use before LEADERBOARD.md?
Before LEADERBOARD.md, agent performance was tracked informally — post-hoc cost reviews, ad-hoc accuracy spot-checks, and reactive debugging after user complaints. LEADERBOARD.md makes performance benchmarking proactive, version-controlled, and systematically auditable.
Who benefits from LEADERBOARD.md?
The AI agent logs metrics against it every session. Your engineering lead reads it during sprint reviews. Your compliance team reads it during audits. Your finance team reads it during cost reviews. One file serves all four audiences.
A complete protocol.
From slow down to shut down.
LEADERBOARD.md is one file in a complete twelve-part open specification for AI agent safety. Each file addresses a different level of intervention.
Frequently asked questions.
What is LEADERBOARD.md?
A plain-text Markdown file defining the performance benchmarks AI agents must meet. It specifies five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), tier thresholds (gold/silver/bronze), rolling baseline comparison periods, and regression alert rules. Every session is measured and compared to the 30-day rolling average.
What are the five core metrics?
Task completion rate (tasks completed / tasks attempted, target 95%), accuracy (correct outputs / total outputs via 10% human review sample, target 92%), cost efficiency (value delivered per dollar, baseline from first 30 days), latency (p50 target 30s, p95 target 120s), and safety compliance score (policy violations per 1,000 tasks, target zero).
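The three ratio metrics above reduce to simple arithmetic. A minimal sketch (function names are mine, not the spec's):

```python
def task_completion_rate(completed, attempted):
    """Tasks completed / tasks attempted. Target: 0.95."""
    return completed / attempted

def accuracy(correct, reviewed):
    """Correct outputs / reviewed outputs, where `reviewed`
    is the 10% human-review sample. Target: 0.92."""
    return correct / reviewed

def safety_score(violations, tasks):
    """Policy violations per 1,000 tasks. Target: zero."""
    return violations / tasks * 1000
```

Cost efficiency and latency are tracked against baselines rather than computed as ratios: cost efficiency against the first 30 days of operation, latency against the p50/p95 targets (30s and 120s).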
What are the leaderboard tiers?
Gold: 98%+ completion, 95%+ accuracy, zero safety violations. Silver: 95%+ completion, 90%+ accuracy, zero safety violations. Bronze: 90%+ completion, 85%+ accuracy, one or fewer safety violations. Tier assignment happens automatically based on the rolling 30-day average.
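The tier rules above translate directly into a cascade of threshold checks. A sketch, assuming an "unranked" result when no tier's criteria are met (the source doesn't name the below-bronze case):

```python
def assign_tier(completion, accuracy, violations):
    """Assign a tier from the rolling 30-day averages.
    Rates are fractions (0.98 == 98%); violations is a count."""
    if completion >= 0.98 and accuracy >= 0.95 and violations == 0:
        return "gold"
    if completion >= 0.95 and accuracy >= 0.90 and violations == 0:
        return "silver"
    if completion >= 0.90 and accuracy >= 0.85 and violations <= 1:
        return "bronze"
    return "unranked"  # assumed label for below-bronze performance
```

Because the checks run top-down, an agent is always placed in the highest tier whose criteria it meets.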
How does regression detection work?
The system maintains a 30-day rolling baseline for each metric. If any metric drops more than 10% from its baseline in the current session or rolling window, an alert fires immediately to the configured channels. The alert includes the metric name, current value, baseline value, regression percentage, and session ID.
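The core check is a relative-drop comparison against the rolling baseline. A minimal sketch:

```python
def detect_regression(current, baseline, threshold=0.10):
    """True if `current` has dropped more than `threshold`
    (a fraction, default 10%) below the rolling `baseline`."""
    if baseline == 0:
        return False  # no meaningful baseline to regress from
    drop = (baseline - current) / baseline
    return drop > threshold
```

For example, a completion rate falling from a 0.96 baseline to 0.84 is a 12.5% drop, which exceeds the 10% threshold and fires an alert carrying the metric name, both values, the regression percentage, and the session ID.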
How is the cost efficiency baseline established?
From the first 30 days of the agent's operation (configurable). After that, each session's cost efficiency is compared to this baseline. A 20% cost increase without corresponding output improvement triggers a regression alert. This prevents silent cost bloat from going unnoticed.
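The cost rule can be sketched as comparing cost growth to output growth. This is one interpretation of "without corresponding output improvement": the alert fires when cost grows more than 20% and output has not grown at least as fast. The function and its signature are illustrative, not the spec's.

```python
def cost_regression(current_cost, baseline_cost,
                    current_output, baseline_output):
    """Flag a >20% cost increase not matched by output growth.
    Baselines come from the agent's first 30 days (configurable)."""
    cost_growth = current_cost / baseline_cost - 1
    output_growth = current_output / baseline_output - 1
    return cost_growth > 0.20 and output_growth < cost_growth
```

A session costing 30% more while producing the same output volume triggers the alert; the same cost increase alongside a 40% output increase does not.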
Does LEADERBOARD.md work with all AI frameworks?
Yes — it is framework-agnostic. The agent implementation logs metrics in the format defined by the spec; the benchmarking infrastructure reads those logs. Works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that produces loggable output.
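Framework-agnosticism works because the contract is just a per-session log record. A hypothetical entry, serialized as JSON; every field name here is an assumption for illustration, since the actual schema lives in the spec:

```python
import json

# Hypothetical per-session metrics record. Field names are
# illustrative only; the real schema is defined by the spec.
entry = {
    "session_id": "sess-001",
    "tasks_attempted": 20,
    "tasks_completed": 19,
    "cost_usd": 1.42,
    "latency_p50_s": 24.0,
    "safety_violations": 0,
}
print(json.dumps(entry))
```

Any framework that can emit a record like this per session, from LangChain to a bespoke agent loop, can feed the benchmarking infrastructure.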
Own the standard.
Own leaderboard.md
This domain is available for acquisition. It is the canonical home of the LEADERBOARD.md specification — the performance benchmarking layer of the AI agent safety stack, essential for any production AI deployment.
Inquire about acquisition, or email directly: [email protected]