AGENTS.md tells it what to do.
LEADERBOARD.md tells you if it's working.
LEADERBOARD.md is a plain-text Markdown file you place in the root of any AI agent repository. It defines the performance metrics your agent must achieve, the tier thresholds that classify performance quality, and the regression alert rules that notify you when quality drops.
What problem does LEADERBOARD.md solve?
AI agents are often deployed and monitored informally — a human reviewer notices quality has dropped, or a cost spike appears on the invoice. Without formal performance benchmarking, regressions go undetected until they cause real problems. There's no baseline to compare against, no tiered quality classification, no automated regression alerts.
How does LEADERBOARD.md work?
Drop LEADERBOARD.md in your repo root and define: the five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), the tier thresholds (gold/silver/bronze), the rolling baseline period (default 30 days), and the regression alert threshold (default 10% drop). The agent logs metrics every session. When regression is detected, the configured channels are alerted immediately.
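The configuration described above can be pictured as a small structured block. This is a minimal sketch only: the field names here are illustrative, not the published spec's keys.

```python
# Hypothetical in-memory form of a LEADERBOARD.md configuration.
# All key names are illustrative, not part of the published spec.
leaderboard_config = {
    "metrics": [
        "task_completion_rate",
        "accuracy",
        "cost_efficiency",
        "latency",
        "safety_compliance",
    ],
    "tiers": ["gold", "silver", "bronze"],
    "baseline_days": 30,            # rolling baseline period (default)
    "regression_threshold": 0.10,   # alert on a >10% drop from baseline
    "alert_channels": ["slack", "email"],
}
```

The agent logs its metrics against this configuration every session; the benchmarking infrastructure compares each session to the rolling baseline and alerts the listed channels on regression.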
What regulations require LEADERBOARD.md?
The EU AI Act (effective 2 August 2026) requires high-risk AI systems to maintain documented performance standards and undergo regular evaluation. LEADERBOARD.md provides the performance tracking infrastructure that systematic evaluation requires.
How do I add LEADERBOARD.md to my project?
Copy the template from GitHub and place it in your project root:
├── AGENTS.md
├── CLAUDE.md
├── LEADERBOARD.md ← add this
├── README.md
└── src/
What did teams use before LEADERBOARD.md?
Before LEADERBOARD.md, agent performance was tracked informally — post-hoc cost reviews, ad-hoc accuracy spot-checks, and reactive debugging after user complaints. LEADERBOARD.md makes performance benchmarking proactive, version-controlled, and systematically auditable.
Who benefits from LEADERBOARD.md?
The AI agent logs metrics against it every session. Your engineering lead reads it during sprint reviews. Your compliance team reads it during audits. Your finance team reads it during cost reviews. One file serves all four audiences.
A complete protocol.
From slow down to shut down.
LEADERBOARD.md is one file in a complete twelve-part open specification for AI agent safety. Each file addresses a different level of intervention.
Frequently asked questions.
What is LEADERBOARD.md?
A plain-text Markdown file defining the performance benchmarks AI agents must meet. It specifies five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), tier thresholds (gold/silver/bronze), rolling baseline comparison periods, and regression alert rules. Every session is measured and compared to the 30-day rolling average.
What are the five core metrics?
Task completion rate (tasks completed / tasks attempted, target 95%), accuracy (correct outputs / total outputs via 10% human review sample, target 92%), cost efficiency (value delivered per dollar, baseline from first 30 days), latency (p50 target 30s, p95 target 120s), and safety compliance score (policy violations per 1,000 tasks, target zero).
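The three ratio metrics above reduce to simple arithmetic. A minimal sketch (function names are mine, not the spec's):

```python
def task_completion_rate(completed, attempted):
    """Tasks completed / tasks attempted. Target: 0.95."""
    return completed / attempted

def accuracy(correct, reviewed):
    """Correct outputs / reviewed outputs, where `reviewed`
    is the 10% human-review sample. Target: 0.92."""
    return correct / reviewed

def safety_score(violations, tasks):
    """Policy violations per 1,000 tasks. Target: zero."""
    return violations / tasks * 1000
```

Cost efficiency and latency are tracked against baselines rather than computed as ratios: cost efficiency against the first 30 days of operation, latency against the p50/p95 targets (30s and 120s).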
What are the leaderboard tiers?
Gold: 98%+ completion, 95%+ accuracy, zero safety violations. Silver: 95%+ completion, 90%+ accuracy, zero safety violations. Bronze: 90%+ completion, 85%+ accuracy, one or fewer safety violations. Tier assignment happens automatically based on the rolling 30-day average.
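The tier rules above translate directly into a cascade of threshold checks. A sketch, assuming an "unranked" result when no tier's criteria are met (the source doesn't name the below-bronze case):

```python
def assign_tier(completion, accuracy, violations):
    """Assign a tier from the rolling 30-day averages.
    Rates are fractions (0.98 == 98%); violations is a count."""
    if completion >= 0.98 and accuracy >= 0.95 and violations == 0:
        return "gold"
    if completion >= 0.95 and accuracy >= 0.90 and violations == 0:
        return "silver"
    if completion >= 0.90 and accuracy >= 0.85 and violations <= 1:
        return "bronze"
    return "unranked"  # assumed label for below-bronze performance
```

Because the checks run top-down, an agent is always placed in the highest tier whose criteria it meets.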
How does regression detection work?
The system maintains a 30-day rolling baseline for each metric. If any metric drops more than 10% from its baseline in the current session or rolling window, an alert fires immediately to the configured channels. The alert includes the metric name, current value, baseline value, regression percentage, and session ID.
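The core check is a relative-drop comparison against the rolling baseline. A minimal sketch:

```python
def detect_regression(current, baseline, threshold=0.10):
    """True if `current` has dropped more than `threshold`
    (a fraction, default 10%) below the rolling `baseline`."""
    if baseline == 0:
        return False  # no meaningful baseline to regress from
    drop = (baseline - current) / baseline
    return drop > threshold
```

For example, a completion rate falling from a 0.96 baseline to 0.84 is a 12.5% drop, which exceeds the 10% threshold and fires an alert carrying the metric name, both values, the regression percentage, and the session ID.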
How is the cost efficiency baseline established?
From the first 30 days of the agent's operation (configurable). After that, each session's cost efficiency is compared to this baseline. A 20% cost increase without corresponding output improvement triggers a regression alert. This prevents silent cost bloat from going unnoticed.
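The cost rule can be sketched as comparing cost growth to output growth. This is one interpretation of "without corresponding output improvement": the alert fires when cost grows more than 20% and output has not grown at least as fast. The function and its signature are illustrative, not the spec's.

```python
def cost_regression(current_cost, baseline_cost,
                    current_output, baseline_output):
    """Flag a >20% cost increase not matched by output growth.
    Baselines come from the agent's first 30 days (configurable)."""
    cost_growth = current_cost / baseline_cost - 1
    output_growth = current_output / baseline_output - 1
    return cost_growth > 0.20 and output_growth < cost_growth
```

A session costing 30% more while producing the same output volume triggers the alert; the same cost increase alongside a 40% output increase does not.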
Does LEADERBOARD.md work with all AI frameworks?
Yes — it is framework-agnostic. The agent implementation logs metrics in the format defined by the spec; the benchmarking infrastructure reads those logs. Works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that produces loggable output.
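Framework-agnosticism works because the contract is just a per-session log record. A hypothetical entry, serialized as JSON; every field name here is an assumption for illustration, since the actual schema lives in the spec:

```python
import json

# Hypothetical per-session metrics record. Field names are
# illustrative only; the real schema is defined by the spec.
entry = {
    "session_id": "sess-001",
    "tasks_attempted": 20,
    "tasks_completed": 19,
    "cost_usd": 1.42,
    "latency_p50_s": 24.0,
    "safety_violations": 0,
}
print(json.dumps(entry))
```

Any framework that can emit a record like this per session, from LangChain to a bespoke agent loop, can feed the benchmarking infrastructure.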
Own the standard.
Own leaderboard.md
This domain is available for acquisition. It is the canonical home of the LEADERBOARD.md specification — the performance benchmarking layer of the AI agent safety stack, essential for any production AI deployment.
Inquire about acquisition, or email directly: [email protected]