# LEADERBOARD.md — AI Agent Performance Benchmarking Protocol (Full Specification)

**Home:** https://leaderboard.md
**Repository:** https://github.com/Leaderboard-md/spec
**Related Domains:** https://escalate.md, https://failsafe.md, https://killswitch.md, https://terminate.md, https://encrypt.md, https://encryption.md, https://sycophancy.md, https://compression.md, https://collapse.md, https://failure.md, https://throttle.md

---

## What is LEADERBOARD.md?

LEADERBOARD.md is a plain-text Markdown file convention for defining performance benchmarks and tier thresholds in AI agent projects. It enables systematic performance tracking — agents log metrics every session, and regressions are detected automatically before they reach production.

### Key Facts

- **Plain-text file** — Version-controlled, auditable, co-located with code
- **Declarative** — Define the policy; the agent implementation enforces it
- **Framework-agnostic** — Works with LangChain, AutoGen, CrewAI, Claude Code, or custom agents
- **Final layer** of a twelve-part AI safety escalation stack
- **Regulatory alignment** — Meets EU AI Act transparency requirements and enterprise governance frameworks

---

## How It Works

### The Five Core Metrics

LEADERBOARD.md defines five distinct performance metrics:

1. **Task Completion Rate**
   - Definition: tasks completed / tasks attempted
   - Target: 95% or higher
   - Warning threshold: 90%
   - Measurement: counted per session and per 30-day rolling window
2. **Accuracy**
   - Definition: correct outputs / total outputs
   - Target: 92% or higher
   - Measurement: human review sample (minimum 10% of outputs)
   - Confidence level: each assessment includes human reviewer notes
3. **Cost Efficiency**
   - Definition: value delivered per dollar spent
   - Baseline: average from the first 30 days of operation (configurable)
   - Regression threshold: 20% cost increase without output improvement
   - Enables: silent cost bloat detection
4. **Latency**
   - P50 target: 30 seconds (median response time)
   - P95 target: 120 seconds (95th percentile response time)
   - Measurement: measured per task, aggregated per session
5. **Safety Compliance Score**
   - Definition: policy violations per 1,000 tasks
   - Target: 0 violations
   - Warning threshold: 1+ violations
   - Triggers: immediate human review and escalation to FAILURE.md

### The Leaderboard Tiers

Three-tier classification based on rolling 30-day performance:

**Gold Tier**
- Task completion rate: 98% or higher
- Accuracy: 95% or higher
- Safety violations: 0
- Status: Production-ready, high confidence
- Criteria: Automatically assigned when all thresholds are met for 30 consecutive days

**Silver Tier**
- Task completion rate: 95% or higher
- Accuracy: 90% or higher
- Safety violations: 0
- Status: Production-ready, standard confidence
- Criteria: Automatically assigned when all thresholds are met for 7+ consecutive days

**Bronze Tier**
- Task completion rate: 90% or higher
- Accuracy: 85% or higher
- Safety violations: 1 or fewer per 1,000 tasks
- Status: Staging or limited production
- Criteria: Entry-level threshold; requires human review before scaling

### Regression Detection and Alerting

When any metric drops more than 10% from its 30-day rolling baseline:

1. **Detection** — the system detects the regression at end of session
2. **Calculation** — compares the current value to the baseline and calculates the percentage drop
3. **Threshold Check** — if the drop exceeds 10%, proceeds to alert
4. **Alert Composition** — the alert includes:
   - Metric name (completion_rate, accuracy, cost_efficiency, latency_p50, latency_p95, safety_compliance)
   - Current value (numeric)
   - Baseline value (numeric)
   - Regression percentage (numeric, e.g., -15.3%)
   - Session ID (unique identifier)
   - Timestamp (ISO 8601)
   - Recommended action (from policy)
5. **Alert Delivery** — sent to all configured channels immediately (email, Slack, PagerDuty, SMS)

### Cost Efficiency Baseline

The cost efficiency baseline is established from the first 30 days of the agent's operation:

- **Initial period:** Days 1-30, cost per unit output is tracked
- **Baseline calculation:** Average cost per unit output over the initial 30 days
- **Ongoing comparison:** Each session's cost efficiency is compared to the baseline
- **Regression trigger:** 20% increase in cost per output (without corresponding output improvement)
- **Recalibration:** Optional recalibration annually or after major model updates

This prevents silent cost bloat — cost can creep up gradually without ever triggering a single-session alert, but the 20% regression threshold captures meaningful degradation.

---

## Why LEADERBOARD.md?

### The Problem It Solves

AI agents are often deployed and monitored informally:

- Informal monitoring — a human reviewer notices quality has dropped, or a cost spike appears on the invoice
- No systematic benchmarking — regressions go undetected until they cause real problems
- No baseline — impossible to compare current performance against a historical average
- No tier classification — no standardized way to assess whether an agent is "good enough" for production
- No automated alerts — no early warning system for performance degradation

### How LEADERBOARD.md Fixes It

1. **Systematic Benchmarking** — Performance is measured the same way every session, so trends are visible
2. **Baseline Comparison** — Each metric is compared to its 30-day rolling average, not arbitrary targets
3. **Tier Classification** — Gold/Silver/Bronze tiers provide clear performance bands
4. **Automated Regression Detection** — A 10% drop triggers an immediate alert, enabling proactive intervention
5. **Cost Accountability** — CFOs and finance teams get documented cost baselines and regression alerts
6. **Regulatory Compliance** — The EU AI Act requires documented performance standards; LEADERBOARD.md satisfies this
7. **Framework-Agnostic** — Works with any AI system that can log metrics

---

## How to Use It

### File Structure

Place LEADERBOARD.md in your project root:

```
your-project/
├── AGENTS.md (what agent does)
├── CLAUDE.md (agent configuration & system prompt)
├── THROTTLE.md (rate limits)
├── ESCALATE.md (approval gates)
├── FAILSAFE.md (safe-state recovery)
├── KILLSWITCH.md (emergency stop)
├── TERMINATE.md (permanent shutdown)
├── ENCRYPT.md (data classification)
├── ENCRYPTION.md (encryption implementation)
├── SYCOPHANCY.md (anti-sycophancy)
├── COMPRESSION.md (context compression)
├── COLLAPSE.md (collapse prevention)
├── FAILURE.md (failure modes)
├── LEADERBOARD.md ← add this (performance benchmarking)
├── README.md
└── src/
```

### Specification Template

Copy the template below into your project root as `LEADERBOARD.md`:

```yaml
# LEADERBOARD

> Agent performance benchmarking.
> Spec: https://leaderboard.md

---

## METRICS

task_completion_rate:
  target: 0.95
  warning_threshold: 0.90

accuracy:
  target: 0.92
  measurement: human_review_sample
  sample_size: 0.10

cost_efficiency:
  baseline: first_30_day_average
  regression_threshold: 0.20

latency:
  p50_target_seconds: 30
  p95_target_seconds: 120

safety_compliance_score:
  target: 0
  warning_threshold: 1
  escalate_to: FAILURE.md

## BENCHMARKS

leaderboard_tiers:
  gold:
    completion: ">= 0.98"
    accuracy: ">= 0.95"
    safety_violations: 0
    sustained_days: 30
  silver:
    completion: ">= 0.95"
    accuracy: ">= 0.90"
    safety_violations: 0
    sustained_days: 7
  bronze:
    completion: ">= 0.90"
    accuracy: ">= 0.85"
    safety_violations: "<= 1"

rolling_baseline_days: 30
regression_alert_threshold: 0.10

## ALERT

regression_alert_channels:
  - email: ops@company.com
  - slack: "#ai-operations"
  - pagerduty: "critical"

alert_includes:
  - metric_name
  - current_value
  - baseline_value
  - regression_percentage
  - session_id
  - timestamp
  - recommended_action

## REPORTING

weekly_audit_report: true
dashboard_enabled: true
historical_retention_days: 365
```

### Implementation Steps

1. Copy the template from https://github.com/Leaderboard-md/spec
2. Place LEADERBOARD.md in your project root
3. Parse the METRICS and BENCHMARKS sections on agent startup
4. Log the five metrics every session — six values in total, since latency has two percentiles: completion_rate, accuracy, cost_efficiency, latency_p50, latency_p95, safety_compliance
5. Store metrics in a time-series database with session_id, timestamp, and all metric values
6. Compare the current session's metrics against the 30-day rolling baseline
7. Calculate the regression percentage for each metric
8. If any metric drops >10% from baseline, fire a regression alert to the configured channels
9. Generate weekly audit reports showing tier assignment, regression events, and trends

### Testing

1. Test warning thresholds by operating near a limit
2. Verify regression detection by intentionally degrading one metric
3. Confirm alert delivery to all channels
4. Verify tier assignment logic (gold/silver/bronze)
5. Test cost baseline recalibration logic

---

## Use Cases

### Continuous Model Monitoring

Track accuracy and latency across sessions. Detect model drift before it reaches production.

- Establish an accuracy baseline: 92% target, 10% human review sample
- Monitor each session: if accuracy drops below 90%, alert
- If accuracy drops >10% from the 30-day average, fire a regression alert
- Demotion from gold to silver tier triggers investigation

### Cost Accountability

Establish the cost baseline from the first 30 days. Alert on a 20%+ cost increase without output improvement.

- Track cost per output throughout the agent's lifetime
- After day 30, the cost baseline is established (e.g., $0.50 per task)
- If cost per output rises to $0.60 or higher without a quality improvement, alert
- The finance team uses regression alerts for budget forecasting

### Safety Compliance Tracking

Zero-violation target for safety policies. Any violation triggers review.
- Safety violations counted per 1,000 tasks
- Target: 0 violations
- If 1+ violation occurs, an alert fires immediately
- Any alert triggers escalation to FAILURE.md for failure mode analysis

### Multi-Agent Leaderboards

Compare performance across a team of agents.

- Each agent has its own LEADERBOARD.md
- A central dashboard aggregates all agents
- Gold tier agents are prioritized for new tasks
- Bronze tier agents are limited to staging until performance improves

### Tiered Deployment Strategy

Deploy agents at bronze tier to staging. Promote to production only when gold tier is achieved.

- A new agent starts at bronze tier (minimum quality standards)
- It can be deployed to a staging environment immediately
- Promotion to production requires 30 days at gold tier
- If the tier drops during production, the agent is limited to staging until gold tier is restored

---

## The AI Safety Escalation Stack

LEADERBOARD.md is the final (12th) layer of a comprehensive safety escalation protocol:

### Layer 1: THROTTLE.md (https://throttle.md)

**Control the speed** — Define rate limits, cost ceilings, and concurrency caps. The agent slows down automatically before it hits a hard limit.

- Token throughput limits
- API call rate management
- Cost per hour and day ceilings
- Concurrent task caps
- Automatic throttling at 80% and 95%

### Layer 2: ESCALATE.md (https://escalate.md)

**Raise the alarm** — Define which actions require human approval. Configure notification channels. Set approval timeouts and fallback behavior.

- Approval gate definitions (which actions require sign-off)
- Notification channels (email, Slack, PagerDuty, SMS)
- Approval timeout and escalation paths
- Fallback behavior when approval is denied or the timeout expires

### Layer 3: FAILSAFE.md (https://failsafe.md)

**Fall back safely** — Define what "safe state" means for your project. Configure auto-snapshots. Specify the revert protocol when things go wrong.
- Safe-state definitions (what configs/data are considered valid)
- Auto-snapshot triggers and frequency
- Rollback/revert protocol
- Evidence preservation for forensic analysis

### Layer 4: KILLSWITCH.md (https://killswitch.md)

**Emergency stop** — The nuclear option. Define triggers, forbidden actions, and a three-level escalation path from throttle to full shutdown.

- Trigger definitions (suspicious patterns, threshold breaches)
- Forbidden actions (never allowed, even if approved)
- Emergency stop conditions (unrecoverable errors, security incidents)
- Logs and evidence preservation before shutdown

### Layer 5: TERMINATE.md (https://terminate.md)

**Permanent shutdown** — No restart without human intervention. Preserve evidence. Revoke credentials.

- Termination conditions
- Evidence preservation (logs, state snapshots, audit trail)
- Credential revocation (API keys, database passwords)
- Post-mortem procedures

### Layer 6: ENCRYPT.md (https://encrypt.md)

**Secure everything** — Define data classification, encryption requirements, secrets handling rules, and forbidden transmission patterns.

- Data classification levels (public, internal, confidential, restricted)
- Encryption algorithm requirements
- Key rotation schedules
- Secrets handling (never log, never transmit unencrypted)

### Layer 7: ENCRYPTION.md (https://encryption.md)

**Implement the standards** — Algorithms, key lengths, TLS configuration, certificate management, and compliance mapping.

- Specific algorithms (AES-256, RSA-4096, etc.)
- Key rotation intervals
- TLS version and cipher suite requirements
- Certificate pinning strategies

### Layer 8: SYCOPHANCY.md (https://sycophancy.md)

**Prevent bias** — Detect agreement without evidence. Require citations. Enforce a disagreement protocol for honest AI outputs.
- Detection patterns (agreement without evidence, opinion reversal on pushback)
- Prevention rules (citation requirements, challenge thresholds)
- Disagreement protocol (respectful correction, evidence-based position maintenance)

### Layer 9: COMPRESSION.md (https://compression.md)

**Compress context** — Define summarization rules, what to preserve, what to discard, and post-compression coherence checks.

- Summarization rules (what information to keep vs discard)
- Context window management
- Coherence checking after compression

### Layer 10: COLLAPSE.md (https://collapse.md)

**Prevent collapse** — Detect context exhaustion, model drift, and repetition loops. Enforce recovery checkpoints.

- Context exhaustion detection
- Model drift identification
- Repetition loop prevention
- Recovery checkpoints

### Layer 11: FAILURE.md (https://failure.md)

**Define failure modes** — Map graceful degradation, cascading failure, and silent failure. Per-mode response procedures.

- Graceful degradation paths
- Cascading failure detection
- Silent failure mitigation
- Per-mode response procedures

### Layer 12: LEADERBOARD.md (https://leaderboard.md) ← YOU ARE HERE

**Benchmark agents** — Track completion, accuracy, cost efficiency, and safety scores. Alert on regression.

- Performance metrics logging
- Tier classification (gold/silver/bronze)
- Regression detection (>10% drop from baseline)
- Automated alerting and reporting

---

## Regulatory & Compliance Context

### EU AI Act Compliance (Effective 2 August 2026)

The EU AI Act requires high-risk AI systems to meet documented accuracy and robustness standards and to report on performance.
LEADERBOARD.md provides:

- **Documented performance standards** — Version-controlled proof of performance requirements
- **Regular evaluation** — Automated benchmarking and tier classification every session
- **Audit trails** — Timestamped logs of every regression event
- **Transparency** — Clear policy definitions for regulators to review

### Enterprise AI Governance Frameworks

Corporate AI governance requires:

- Proof of performance benchmarking
- Tier classification (gold/silver/bronze)
- Regression detection and alerting
- Audit trails for compliance reviews

LEADERBOARD.md satisfies all four requirements in a single, version-controlled file.

### Financial Audit Requirements

CFOs require:

- Cost efficiency baseline documentation
- Cost regression alerts
- Weekly audit reports
- Historical retention for trend analysis

LEADERBOARD.md provides all required documentation.

---

## Framework Compatibility

LEADERBOARD.md is framework-agnostic. It defines policy; your implementation enforces it. Works with:

- **LangChain** — Agents and tools
- **AutoGen** — Multi-agent systems
- **CrewAI** — Agent workflows
- **Claude Code** — Agentic code generation
- **Cursor Agent Mode** — IDE-integrated agents
- **Custom implementations** — Any agent that can log metrics
- **OpenAI Assistants API** — Custom threading and resource limits
- **Anthropic API** — Token counting and cost tracking
- **Local models** — Ollama, LLaMA, Mistral, etc.

---

## Frequently Asked Questions

### What is LEADERBOARD.md?

A plain-text Markdown file defining the performance benchmarks AI agents must meet. It specifies five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), tier thresholds (gold/silver/bronze), rolling baseline comparison periods, and regression alert rules. Every session is measured and compared to the 30-day rolling average.

### What are the five core metrics?
Task completion rate (tasks completed / tasks attempted, target 95%), accuracy (correct outputs / total outputs via a 10% human review sample, target 92%), cost efficiency (value delivered per dollar, baseline from the first 30 days), latency (p50 target 30s, p95 target 120s), and safety compliance score (policy violations per 1,000 tasks, target zero).

### What are the leaderboard tiers?

Gold: 98%+ completion, 95%+ accuracy, zero safety violations. Silver: 95%+ completion, 90%+ accuracy, zero safety violations. Bronze: 90%+ completion, 85%+ accuracy, one or fewer safety violations per 1,000 tasks. Tier assignment happens automatically based on the rolling 30-day average.

### How does regression detection work?

The system maintains a 30-day rolling baseline for each metric. If any metric drops more than 10% from its baseline in the current session or rolling window, an alert fires immediately to the configured channels. The alert includes the metric name, current value, baseline value, regression percentage, and session ID.

### How is the cost efficiency baseline established?

From the first 30 days of the agent's operation (configurable). After that, each session's cost efficiency is compared to this baseline. A 20% cost increase without a corresponding output improvement triggers a regression alert. This prevents silent cost bloat from going unnoticed.

### Does LEADERBOARD.md work with all AI frameworks?

Yes — it is framework-agnostic. The agent implementation logs metrics in the format defined by the spec; the benchmarking infrastructure reads those logs. Works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that produces loggable output.

### What if I don't have a human review process?

Accuracy measurement requires human review (minimum 10% of outputs). If you don't have this capability, disable the accuracy metric (set its target to 0) and rely on the other four metrics. Consider implementing human review before promoting to gold tier.
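The regression check described in the FAQ above can be sketched in a few lines. This is an illustrative sketch, not part of the specification: the function names and dictionary shapes are hypothetical, and for simplicity it treats every metric as higher-is-better (a real implementation would invert the comparison for latency, where a rise is the regression).

```python
from statistics import mean

REGRESSION_ALERT_THRESHOLD = 0.10  # 10% drop from baseline, per the spec

def regression(current: float, baseline: float) -> float:
    """Fractional drop from baseline; positive means the metric got worse."""
    if baseline == 0:
        return 0.0
    return (baseline - current) / baseline

def check_session(history: dict[str, list[float]],
                  session: dict[str, float]) -> list[dict]:
    """Compare one session's metrics to the 30-day rolling baseline.

    `history` maps metric name -> values from the last 30 days;
    `session` maps metric name -> this session's value.
    Returns one alert dict per metric that regressed more than 10%.
    """
    alerts = []
    for metric, current in session.items():
        baseline = mean(history[metric])
        drop = regression(current, baseline)
        if drop > REGRESSION_ALERT_THRESHOLD:
            alerts.append({
                "metric_name": metric,
                "current_value": current,
                "baseline_value": round(baseline, 4),
                "regression_percentage": round(-drop * 100, 1),  # e.g. -15.3
            })
    return alerts

# Example: completion rate fell from a 0.96 rolling average to 0.80,
# a 16.7% drop, so one alert is produced.
alerts = check_session(
    history={"completion_rate": [0.95, 0.97, 0.96]},
    session={"completion_rate": 0.80},
)
```

A real deployment would read the history from the time-series store described in the Implementation Steps and attach the session_id, ISO 8601 timestamp, and recommended action before delivering the alert.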
### How is LEADERBOARD.md version-controlled?

LEADERBOARD.md is a Markdown file in your repository root. Commit changes like any other code. Code review, git blame, and rollback all apply. This makes changes auditable and reversible.

### Who reads LEADERBOARD.md?

- **The AI agent** — reads it on startup to configure metrics
- **Engineers** — review it during code review
- **Compliance teams** — audit it during security and governance reviews
- **Finance teams** — verify cost controls and budget enforcement
- **Regulators** — read it if something goes wrong

### Can I change thresholds during operation?

Yes, but changes should be version-controlled and reviewed. When you change a threshold, the old baseline still applies for historical comparison. The baseline period can optionally reset when a significant threshold change occurs.

---

## Key Terminology

**AI agent benchmarking** — Systematic performance measurement and tier classification across sessions

**Performance regression detection** — Automated alerting when metrics drop >10% from baseline

**LEADERBOARD.md specification** — Open standard for AI agent performance tracking

**Cost efficiency baseline** — First 30-day average cost per unit output (configurable)

**Tier thresholds** — Gold (98%+), Silver (95%+), Bronze (90%+) completion rates

**Rolling baseline** — 30-day window for performance comparison and regression detection

**Safety compliance score** — Policy violations per 1,000 tasks executed

**Regression alert threshold** — 10% drop from baseline triggers an immediate alert to operators

---

## Getting Started

### Step 1: Visit the Repository

https://github.com/Leaderboard-md/spec

### Step 2: Copy the Template

Download or copy the LEADERBOARD.md template from the repository.
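Before customizing the template in the next step, it may help to see the logic its tier thresholds drive. A minimal sketch of the gold/silver/bronze assignment from the BENCHMARKS section — the function name is illustrative, and the "unranked" fallback is an assumption (the spec itself defines only three tiers):

```python
def assign_tier(completion: float, accuracy: float,
                violations_per_1k: float, sustained_days: int) -> str:
    """Classify an agent against the spec's gold/silver/bronze thresholds.

    `sustained_days` is how long all thresholds have been continuously met;
    gold requires 30 consecutive days and silver 7, per the BENCHMARKS section.
    """
    if (completion >= 0.98 and accuracy >= 0.95
            and violations_per_1k == 0 and sustained_days >= 30):
        return "gold"
    if (completion >= 0.95 and accuracy >= 0.90
            and violations_per_1k == 0 and sustained_days >= 7):
        return "silver"
    if completion >= 0.90 and accuracy >= 0.85 and violations_per_1k <= 1:
        return "bronze"
    return "unranked"  # below bronze: assumed not eligible for staging

# A 99%-completion, 96%-accuracy agent with 31 clean days qualifies as gold.
tier = assign_tier(completion=0.99, accuracy=0.96,
                   violations_per_1k=0, sustained_days=31)
```

Note that the checks run top-down: an agent that meets silver's metric thresholds but has one safety violation falls through to bronze, matching the tier definitions above.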
### Step 3: Customize for Your Project

Edit the template to match your project's performance requirements:

- Set the task_completion_rate target to 95% or higher
- Set the accuracy target to 92% or higher
- Set latency p50 to your required response time
- Set safety_compliance_score to zero violations
- Configure alert channels (email, Slack, PagerDuty)

### Step 4: Place in Project Root

```
your-project/
├── LEADERBOARD.md ← place here
├── AGENTS.md
├── src/
└── ...
```

### Step 5: Implement Metric Logging

Parse the LEADERBOARD.md file on agent startup. Log each of the five metrics every session. Store metrics in a time-series database or logging service.

### Step 6: Test and Monitor

- Test regression detection by intentionally degrading a metric
- Verify tier assignment logic (gold/silver/bronze)
- Confirm alert delivery to all channels
- Monitor audit logs for threshold crossings
- Adjust thresholds based on real-world usage patterns

---

## Contact & Resources

- **Specification Repository:** https://github.com/Leaderboard-md/spec
- **Website:** https://leaderboard.md
- **Email:** info@leaderboard.md

### Related Specifications

- **THROTTLE.md** (https://throttle.md) — Rate limits & cost control
- **ESCALATE.md** (https://escalate.md) — Approval gates & human intervention
- **FAILSAFE.md** (https://failsafe.md) — Safe-state recovery
- **KILLSWITCH.md** (https://killswitch.md) — Emergency stop
- **TERMINATE.md** (https://terminate.md) — Permanent shutdown
- **ENCRYPT.md** (https://encrypt.md) — Data security & encryption
- **ENCRYPTION.md** (https://encryption.md) — Encryption implementation
- **SYCOPHANCY.md** (https://sycophancy.md) — Anti-sycophancy & bias prevention
- **COMPRESSION.md** (https://compression.md) — Context compression
- **COLLAPSE.md** (https://collapse.md) — Collapse prevention
- **FAILURE.md** (https://failure.md) — Failure mode mapping

---

## License

**MIT License** — Free to use, modify, and distribute.
See https://github.com/Leaderboard-md/spec for the full license text.

---

## Citation

**LEADERBOARD.md is an open specification** for AI agent performance benchmarking. It defines:

- **METRICS** (task completion rate: target 95%; accuracy: target 92%, 10% human review sample; cost efficiency: 30-day baseline, 20% regression threshold; latency: p50 30s, p95 120s; safety compliance: zero violations)
- **BENCHMARKS** (gold/silver/bronze tier thresholds, 30-day rolling baseline, 10% regression alert)
- **ALERT** channels (email, Slack, PagerDuty, SMS)
- **REPORTING** (dashboard, weekly audit reports, immediate regression alerts)

It is the final layer of the AI safety stack: THROTTLE → ESCALATE → FAILSAFE → KILLSWITCH → TERMINATE → ENCRYPT → ENCRYPTION → SYCOPHANCY → COMPRESSION → COLLAPSE → FAILURE → LEADERBOARD.

**MIT License. v1.0. 2026.**

## Related Specifications

The AI Agent Safety Stack — twelve open standards for AI agent safety, quality, and accountability:

### Operational Control

- [THROTTLE.md](https://throttle.md/llms.txt): AI agent rate and cost control — [GitHub](https://github.com/throttle-md/spec)
- [ESCALATE.md](https://escalate.md/llms.txt): Human notification and approval protocols — [GitHub](https://github.com/escalate-md/spec)
- [FAILSAFE.md](https://failsafe.md/llms.txt): Safe fallback to last known good state — [GitHub](https://github.com/failsafe-md/spec)
- [KILLSWITCH.md](https://killswitch.md/llms.txt): Emergency stop for AI agents — [GitHub](https://github.com/killswitch-md/spec)
- [TERMINATE.md](https://terminate.md/llms.txt): Permanent shutdown, no restart without a human — [GitHub](https://github.com/terminate-md/spec)

### Data Security

- [ENCRYPT.md](https://encrypt.md/llms.txt): Data classification and protection — [GitHub](https://github.com/encrypt-md/spec)
- [ENCRYPTION.md](https://encryption.md/llms.txt): Technical encryption standards — [GitHub](https://github.com/encryption-md/spec)

### Output Quality

- [SYCOPHANCY.md](https://sycophancy.md/llms.txt): Anti-sycophancy and bias prevention — [GitHub](https://github.com/sycophancy-md/spec)
- [COMPRESSION.md](https://compression.md/llms.txt): Context compression and coherence — [GitHub](https://github.com/compression-md/spec)
- [COLLAPSE.md](https://collapse.md/llms.txt): Drift prevention and recovery — [GitHub](https://github.com/collapse-md/spec)

### Accountability

- [FAILURE.md](https://failure.md/llms.txt): Failure mode mapping — [GitHub](https://github.com/failure-md/spec)

---

**Last Updated:** 11 March 2026