Refreshing general-knowledge benchmarks without leakage
Static QA tests are showing their age; how to refresh without rewarding recollection.

General-knowledge benchmarks like MMLU (Massive Multitask Language Understanding; Hendrycks et al., 2020; HF dataset card) gave us a shared yardstick for LLMs, but as models absorb more of the internet, those tests lose both headroom and trustworthiness. Recent re-releases and companions, including MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), MMLU-CF (Zhao et al., 2024), and LiveBench (White et al., 2024), aim to restore headroom and reduce contamination.
In this issue we’re going to look at what’s actually changed, how to use these sets without leaking your own evals, and what to log so your scores mean something to buyers, regulators, and your future self.
Why this matters
Benchmark scores routinely drive enterprise procurement and launch decisions. When a test leaks into training data, you get inflated, brittle scores and poor cross-lab comparability. The Emperor’s New Clothes in Benchmarking? (ICML 2025) shows, in controlled “mild” and “intensive” contamination scenarios (10 LLMs, 5 benchmarks), that benchmark data contamination falsely inflates performance estimates and that common “update the test” mitigations don’t reliably restore contamination resistance.
The last year saw three distinct approaches to “refreshing” MMLU-style evals:
MMLU-Redux: Re-annotation to correct labels and item quality.
MMLU-Pro: Hardening via trickier items and more answer choices.
MMLU-CF: Gating & split discipline with a closed test and public validation.
Gotchas and fixes
Four common failures and practical controls to ship with your scorecard:
Label errors (MMLU-Redux) — legacy MMLU contains mislabeled/ambiguous items that create artificial deltas. Fix: pin the corrected set (v2.0 DOI) and report dual-key accuracy (original vs. corrected; see the sketch after this list); slice by error tag to localize regressions. See Gema et al., 2024 and the HF dataset card.
Guessing headroom (MMLU-Pro) — 4-choice MCQs inflate chance performance and amplify prompt sensitivity. Fix: switch to the 10-choice variant, log the 10% random-guess baseline, pin choice order/shuffle seed, and report calibrated accuracy alongside raw. See Wang et al., 2024 and the HF dataset card.
Split hygiene (MMLU-CF) — mixing tuning and reporting on a public split (or touching the closed test during dev) invites leakage. Fix: select on the validation split, report via the closed test, record submission date/access path, and log the validation license (CDLA-Permissive-2.0). See the paper, HF dataset, and CDLA text.
Rolling drift (LiveBench) — month-to-month rotations and grader config changes break cross-month comparability. Fix: treat the release month as part of the benchmark ID, pin commit/task subset, record auto-grader/judge config, and compare models within the same month. See the paper, site, and GitHub.
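On the dual-key point in the first item above, here is a minimal sketch of what that report can look like. It assumes a flat list of per-item results with illustrative field names (pred, original_key, corrected_key, error_tag); adapt them to whatever your harness emits.

```python
# Minimal dual-key accuracy sketch. Field names ("pred", "original_key",
# "corrected_key", "error_tag") are illustrative, not an official schema.

def dual_key_report(rows):
    """Accuracy under the original and corrected keys, plus a per-error-tag
    slice to localize where the corrected labels move the score."""
    n = len(rows)
    acc_original = sum(r["pred"] == r["original_key"] for r in rows) / n
    acc_corrected = sum(r["pred"] == r["corrected_key"] for r in rows) / n

    by_tag = {}
    for r in rows:
        tag = r.get("error_tag") or "ok"
        hits, total = by_tag.get(tag, (0, 0))
        by_tag[tag] = (hits + (r["pred"] == r["corrected_key"]), total + 1)

    return {
        "accuracy_original_key": acc_original,
        "accuracy_corrected_key": acc_corrected,
        "delta": acc_corrected - acc_original,
        "accuracy_by_error_tag": {t: h / c for t, (h, c) in by_tag.items()},
    }
```

Reporting both keys (and the delta) makes it obvious when a score shift comes from label corrections rather than from model behavior.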
Evidence
This piece was grounded in four refresh paths for general-knowledge evals: MMLU-Redux (v2.0) for label corrections and error tags, MMLU-Pro for 10-choice reasoning that lowers guess headroom, MMLU-CF for a public-val/closed-test split with decontamination rules, and LiveBench for dated monthly drops with automatic scoring. See full detail below.
MMLU-Redux (v2.0)
Label-corrected re-annotation across all 57 subjects; useful for dual-key reporting and error-type slices.
Sourcing & access: arXiv paper, HF dataset. Access: open.
What it is: Multiple-choice questions with corrected keys and error tags spanning all 57 MMLU subjects.
Size & splits: ~5.7k re-annotated items; v2.0 covers all subjects (earlier Redux subsets were smaller).
Provenance: Re-annotations of cais/mmlu following a documented error protocol.
License & use rights: CC-BY-4.0 (redistribution with attribution; no purpose limits).
Safety & compliance: Notes on ambiguous/problematic items; issues are tracked on the HF dataset card.
Contamination risk: Same subject space as MMLU; use dual-key reporting to detect deltas induced by mislabeled items.
Baselines/benchmarks: Paper reports model reorderings under corrected keys.
Operational notes: Pin the dataset DOI and commit hash; log whether scores are original-key vs corrected-key.
What to watch next: NAACL 2025 camera-ready + expanded error taxonomy.
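To make the pinning advice concrete: `datasets.load_dataset` accepts a `revision` argument, so the exact commit can be frozen and written into the run record. A minimal sketch, with the repo ID, subject config, split name, and commit hash as placeholders you should take from the dataset card:

```python
from datasets import load_dataset

# Placeholders: take the repo ID, config name, split, and commit hash from the
# MMLU-Redux 2.0 dataset card rather than trusting these values.
REPO = "edinburgh-dawg/mmlu-redux-2.0"
REVISION = "<commit-sha-from-the-dataset-card>"

redux = load_dataset(REPO, "anatomy", split="test", revision=REVISION)

run_record = {
    "benchmark_id": f"mmlu-redux-2.0@{REVISION}",
    "answer_key": "corrected",  # log "original" vs. "corrected" explicitly
}
```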
MMLU-Pro (2024-06-03)
Ten-choice, harder questions to cut chance performance and reduce prompt sensitivity; good for calibrated-accuracy reporting.
Sourcing & access: arXiv paper, HF dataset, GitHub. Access: open.
What it is: A 12k-item, 10-choice MCQ suite emphasizing reasoning.
Size & splits: ~12,102 test items; small validation slice.
Provenance: Curated from original MMLU plus additional STEM- and reasoning-focused sources (see the paper for the full list).
License & use rights: MIT (redistribution permitted with notice).
Safety & compliance: General-knowledge items; standard personal-data risk; no data about minors.
Contamination risk: Some stems may overlap with public sources and pretraining corpora; pin choice order/shuffle seeds.
Baselines/benchmarks: Paper shows 16–33% drops vs. MMLU and lower prompt sensitivity.
Operational notes: Record random-guess baseline (10%), calibrated accuracy, and exact commit/DOI.
What to watch next: Variants (e.g., ProX) and per-domain harder subsets.
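One way to make the guess-floor point legible in a scorecard is a chance-corrected accuracy reported alongside the raw number. This is a common normalization, not a metric the MMLU-Pro paper defines; a sketch with illustrative scores:

```python
def chance_corrected(accuracy: float, n_choices: int) -> float:
    """Rescale raw accuracy so 0.0 = random guessing and 1.0 = perfect.
    A common convention for comparing MCQ suites with different choice
    counts; not an official MMLU-Pro metric."""
    floor = 1.0 / n_choices
    return (accuracy - floor) / (1.0 - floor)

# Illustrative: the same raw score uses more headroom on a 10-choice suite
# than on a 4-choice one.
print(chance_corrected(0.55, n_choices=4))   # ~0.40 above chance (classic MMLU)
print(chance_corrected(0.55, n_choices=10))  # ~0.50 above chance (MMLU-Pro)
```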
MMLU-CF (2024-12-19)
Contamination-limited benchmark with a public validation set and closed test; suitable for clean model selection + sealed reporting.
Sourcing & access: arXiv paper, validation on Hugging Face; test via GitHub. Access: val open / test closed.
What it is: 20k MCQs across 14 fields; decontamination rules; public val mirrors closed-test difficulty.
Size & splits: 10k val, 10k test; dev slice for quick checks.
Provenance: Screened from >200B public docs; multi-stage cleaning/LLM checks.
License & use rights: CDLA-Permissive-2.0 on validation (redistribution with attribution; patent peace for data). Test set is closed; redistribution prohibited.
Safety & compliance: Public-web sourcing; provider claims safety screens; treat as containing mixed personal data.
Contamination risk: Val is public; avoid tuning on leaked items; report val-selected/test-reported with submission date.
Baselines/benchmarks: Paper reports GPT-4o ~72% (0–5-shot) on test to show headroom.
Operational notes: Log access path, submission timestamp, and val license; archive the license text with your run.
What to watch next: Additional subjects and periodic refresh of the closed test.
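A sketch of the record worth archiving alongside an MMLU-CF result; the field names and file paths below are my own convention, not anything the benchmark requires:

```python
import datetime
import json

# Illustrative run record for a val-selected / test-reported MMLU-CF result.
run_record = {
    "benchmark": "MMLU-CF",
    "selection_split": "validation",        # public split used for model selection
    "reporting_split": "test (closed)",     # sealed split used for the headline number
    "access_path": "closed-test submission via the project GitHub",
    "submitted_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "validation_license": "CDLA-Permissive-2.0",
    "license_text_archived": "licenses/CDLA-Permissive-2.0.txt",  # keep a local copy
}

with open("mmlu_cf_run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
```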
Contamination, personal data, and redistribution
Public QA invites overlap with pretraining. For the MMLU family, assume non-zero contamination unless you adopt a gate (CF-style test) or a rolling “newsy” mix (LiveBench). If you run internal judge swaps or time-shifted subsets, report those as freshness controls alongside headline accuracy—see my recent article on Judge swaps, drifting agent evals.
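As a sketch of such a freshness control: score the same run on a time-shifted slice and publish it next to the headline number. Field names ("correct", "released") and the cutoff date are illustrative, not a standard.

```python
from datetime import date

# Illustrative freshness control: accuracy on items released after a cutoff,
# reported next to the headline accuracy. Assumes each result row carries a
# boolean "correct" and a datetime.date "released".
def freshness_report(results, cutoff=date(2024, 6, 1)):
    def acc(rows):
        return sum(r["correct"] for r in rows) / len(rows) if rows else float("nan")

    recent = [r for r in results if r["released"] >= cutoff]
    return {
        "headline_accuracy": acc(results),
        "post_cutoff_accuracy": acc(recent),
        "post_cutoff_items": len(recent),
        "cutoff": cutoff.isoformat(),
    }
```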
For licensing, prefer data licenses with explicit rights for results (CDLA-Permissive-2.0) over software-only licenses when redistributing dataset derivatives; for CC-BY-4.0 sets, ensure downstream artifacts retain attribution and license links.
For tooling and harnesses, document the evaluator (e.g., OpenCompass and its LLM-as-judge guidance) so buyers can reproduce scoring.
Operational guidance
A lightweight checklist for teams that procure/evaluate/ship:
Benchmark ID: Record dataset name and version/DOI/commit + release date (e.g., LiveBench Apr 7 2025 reasoning task from HF).
License discipline: Log canonical license names (e.g., “CC-BY-4.0”, “CDLA-Permissive-2.0”). Link to the legal text (SPDX list; CDLA-Permissive-2.0; MIT; Apache-2.0).
Freshness controls: Run at least one time-shifted slice or judge swap; include both in the report.
Prompt & decoding: Store exact prompt templates, sampling params, and any CoT use; publish random-guess baselines (10-way for MMLU-Pro); see the provenance sketch after this checklist.
Gatekeeping: Separate selection (public val) from reporting (closed test) on MMLU-CF; keep access logs.
Contamination logging: Record URL-level hits if you reconstruct items; keep allow-list policies and dedup stats if you build variants.
Tooling provenance: Pin evaluator versions (e.g., OpenCompass docs) and judge configs.
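Several of these items (prompt & decoding, baselines, tooling provenance) fit naturally into one provenance block stored with each run. A sketch, with package names, the prompt template, and decoding parameters as placeholders:

```python
import hashlib
from importlib.metadata import version, PackageNotFoundError

# Placeholders: swap in your actual prompt template, decoding config, and the
# packages of the harness you ran.
PROMPT_TEMPLATE = "Question: {question}\nChoices: {choices}\nAnswer:"
DECODING = {"temperature": 0.0, "max_new_tokens": 32, "seed": 1234}

def pkg_version(name: str) -> str:
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"

provenance = {
    "evaluator_versions": {p: pkg_version(p) for p in ("datasets", "transformers")},
    "prompt_sha256": hashlib.sha256(PROMPT_TEMPLATE.encode()).hexdigest(),
    "prompt_template": PROMPT_TEMPLATE,   # store the exact text, not just the hash
    "decoding": DECODING,
    "random_guess_baseline": 1 / 10,      # 10-way for MMLU-Pro
}
```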
What this means for teams
General-knowledge evals are moving from single numbers to paired readings. Expect buyers to ask for one open/public figure (MMLU-Redux or MMLU-Pro) and one gated/rolling figure (MMLU-CF test or LiveBench by month). Closed-test governance with public validation will become the norm for headline claims; rolling sets add recency but break cross-month comparability. Plan for calibration ladders—dated anchors that make score shifts legible—rather than a frozen leaderboard.
Open problem: A shared, dated calibration ladder for general knowledge (specifying sampling window, leakage policy, and scoring harness) so teams can compare across months and between open vs. closed tests without guesswork.
Actionable takeaway: In your next model card, report two lines per benchmark:
Open/public (Redux or Pro) with a freshness control, and
Gated/rolling (CF test or LiveBench-‹month›) with version/date and access path.
Pin evaluator versions, licenses, and prompts; keep URL-level artifacts and, for gated sets, store hashes/IDs rather than raw items.
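For the "hashes/IDs rather than raw items" point, a minimal fingerprinting sketch; the canonicalization scheme is my own, not something the closed-test operators prescribe:

```python
import hashlib

# Store fingerprints of gated/closed-test items instead of the items themselves,
# so a run's coverage can be audited without redistributing sealed content.
def item_fingerprint(question: str, choices: list[str]) -> str:
    canonical = question.strip() + "\x1f" + "\x1f".join(c.strip() for c in choices)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Example (illustrative strings only):
seen_ids = [item_fingerprint("Which benchmark ...?", ["A ...", "B ...", "C ...", "D ..."])]
```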
Related releases
LiveBench — rolling, contamination-limited tasks with objective auto-grading; treat the release month as part of the benchmark ID. arXiv · site · GitHub.
MMLU-ProX (multilingual) — Pro-style difficulty extended to 13 languages; surfaces cross-lingual degradation and cultural variance. arXiv · project page.
Benchmarks & evals
MMLU-Pro leaderboard — public Space for side-by-side results; handy for sanity-checking your harness settings. Hugging Face.
LiveBench leaderboard — dated drops + objective scoring; compare models within the same month to avoid rotation drift. site · arXiv paper.
OpenCompass — end-to-end evaluation stack with task partitioning, judge options, and MMLU-family support; pin version/config. docs · GitHub.
Lighteval — HF’s evaluation toolkit with task registry and sample-level artifacts; good for storing per-item results and commits. docs · GitHub.
EleutherAI lm-evaluation-harness — the long-running baseline harness many papers still report; useful for cross-checking prompts/metrics. GitHub.
Requests for datasets
We (Brickroad) pay a commission on the first closed deal when a referral leads to a purchase or placement. Details and terms → dataset.news/pitch.
Fantasy & Sci-Fi Imagery (CGI-enhanced, 4K+)
Specs: Fully rendered stills of space battles, alien worlds, mythical creatures, futuristic cityscapes, magical realms, dystopian landscapes, and character-driven action scenes. 4K+ resolution preferred; consistent aesthetic; annotations for scene description, character presence, lighting cues, etc. Sources may include 3D modeling, photorealistic rendering (Unreal/Blender/Maya), or high-end compositing. AI-generated images considered with strong fidelity and clear rights.
Rights: Licensor must own or control all rights; commercial ML training/redistribution permitted.
Target price: $0.10 per image
Use cases: image-to-image, text-to-image
Uncompressed Raw Video from Professional Cinema Cameras
Specs: Original camera files (uncompressed or lightly compressed RAW/ProRes) from RED, Blackmagic, ARRI Alexa, Sony Venice, Canon C-series, etc. Include metadata (resolution, frame rate, bit depth, lens, shooting conditions). Diverse environments (indoor/outdoor, low-light, high-motion); mix of handheld, gimbal, drone, tripod. Preference for consistent framing, color accuracy, and sensor-level detail. Minimum volume: ≥5 TB.
Rights: Submitters must confirm data ownership and grant commercial ML training rights; publisher indemnity preferred.
Target price: $400–$1,000 per TB (resolution, diversity, and metadata quality dependent)
Use cases: fine-tuning generative video models; post-production automation
Drone Videos of Cities (NA & EU)
Specs: High-quality drone footage of North American and European cities. Include in-camera and on-drone metadata: GPS, accelerometer, gimbal data. Variety of routes, altitudes, and lighting conditions encouraged.
Rights: Licensed for commercial AI/ML training; submitters attest to ownership/control of rights.
Target price: $5.00 per minute
Use cases: object detection, navigation, scene understanding
We don’t feature or broker datasets with unclear rights or unsafe personal data. Always confirm licenses & provenance before commercial use.
If you need an NDA or have questions, email me at michael@brickroadapp.com.
What did this piece get right—or miss—about leakage and benchmark hygiene? Have you experienced contamination yourself or had a versioning policy that stuck? Leave a comment with one thing that broke and one control that worked.
If you buy data, the Brickroad marketplace is open: filter, preview, and license only the slices you need with standardized terms. Try the marketplace today.
If you supply data and want to list inventory, upload directly or email michael@brickroadapp.com.
— Michael Gordon, cofounder & CTO at Brickroad