Dataset News

Refreshing general-knowledge benchmarks without leakage

Michael Gordon — Thu, 21 Aug 2025 07:14:57 GMT

When results depend on recollection — film frame from Eternal Sunshine of the Spotless Mind (2004). © Focus Features. Used for commentary.

General-knowledge benchmarks like MMLU (Massive Multitask Language Understanding—Hendrycks et al., 2020; HF dataset card) gave us a shared yardstick for LLMs, but as models absorb more of the internet, their usefulness erodes, and the tests along with them. Recent re-releases and companions—MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), MMLU-CF (Zhao et al., 2024), and LiveBench (White et al., 2024)—aim to restore headroom and reduce contamination.

In this issue we’re going to look at what’s actually changed, how to use these sets without leaking your own evals, and what to log so your scores mean something to buyers, regulators, and your future self.

Why this matters

Almost universally, benchmarks have been used to drive enterprise procurement and launch adoption. When a test leaks, you get inflated, brittle scores and poor cross-lab comparability—The Emperor’s New Clothes in Benchmarking? (ICML 2025) shows in controlled “mild” and “intensive” contamination scenarios (10 LLMs, 5 benchmarks) that benchmark data contamination leads to falsely inflated performance estimates, and that common “update the test” mitigations don’t reliably restore contamination resistance.

The last year saw three distinct approaches to “refreshing” MMLU-style evals:

MMLU-Redux: Re-annotation to correct labels and item quality.
MMLU-Pro: Hardening via trickier items and more answer choices.
MMLU-CF: Gating & split discipline with a closed test and public validation.

Gotchas and fixes

Four common failures and practical controls to ship with your scorecard:

Label errors (MMLU-Redux) — legacy MMLU contains mislabeled/ambiguous items that create artificial deltas. Fix: pin the corrected set (v2.0 DOI) and report dual-key accuracy (original vs. corrected); slice by error tag to localize regressions. See Gema et al., 2024 and the HF dataset card.
Guessing headroom (MMLU-Pro) — 4-choice MCQs inflate chance performance and amplify prompt sensitivity. Fix: switch to the 10-choice variant, log the 10% random-guess baseline, pin choice order/shuffle seed, and report calibrated accuracy alongside raw. See Wang et al., 2024 and the HF dataset card.
Split hygiene (MMLU-CF) — mixing tuning and reporting on a public split (or touching the closed test during dev) invites leakage. Fix: select on the validation split, report via the closed test, record submission date/access path, and log the validation license (CDLA-Permissive-2.0). See the paper, HF dataset, and CDLA text.
Rolling drift (LiveBench) — month-to-month rotations and grader config changes break cross-month comparability. Fix: treat the release month as part of the benchmark ID, pin commit/task subset, record auto-grader/judge config, and compare models within the same month. See the paper, site, and GitHub.

Evidence

This piece was grounded in four refresh paths for general-knowledge evals: MMLU-Redux (v2.0) for label corrections and error tags, MMLU-Pro for 10-choice reasoning that lowers guess headroom, MMLU-CF for a public-val/closed-test split with decontamination rules, and LiveBench for dated monthly drops with automatic scoring. See full detail below.

MMLU-Redux (v2.0)

Label-corrected re-annotation across all 57 subjects; useful for dual-key reporting and error-type slices.

Sourcing & access: arXiv paper, HF dataset. Access: open.
What it is: Multiple-choice questions with corrected keys and error tags spanning all 57 MMLU subjects.
Size & splits: ~5.7k re-annotated items; v2.0 covers all subjects (earlier Redux subsets were smaller).
Provenance: Re-annotations of cais/mmlu following a documented error protocol.
License & use rights: CC-BY-4.0 (redistribution with attribution; no purpose limits).
Safety & compliance: Notes on ambiguous/problematic items; tracked by HF.
Contamination risk: Same subject space as MMLU; use dual-key reporting to detect mislabeled-induced deltas.
Baselines/benchmarks: Paper reports model reorderings under corrected keys.
Operational notes: Pin the dataset DOI and commit hash; log whether scores are original-key vs corrected-key.
What to watch next: NAACL 2025 camera-ready + expanded error taxonomy. Hugging Face

MMLU-Pro (2024-06-03)

Ten-choice, harder questions to cut chance performance and reduce prompt sensitivity; good for calibrated-accuracy reporting.

Sourcing & access: arXiv paper, HF dataset, GitHub. Access: open.
What it is: A 12k-item, 10-choice MCQ suite emphasizing reasoning.
Size & splits: ~12,102 test items; small validation slice.
Provenance: Curated from multiple sources.
License & use rights: MIT (redistribution permitted with notice).
Safety & compliance: General-knowledge items; standard PD risk; no minors data.
Contamination risk: Some overlapable stems; pin choice order/shuffle seeds.
Baselines/benchmarks: Paper shows 16–33% drops vs. MMLU and lower prompt sensitivity.
Operational notes: Record random-guess baseline (10%), calibrated accuracy, and exact commit/DOI.
What to watch next: Variants (e.g., ProX) and per-domain harder subsets.

MMLU-CF (2024-12-19)

Contamination-limited benchmark with a public validation set and closed test; suitable for clean model selection + sealed reporting.

Sourcing & access: arXiv paper, validation on Hugging Face; test via GitHub. Access: val open / test closed.
What it is: 20k MCQs across 14 fields; decontamination rules; public val mirrors closed-test difficulty.
Size & splits: 10k val, 10k test; dev slice for quick checks.
Provenance: Screened from >200B public docs; multi-stage cleaning/LLM checks.
License & use rights: CDLA-Permissive-2.0 on validation (redistribution with attribution; patent peace for data). Test set is closed; redistribution prohibited.
Safety & compliance: Public-web sourcing; provider claims safety screens; treat as mixed PD.
Contamination risk: Val is public; avoid tuning on leaked items; report val-selected/test-reported with submission date.
Baselines/benchmarks: Paper reports GPT-4o ~72% (0–5-shot) on test to show headroom.
Operational notes: Log access path, submission timestamp, and val license; archive the license text with your run.
What to watch next: Additional subjects and periodic refresh of the closed test.

Contamination, personal data, and redistribution

Public QA invites overlap with pretraining. For the MMLU family, assume non-zero contamination unless you adopt a gate (CF-style test) or a rolling “newsy” mix (LiveBench). If you run internal judge swaps or time-shifted subsets, report those as freshness controls alongside headline accuracy—see my recent article on Judge swaps, drifting agent evals.

For licensing, prefer data licenses with explicit rights for results (CDLA-Permissive-2.0) over software-only licenses when redistributing dataset derivatives; for CC-BY-4.0 sets, ensure downstream artifacts retain attribution and license links.

For tooling and harnesses, document the evaluator (e.g., OpenCompass and its LLM-as-judge guidance) so buyers can reproduce scoring.

Operational guidance

A lightweight checklist for teams that procure/evaluate/ship:

Benchmark ID: Record dataset name and version/DOI/commit + release date (e.g., LiveBench Apr 7 2025 reasoning task from HF).
License discipline: Log canonical license names (e.g., “CC-BY-4.0”, “CDLA-Permissive-2.0”). Link to the legal text (SPDX list; CDLA-Permissive-2.0; MIT; Apache-2.0).
Freshness controls: Run at least one time-shifted slice or judge swap; include both in the report.
Prompt & decoding: Store exact prompt templates, sampling params, and any CoT use; publish random-guess baselines (10-way for MMLU-Pro).
Gatekeeping: Separate selection (public val) from reporting (closed test) on MMLU-CF; keep access logs.
Contamination logging: Record URL-level hits if you reconstruct items; keep allow-list policies and dedup stats if you build variants.
Tooling provenance: Pin evaluator versions (e.g., OpenCompass docs) and judge configs.

What this means for teams

General-knowledge evals are moving from single numbers to paired readings. Expect buyers to ask for one open/public figure (MMLU-Redux or MMLU-Pro) and one gated/rolling figure (MMLU-CF test or LiveBench by month). Closed-test governance with public validation will become the norm for headline claims; rolling sets add recency but break cross-month comparability. Plan for calibration ladders—dated anchors that make score shifts legible—rather than a frozen leaderboard.

Open problem: A shared, dated calibration ladder for general knowledge (specifying sampling window, leakage policy, and scoring harness) so teams can compare across months and between open vs. closed tests without guesswork.

Actionable takeaway: In your next model card, report two lines per benchmark:

Open/public (Redux or Pro) with a freshness control, and
Gated/rolling (CF test or LiveBench-‹month›) with version/date and access path.

Pin evaluator versions, licenses, and prompts; keep URL-level artifacts and, for gated sets, store hashes/IDs rather than raw items.

Related releases

LiveBench — rolling, contamination-limited tasks with objective auto-grading; treat the release month as part of the benchmark ID. arXiv · site · GitHub.

MMLU-ProX (multilingual) — Pro-style difficulty extended to 13 languages; surfaces cross-lingual degradation and cultural variance. arXiv · project page.

Benchmarks & evals

MMLU-Pro leaderboard — public Space for side-by-side results; handy for sanity-checking your harness settings. Hugging Face.

LiveBench leaderboard — dated drops + objective scoring; compare models within the same month to avoid rotation drift. site, arXiv paper

OpenCompass — end-to-end evaluation stack with task partitioning, judge options, and MMLU-family support; pin version/config. docs · GitHub.

Lighteval — HF’s evaluation toolkit with task registry and sample-level artifacts; good for storing per-item results and commits. docs · GitHub.

EleutherAI lm-evaluation-harness — the long-running baseline harness many papers still report; useful for cross-checking prompts/metrics. GitHub.

Requests for datasets

We (Brickroad) pay a commission on the first closed deal when a referral leads to a purchase or placement. Details and terms → dataset.news/pitch.

Fantasy & Sci-Fi Imagery (CGI-enhanced, 4K+)
Specs: Fully rendered stills of space battles, alien worlds, mythical creatures, futuristic cityscapes, magical realms, dystopian landscapes, and character-driven action scenes. 4K+ resolution preferred; consistent aesthetic; annotations for scene description, character presence, lighting cues, etc. Sources may include 3D modeling, photorealistic rendering (Unreal/Blender/Maya), or high-end compositing. AI-generated images considered with strong fidelity and clear rights.
Rights: Licensor must own or control all rights; commercial ML training/redistribution permitted.
Target price: $0.10 per image
Use cases: image-to-image, text-to-image
Uncompressed Raw Video from Professional Cinema Cameras
Specs: Original camera files (uncompressed or lightly compressed RAW/ProRes) from RED, Blackmagic, ARRI Alexa, Sony Venice, Canon C-series, etc. Include metadata (resolution, frame rate, bit depth, lens, shooting conditions). Diverse environments (indoor/outdoor, low-light, high-motion); mix of handheld, gimbal, drone, tripod. Preference for consistent framing, color accuracy, and sensor-level detail. Minimum volume: ≥5 TB.
Rights: Submitters must confirm data ownership and grant commercial ML training rights; publisher indemnity preferred.
Target price: $400–$1,000 per TB (resolution, diversity, and metadata quality dependent)
Use cases: fine-tuning generative video models; post-production automation
Drone Videos of Cities (NA & EU)
Specs: High-quality drone footage of North American and European cities. Include in-camera and on-drone metadata: GPS, accelerometer, gimbal data. Variety of routes, altitudes, and lighting conditions encouraged.
Rights: Licensed for commercial AI/ML training; submitters attest to ownership/control of rights.
Target price: $5.00 per minute
Use cases: object detection, navigation, scene understanding

We don’t feature or broker datasets with unclear rights or unsafe personal data. Always confirm licences & provenance before commercial use.

If you need an NDA or have questions, email me at michael@brickroadapp.com.

What did this piece get right—or miss—about leakage and benchmark hygiene? Have you experienced contamination yourself or had a versioning policy that stuck? Leave a comment with one thing that broke and one control that worked.

If you buy data, the Brickroad marketplace is open: filter, preview, and license only the slices you need with standardized terms. Try the marketplace today.

If you supply data and want to list inventory, upload directly or email michael@brickroadapp.com.

— Michael Gordon, cofounder & CTO at Brickroad

Judge swaps, drifting agent evals

Michael Gordon — Thu, 14 Aug 2025 06:34:43 GMT

The Judges—Agentic Evaluators

Agent results have been whipsawing as evaluators—the judges—change under the hood. A single score now hides a system: task design, environment realism, and the judge’s prompt and snapshot. Style sensitivity (markdown, verbosity) and time sensitivity (live web vs. static pages) further nudge rankings. The through-line: evaluation is becoming a portfolio, not a number.

Why this matters

If your procurement or model-gating depends on a score, a silent judge swap can reshuffle the leaderboard and your roadmap with it. Teams are increasingly asked to justify upgrades (or rollbacks) when a new general-purpose model lands. Without pinned judges, style controls, and time-invariant tasks, you risk evaluating judge drift more than model progress.

Gotchas and fixes

Judge drift — a silent judge swap can reorder rankings. Fix: pin judge ID (family, snapshot, temperature) and publish dual-judge deltas; fail closed on unpinned judges. See Arena-Hard-Auto for judge-split reporting.
Style leakage — verbosity/markdown inflate scores without controls. Fix: switch to style-controlled judge prompts; keep a content-only ablation for audits.
Time variance — static pages give stable trendlines; live sites drift and break. Fix: report both; tag item-level time sensitivity and freeze a monthly static slice using WebArXiv.
Scoring blind spots — rule-only checks miss genuine successes; LLM judges vary by domain/prompt. Fix: run a rule+LLM ensemble, calibrate on a human-rated slice, and publish disagreement rates, following AgentRewardBench.

Evidence

I’m grounding this piece in a few anchors: Arena-Hard-Auto for judge-split reporting and style control, AgentRewardBench for how LLM judges agree (or don’t) on web-agent trajectories, and WebChoreArena for long-horizon, reproducible chores; I use WebArXiv as the static check. See full detail below.

Arena-Hard-Auto (v2 preview)

Judge-split reporting and style control to make evaluator effects visible; good for pinning judges and prompt hygiene.

Sourcing & access*: Hugging Face collection, GitHub. Access: open.
What it is: Hard, free-form prompts with automated grading.
Size & splits: v2 preview; separate leaderboards by judge family.
Provenance: Curated “hard” prompts; baselines from named APIs.
License & use rights: Apache-2.0 — redistribution permitted with notice.
Safety & compliance: Style control mitigates verbosity/formatting bias.
Contamination risk: Public-like prompts; pin baseline versions.
Baselines/benchmarks: Dual judge leaderboards.
Operational notes: Publish judge ID with scores.
What to watch next: Judge ensembles; harder safety subsets.

Arena-Hard-Auto (v2 preview)

Judge-split reporting and style control to make evaluator effects visible; good for pinning judges and prompt hygiene.

Sourcing & access*: Hugging Face collection, GitHub. Access: open.
What it is: Hard, free-form prompts with automated grading.
Size & splits: v2 preview; separate leaderboards by judge family.
Provenance: Curated “hard” prompts; baselines from named APIs.
License & use rights: Apache-2.0 — redistribution permitted with notice.
Safety & compliance: Style control mitigates verbosity/formatting bias.
Contamination risk: Public-like prompts; pin baseline versions.
Baselines/benchmarks: Dual judge leaderboards.
Operational notes: Publish judge ID with scores.
What to watch next: Judge ensembles; harder safety subsets.

AgentRewardBench (2025)

Expert-labeled trajectories to quantify judge agreement/disagreement; useful for calibrating a rule+LLM ensemble.

Sourcing & access: arXiv paper, HF dataset, GitHub.
What it is: 1,302 web-agent trajectories across multiple web-agent benchmarks.
Size & splits: ~1.4k labeled items; documented splits on HF.
Provenance: Trajectories sourced from WebArena / VisualWebArena / AssistantBench / WorkArena(++) and reviewed by experts.
License & use rights: No explicit SPDX license on HF; treat as restricted.
Safety & compliance: May contain screenshots of websites/apps.
Contamination risk: Low for pretraining (trajectories), but judge-training leakage is possible—pin versions.
Baselines/benchmarks: Authors compare multiple LLM judges; no single judge dominates across sets.
Operational notes: Parquet on HF; repo includes scoring scripts and submission guidance.
What to watch next: Learned / finetuned judges (e.g., WebJudge), per-domain PRMs.

WebChoreArena (2025)

Long-horizon, reproducible chores that cut live-web drift; anchors a time-invariant trendline.

Sourcing & access: GitHub; leaderboard / configs in-repo.
What it is: 532 tedious, multi-step chore tasks on simulated sites (built on WebArena) stressing memory, calculation, and long-horizon context.
Size & splits: Full set + smaller subset for cost-controlled runs.
Provenance: Extends WebArena/VisualWebArena; reproducible self-hosted sites with reset scripts; leaderboard added August 2025.
License & use rights: Apache-2.0 — redistribution permitted with notice.
Safety & compliance: Simulated content reduces PD exposure vs. live web.
Contamination risk: Low for pretraining; avoid mixing generated traces back into training without URL-level logs.
Baselines/benchmarks: Reported gaps across agents and judge models; model updates can regress.
Operational notes: Runs on BrowserGym/AgentOccam; full-set costs can be high.
What to watch next: Cross-site tasks; ensemble judge reporting.

Contamination, personal data, and redistribution

Live-web agent sets implicate site terms of use and can capture incidental PII in screenshots and logs. Static sets such as WebArXiv reduce this but embedded assets may carry third-party rights. Treat AgentRewardBench as non-redistributable unless clarified; “research-only” or fair-use language is jurisdiction-dependent. Lean towards simulated environments like WebChoreArena (and WebArena variants) when you need clean redistribution, and keep URL-level logs for any real-web captures.

Operational guidance

Pin the judge: record model family/name, snapshot string, and prompt-template hash; publish with scores. See Arena-Hard-Auto for judge-split reporting.
Use style-controlled judging: adopt style-control prompts; keep a content-only ablation for audits.
Report rule + LLM ensemble: pair a task rule with an LLM judge; calibrate on a human-rated slice and publish disagreement rates, per AgentRewardBench.
Blend static and live: include a static anchor (e.g., WebArXiv) and one live suite; tag item-level time sensitivity.
Version and route datasets: treat terms-of-use-gated assets as non-redistributable by default; for Apache include NOTICE and link Apache-2.0.
Add freshness controls: rerun monthly on the same judge and a swapped judge; include a paraphrase or time-shifted slice.
Red-team the judge: vary templates and attack prompts; compare deltas to a fixed, human-rated subset before shipping a score.

What this means for teams

Judge swaps are now part of the evaluation surface. Split reporting (e.g. Arena-Hard-Auto) makes that dependence visible; learned judges and ensembles will likely normalize multi-judge scorecards rather than replace them. Static anchors and reproducible chores provide the longitudinal backbone; plan for eval portfolios, not single numbers.

Open problem: A vendor-neutral judge ID schema (family, snapshot, template, temperature) with signed metadata so scores are portable and auditable across orgs.

Actionable takeaway: Starting now, report (a) dual-judge results, (b) a style-controlled judge, and (c) a static-set trend line. Pin everything (judge + tasks) and ship URL-level logs with your scorecard.