<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Dataset News]]></title><description><![CDATA[Curated, verified updates on datasets, benchmarks, and licensing for researchers and teams who buy, evaluate, and ship with data.]]></description><link>https://www.dataset.news</link><image><url>https://substackcdn.com/image/fetch/$s_!H_R-!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78defe34-0667-4f9c-91a6-d7e0d319e55a_1024x1024.png</url><title>Dataset News</title><link>https://www.dataset.news</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 05:56:15 GMT</lastBuildDate><atom:link href="https://www.dataset.news/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Michael Gordon]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[datasetnews@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[datasetnews@substack.com]]></itunes:email><itunes:name><![CDATA[Michael Gordon]]></itunes:name></itunes:owner><itunes:author><![CDATA[Michael Gordon]]></itunes:author><googleplay:owner><![CDATA[datasetnews@substack.com]]></googleplay:owner><googleplay:email><![CDATA[datasetnews@substack.com]]></googleplay:email><googleplay:author><![CDATA[Michael Gordon]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Refreshing general-knowledge benchmarks without leakage]]></title><description><![CDATA[Static QA tests are showing their age; how to refresh without rewarding recollection.]]></description><link>https://www.dataset.news/p/refreshing-general-knowledge-benchmarks</link><guid isPermaLink="false">https://www.dataset.news/p/refreshing-general-knowledge-benchmarks</guid><dc:creator><![CDATA[Michael Gordon]]></dc:creator><pubDate>Thu, 21 Aug 2025 07:14:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bpWI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a3b8f8-d669-490c-b701-83d713ee0421_1999x1125.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bpWI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a3b8f8-d669-490c-b701-83d713ee0421_1999x1125.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bpWI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a3b8f8-d669-490c-b701-83d713ee0421_1999x1125.jpeg 424w, https://substackcdn.com/image/fetch/$s_!bpWI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a3b8f8-d669-490c-b701-83d713ee0421_1999x1125.jpeg 848w, https://substackcdn.com/image/fetch/$s_!bpWI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a3b8f8-d669-490c-b701-83d713ee0421_1999x1125.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!bpWI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a3b8f8-d669-490c-b701-83d713ee0421_1999x1125.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bpWI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a3b8f8-d669-490c-b701-83d713ee0421_1999x1125.jpeg" width="1456" height="819" class="sizing-normal" alt="" sizes="100vw" fetchpriority="high"></picture></div></a>
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">When results depend on recollection &#8212; film frame from <em>Eternal Sunshine of the Spotless Mind</em> (2004). &#169; Focus Features. Used for commentary.</figcaption></figure></div><p>General-knowledge benchmarks like MMLU (<em>Massive Multitask Language Understanding&#8212;<a href="https://arxiv.org/abs/2009.03300">Hendrycks et al., 2020</a>; <a href="https://huggingface.co/datasets/cais/mmlu">HF dataset card</a></em>) gave us a shared yardstick for LLMs, but as models absorb more of the internet, their usefulness erodes, and the tests along with them. Recent re-releases and companions&#8212;MMLU-Redux (<a href="https://arxiv.org/abs/2406.04127">Gema et al., 2024</a>), MMLU-Pro (<a href="https://arxiv.org/abs/2406.01574">Wang et al., 2024</a>), MMLU-CF (Zhao et al., 2024), and LiveBench (<a href="https://arxiv.org/abs/2406.19314">White et al., 2024</a>)&#8212;aim to restore headroom and reduce contamination. </p><p>In this issue we&#8217;re going to look at what&#8217;s actually changed, how to use these sets without leaking your own evals, and what to log so your scores mean something to buyers, regulators, and your future self.</p><h3>Why this matters</h3><p>Almost universally, benchmarks have been used to drive enterprise procurement and launch adoption. When a test leaks, you get inflated, brittle scores and poor cross-lab comparability&#8212;<em><a href="https://arxiv.org/abs/2503.16402">The Emperor&#8217;s New Clothes in Benchmarking?</a></em> <a href="https://icml.cc/virtual/2025/poster/45153">(ICML 2025)</a> shows in controlled &#8220;mild&#8221; and &#8220;intensive&#8221; contamination scenarios (10 LLMs, 5 benchmarks) that benchmark data contamination leads to falsely inflated performance estimates, and that common &#8220;update the test&#8221; mitigations don&#8217;t reliably restore contamination resistance.</p><p>The last year saw three distinct approaches to &#8220;refreshing&#8221; MMLU-style evals:</p><ul><li><p><strong>MMLU-Redux</strong>:<em> </em>Re-annotation to correct labels and item quality.</p></li><li><p><strong>MMLU-Pro:</strong><em> </em>Hardening via trickier items and more answer choices.</p></li><li><p><strong>MMLU-CF:</strong> Gating &amp; split discipline with a closed test and public validation.</p></li></ul><h4>Gotchas and fixes</h4><p>Four common failures and practical controls to ship with your scorecard:</p><ul><li><p><strong>Label errors (MMLU-Redux)</strong> &#8212; legacy MMLU contains mislabeled/ambiguous items that create artificial deltas. <strong>Fix:</strong> pin the corrected set (v2.0 DOI) and report dual-key accuracy (original vs. corrected); slice by error tag to localize regressions. See <a href="https://arxiv.org/abs/2406.04127">Gema et al., 2024</a> and the <a href="https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0">HF dataset card</a>.</p></li><li><p><strong>Guessing headroom (MMLU-Pro)</strong> &#8212; 4-choice MCQs inflate chance performance and amplify prompt sensitivity. <strong>Fix:</strong> switch to the 10-choice variant, log the 10% random-guess baseline, pin choice order/shuffle seed, and report calibrated accuracy alongside raw. 
<h3>Evidence</h3><p>This piece is grounded in four refresh paths for general-knowledge evals: MMLU-Redux (v2.0) for label corrections and error tags, MMLU-Pro for 10-choice reasoning that lowers guess headroom, MMLU-CF for a <em>public-val/closed-test</em> split with decontamination rules, and <a href="https://arxiv.org/abs/2406.19314">LiveBench</a> for <em>dated</em> monthly drops with automatic scoring. See full detail below.</p><h4><strong>MMLU-Redux (v2.0)</strong></h4><p><em>Label-corrected re-annotation across all 57 subjects; useful for dual-key reporting and error-type slices.</em></p><ul><li><p><strong>Sourcing &amp; access</strong>: <a href="https://arxiv.org/abs/2406.04127">arXiv paper</a>, <a href="https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0">HF dataset</a>. Access: open.</p></li><li><p><strong>What it is</strong>: Multiple-choice questions with corrected keys and error tags spanning all 57 MMLU subjects.</p></li><li><p><strong>Size &amp; splits</strong>: ~5.7k re-annotated items; v2.0 covers all subjects (earlier Redux subsets were smaller).</p></li><li><p><strong>Provenance</strong>: Re-annotations of cais/mmlu following a documented error protocol.</p></li><li><p><strong>License &amp; use rights</strong>: <a href="https://creativecommons.org/licenses/by/4.0/legalcode.en">CC-BY-4.0</a> (redistribution with attribution; no purpose limits).</p></li><li><p><strong>Safety &amp; compliance</strong>: Notes on ambiguous/problematic items; tracked by HF.</p></li><li><p><strong>Contamination risk</strong>: Same subject space as MMLU; use dual-key reporting to detect mislabel-induced deltas.</p></li><li><p><strong>Baselines/benchmarks</strong>: Paper reports model reorderings under corrected keys.</p></li><li><p><strong>Operational notes</strong>: Pin the dataset DOI and commit hash; log whether scores are <em>original-key</em> vs <em>corrected-key</em>.</p></li><li><p><strong>What to watch next</strong>: NAACL 2025 camera-ready + expanded error taxonomy. <a href="https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux">Hugging Face</a></p></li></ul><h4><strong>MMLU-Pro (2024-06-03)</strong></h4><p><em>Ten-choice, harder questions to cut chance performance and reduce prompt sensitivity; good for calibrated-accuracy reporting.</em></p><ul><li><p><strong>Sourcing &amp; access</strong>: <a href="https://arxiv.org/abs/2406.01574">arXiv paper</a>, <a href="https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro">HF dataset</a>, <a href="https://github.com/TIGER-AI-Lab/MMLU-Pro">GitHub</a>. Access: open.</p></li><li><p><strong>What it is</strong>: A 12k-item, 10-choice MCQ suite emphasizing reasoning.</p></li><li><p><strong>Size &amp; splits</strong>: ~12,102 test items; small validation slice.</p></li><li><p><strong>Provenance</strong>: Curated from multiple sources.</p></li><li><p><strong>License &amp; use rights</strong>: MIT (redistribution permitted with notice).</p></li><li><p><strong>Safety &amp; compliance</strong>: General-knowledge items; standard PD risk; no data about minors.</p></li><li><p><strong>Contamination risk</strong>: Some question stems may overlap public sources; pin choice order/shuffle seeds.</p></li><li><p><strong>Baselines/benchmarks</strong>: Paper shows 16&#8211;33% drops vs. MMLU and lower prompt sensitivity.</p></li><li><p><strong>Operational notes</strong>: Record the random-guess baseline (10%), calibrated accuracy, and exact commit/DOI; see the sketch after this list.</p></li><li><p><strong>What to watch next</strong>: Variants (e.g., ProX) and per-domain harder subsets.</p></li></ul>
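<p>One hedged reading of &#8220;calibrated accuracy alongside raw&#8221; is a simple chance correction: rescale so that random guessing on a k-way MCQ maps to 0 and a perfect run maps to 1. The MMLU-Pro paper&#8217;s own calibration may differ, so label which variant you report:</p><pre><code># Sketch: chance-corrected score for k-way multiple choice.
def chance_corrected(raw_acc, k=10):
    guess = 1.0 / k                       # 10% floor for MMLU-Pro
    return (raw_acc - guess) / (1.0 - guess)

# Example: 0.46 raw on MMLU-Pro maps to 0.40 chance-corrected.
print(round(chance_corrected(0.46, k=10), 2))
</code></pre>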
<a href="https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux?utm_source=chatgpt.com">Hugging Face</a></p></li></ul><h4><strong>MMLU-Pro (2024-06-03)</strong></h4><p><em>Ten-choice, harder questions to cut chance performance and reduce prompt sensitivity; good for calibrated-accuracy reporting.</em> </p><ul><li><p><strong>Sourcing &amp; access</strong>: <a href="https://arxiv.org/abs/2406.01574">arXiv paper</a>, <a href="https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro">HF dataset</a>, <a href="https://github.com/TIGER-AI-Lab/MMLU-Pro">GitHub</a>. Access: open. </p></li><li><p><strong>What it is</strong>: A 12k-item, 10-choice MCQ suite emphasizing reasoning. </p></li><li><p><strong>Size &amp; splits</strong>: ~12,102 test items; small validation slice.</p></li><li><p><strong>Provenance</strong>: Curated from multiple sources.</p></li><li><p><strong>License &amp; use rights</strong>: MIT (redistribution permitted with notice).</p></li><li><p><strong>Safety &amp; compliance</strong>: General-knowledge items; standard PD risk; no minors data.</p></li><li><p><strong>Contamination risk</strong>: Some overlapable stems; pin choice order/shuffle seeds.</p></li><li><p><strong>Baselines/benchmarks</strong>: Paper shows 16&#8211;33% drops vs. MMLU and lower prompt sensitivity.</p></li><li><p><strong>Operational notes</strong>: Record random-guess baseline (10%), calibrated accuracy, and exact commit/DOI. </p></li><li><p><strong>What to watch next</strong>: Variants (e.g., ProX) and per-domain harder subsets. </p></li></ul><h4><strong>MMLU-CF (2024-12-19)</strong></h4><p><em>Contamination-limited benchmark with a public validation set and closed test; suitable for clean model selection + sealed reporting.</em></p><ul><li><p><strong>Sourcing &amp; access</strong>: <a href="https://arxiv.org/abs/2412.15194">arXiv paper</a>, validation on <a href="https://huggingface.co/datasets/microsoft/MMLU-CF">Hugging Face</a>; test via <a href="https://github.com/microsoft/MMLU-CF">GitHub</a>. Access: val open / test closed.</p></li><li><p><strong>What it is</strong>: 20k MCQs across 14 fields; decontamination rules; public val mirrors closed-test difficulty. </p></li><li><p><strong>Size &amp; splits</strong>: 10k val, 10k test; dev slice for quick checks. </p></li><li><p><strong>Provenance</strong>: Screened from &gt;200B public docs; multi-stage cleaning/LLM checks. </p></li><li><p><strong>License &amp; use rights</strong>: <a href="https://cdla.dev/permissive-2-0/">CDLA-Permissive-2.0</a> on validation (redistribution with attribution; patent peace for data). Test set is closed; redistribution prohibited. </p></li><li><p><strong>Safety &amp; compliance</strong>: Public-web sourcing; provider claims safety screens; treat as mixed PD.</p></li><li><p><strong>Contamination risk</strong>: Val is public; avoid tuning on leaked items; report <em>val-selected/test-reported</em> with submission date.</p></li><li><p><strong>Baselines/benchmarks</strong>: Paper reports GPT-4o ~72% (0&#8211;5-shot) on test to show headroom.</p></li><li><p><strong>Operational notes</strong>: Log access path, submission timestamp, and val license; archive the license text with your run. </p></li><li><p><strong>What to watch next</strong>: Additional subjects and periodic refresh of the closed test. </p></li></ul><h3>Contamination, personal data, and redistribution</h3><p>Public QA invites overlap with pretraining. 
<h3>Contamination, personal data, and redistribution</h3><p>Public QA invites overlap with pretraining. For the MMLU family, assume non-zero contamination unless you adopt a gate (CF-style test) or a rolling &#8220;newsy&#8221; mix (LiveBench). If you run internal judge swaps or time-shifted subsets, report those as freshness controls alongside headline accuracy&#8212;see my recent article on <a href="https://www.dataset.news/p/judge-swaps-drifting-agent-evals">Judge swaps, drifting agent evals</a>.</p><p>For licensing, prefer data licenses with explicit rights for results (<a href="https://cdla.dev/permissive-2-0/">CDLA-Permissive-2.0</a>) over software-only licenses when redistributing dataset derivatives; for <a href="https://creativecommons.org/licenses/by/4.0/legalcode.en">CC-BY-4.0</a> sets, ensure downstream artifacts retain attribution and license links.</p><p>For tooling and harnesses, document the evaluator (e.g., <a href="https://github.com/open-compass/opencompass">OpenCompass</a> and its LLM-as-judge guidance) so buyers can reproduce scoring.</p><h3>Operational guidance</h3><p>A lightweight checklist for teams that procure/evaluate/ship:</p><ol><li><p><strong>Benchmark ID:</strong> Record dataset name <strong>and</strong> version/DOI/commit + <strong>release date</strong> (e.g., the LiveBench Apr 7, 2025 reasoning task from <a href="https://huggingface.co/datasets/livebench/reasoning">HF</a>); see the sketch after this list.</p></li><li><p><strong>License discipline:</strong> Log canonical license names (e.g., &#8220;CC-BY-4.0&#8221;, &#8220;CDLA-Permissive-2.0&#8221;). Link to the legal text (SPDX list; <a href="https://cdla.dev/permissive-2-0/">CDLA-Permissive-2.0</a>; MIT; <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0</a>).</p></li><li><p><strong>Freshness controls:</strong> Run at least one time-shifted slice or judge swap; include both in the report.</p></li><li><p><strong>Prompt &amp; decoding:</strong> Store exact prompt templates, sampling params, and any CoT use; publish random-guess baselines (10-way for MMLU-Pro).</p></li><li><p><strong>Gatekeeping:</strong> Separate <strong>selection</strong> (public val) from <strong>reporting</strong> (closed test) on MMLU-CF; keep access logs.</p></li><li><p><strong>Contamination logging:</strong> Record URL-level hits if you reconstruct items; keep allow-list policies and dedup stats if you build variants.</p></li><li><p><strong>Tooling provenance:</strong> Pin evaluator versions (e.g., OpenCompass docs) and judge configs.</p></li></ol>
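<p>One way to make the benchmark ID concrete is a frozen record that travels with every score. The field names are our own convention (hedged), but each maps to a checklist item above; the example values are illustrative:</p><pre><code># Sketch: a pinned benchmark identity, published next to each score.
import dataclasses
import hashlib
import json

@dataclasses.dataclass(frozen=True)
class BenchmarkID:
    name: str             # e.g. "LiveBench"
    version: str          # release date, DOI, or git commit
    license_spdx: str     # canonical license name
    prompt_template: str  # exact template used at eval time

    def fingerprint(self):
        blob = json.dumps(dataclasses.asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:16]

bid = BenchmarkID("LiveBench", "2025-04-07", "Apache-2.0", "zero-shot-v1")
print(bid.fingerprint())  # publish this alongside the score
</code></pre>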
<h3>What this means for teams</h3><p>General-knowledge evals are moving from single numbers to paired readings. Expect buyers to ask for one open/public figure (MMLU-Redux or MMLU-Pro) and one gated/rolling figure (MMLU-CF test or LiveBench by month). Closed-test governance with public validation will become the norm for headline claims; rolling sets add recency but break cross-month comparability. Plan for calibration ladders&#8212;dated anchors that make score shifts legible&#8212;rather than a frozen leaderboard.</p><p><strong>Open problem:</strong> A shared, dated calibration ladder for general knowledge (specifying sampling window, leakage policy, and scoring harness) so teams can compare across months and between open vs. closed tests without guesswork.</p><p><strong>Actionable takeaway:</strong> In your next model card, report two lines per benchmark:</p><ol><li><p>Open/public (Redux or Pro) with a freshness control, and</p></li><li><p>Gated/rolling (CF test or LiveBench-&#8249;month&#8250;) with version/date and access path.</p></li></ol><p>Pin evaluator versions, licenses, and prompts; keep URL-level artifacts and, for gated sets, store hashes/IDs rather than raw items.</p><div><hr></div><h2>Related releases</h2><p><strong>LiveBench</strong> &#8212; rolling, contamination-limited tasks with objective auto-grading; treat the release month as part of the benchmark ID. <a href="https://arxiv.org/abs/2406.19314">arXiv</a> &#183; <a href="https://livebench.ai/">site</a> &#183; <a href="https://github.com/LiveBench/LiveBench">GitHub</a>.</p><p><strong>MMLU-ProX (multilingual)</strong> &#8212; Pro-style difficulty extended to 13 languages; surfaces cross-lingual degradation and cultural variance. <a href="https://arxiv.org/abs/2503.10497">arXiv</a> &#183; <a href="https://mmluprox.github.io/">project page</a>.</p><div><hr></div><h2>Benchmarks &amp; evals</h2><p><strong>MMLU-Pro leaderboard</strong> &#8212; public Space for side-by-side results; handy for sanity-checking your harness settings. <a href="https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro">Hugging Face</a>.</p><p><strong>LiveBench leaderboard</strong> &#8212; dated drops + objective scoring; compare models within the same month to avoid rotation drift. <a href="https://livebench.ai/">site</a> &#183; <a href="https://arxiv.org/pdf/2406.19314">arXiv paper</a>.</p><p><strong>OpenCompass</strong> &#8212; end-to-end evaluation stack with task partitioning, judge options, and MMLU-family support; pin version/config. <a href="https://opencompass.readthedocs.io/en/latest/user_guides/framework_overview.html">docs</a> &#183; <a href="https://github.com/open-compass/opencompass">GitHub</a>.</p><p><strong>Lighteval</strong> &#8212; HF&#8217;s evaluation toolkit with task registry and sample-level artifacts; good for storing per-item results and commits. <a href="https://huggingface.co/docs/lighteval/en/index">docs</a> &#183; <a href="https://github.com/huggingface/lighteval">GitHub</a>.</p><p><strong>EleutherAI lm-evaluation-harness</strong> &#8212; the long-running baseline harness many papers still report; useful for cross-checking prompts/metrics. <a href="https://github.com/EleutherAI/lm-evaluation-harness">GitHub</a>.</p><div><hr></div><h2>Requests for datasets</h2><p>We (<a href="https://www.brickroadapp.com/">Brickroad</a>) pay a commission on the first closed deal when a referral leads to a purchase or placement. Details and terms &#8594; <a href="http://www.dataset.news/pitch">dataset.news/pitch</a>.</p><ul><li><p><strong>Fantasy &amp; Sci-Fi Imagery (CGI-enhanced, 4K+)</strong></p><p><em>Specs</em>: Fully rendered stills of space battles, alien worlds, mythical creatures, futuristic cityscapes, magical realms, dystopian landscapes, and character-driven action scenes. 4K+ resolution preferred; consistent aesthetic; annotations for scene description, character presence, lighting cues, etc. Sources may include 3D modeling, photorealistic rendering (Unreal/Blender/Maya), or high-end compositing. AI-generated images considered with strong fidelity and clear rights.</p><p><em>Rights</em>: Licensor must own or control all rights; commercial ML training/redistribution permitted.</p><p><em>Target price</em>: $0.10 per image</p><p><em>Use cases</em>: image-to-image, text-to-image</p></li><li><p><strong>Uncompressed Raw Video from Professional Cinema Cameras</strong></p><p><em>Specs</em>: Original camera files (uncompressed or lightly compressed RAW/ProRes) from RED, Blackmagic, ARRI Alexa, Sony Venice, Canon C-series, etc. 
Include metadata (resolution, frame rate, bit depth, lens, shooting conditions). Diverse environments (indoor/outdoor, low-light, high-motion); mix of handheld, gimbal, drone, tripod. Preference for consistent framing, color accuracy, and sensor-level detail. Minimum volume: &#8805;5 TB.</p><p><em>Rights</em>: Submitters must confirm data ownership and grant commercial ML training rights; publisher indemnity preferred.</p><p><em>Target price</em>: $400&#8211;$1,000 per TB (resolution, diversity, and metadata quality dependent)</p><p><em>Use cases</em>: fine-tuning generative video models; post-production automation</p></li><li><p><strong>Drone Videos of Cities (NA &amp; EU)</strong></p><p><em>Specs</em>: High-quality drone footage of North American and European cities. Include in-camera and on-drone metadata: GPS, accelerometer, gimbal data. Variety of routes, altitudes, and lighting conditions encouraged.</p><p><em>Rights</em>: Licensed for commercial AI/ML training; submitters attest to ownership/control of rights.</p><p><em>Target price</em>: $5.00 per minute</p><p><em>Use cases</em>: object detection, navigation, scene understanding</p></li></ul><p><em>We don&#8217;t feature or broker datasets with unclear rights or unsafe personal data. Always confirm licences &amp; provenance before commercial use.</em></p><p>If you need an NDA or have questions, email me at <a href="mailto:michael@brickroadapp.com">michael@brickroadapp.com</a>.</p><div><hr></div><p>What did this piece get right&#8212;or miss&#8212;about leakage and benchmark hygiene? Have you hit contamination yourself, or found a versioning policy that stuck? Leave a comment with one thing that broke and one control that worked.</p><p>If you buy data, the Brickroad marketplace is open: filter, preview, and license only the slices you need with standardized terms. <a href="https://www.brickroadapp.com/">Try the marketplace today</a>.</p><p>If you supply data and want to list inventory, <a href="https://www.brickroadapp.com/">upload directly</a> or email <a href="mailto:michael@brickroadapp.com">michael@brickroadapp.com</a>.</p><p>&#8212; Michael Gordon, cofounder &amp; CTO at Brickroad</p>]]></content:encoded></item><item><title><![CDATA[Judge swaps, drifting agent evals]]></title><description><![CDATA[Leaderboards move when the judge changes. 
Here&#8217;s how to keep web-agent evaluations stable enough to buy, benchmark, and ship against.]]></description><link>https://www.dataset.news/p/judge-swaps-drifting-agent-evals</link><guid isPermaLink="false">https://www.dataset.news/p/judge-swaps-drifting-agent-evals</guid><dc:creator><![CDATA[Michael Gordon]]></dc:creator><pubDate>Thu, 14 Aug 2025 06:34:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PlpM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0926c702-d057-4f88-8e1f-761102dca676_1536x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PlpM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0926c702-d057-4f88-8e1f-761102dca676_1536x1024.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!PlpM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0926c702-d057-4f88-8e1f-761102dca676_1536x1024.png" width="1456" height="971" class="sizing-normal" alt="agentic-evaluators" title="agentic-evaluators" sizes="100vw" fetchpriority="high"></picture></div></a>
<figcaption class="image-caption">The Judges&#8212;Agentic Evaluators</figcaption></figure></div><p>Agent results have been whipsawing as evaluators&#8212;<em>the judges</em>&#8212;change under the hood. A single score now hides a system: task design, environment realism, and the judge&#8217;s prompt and snapshot. Style sensitivity (markdown, verbosity) and time sensitivity (live web vs. static pages) further nudge rankings. The through-line: evaluation is becoming a portfolio, not a number.</p><h3>Why this matters</h3><p>If your procurement or model-gating depends on a score, a silent judge swap can reshuffle the leaderboard and your roadmap with it. Teams are increasingly asked to justify upgrades (or rollbacks) when a new general-purpose model lands. Without pinned judges, style controls, and time-invariant tasks, you risk evaluating <em>judge drift</em> more than model progress.</p><h4>Gotchas and fixes</h4><ul><li><p><strong>Judge drift</strong> &#8212; a silent judge swap can reorder rankings. <strong>Fix:</strong> pin the judge ID (family, snapshot, temperature) and publish dual-judge deltas; fail closed on unpinned judges (a sketch follows this list). See <a href="https://github.com/lm-sys/arena-hard-auto">Arena-Hard-Auto</a> for judge-split reporting.</p></li><li><p><strong>Style leakage</strong> &#8212; verbosity/markdown inflate scores without controls. <strong>Fix:</strong> switch to style-controlled judge prompts; keep a content-only ablation for audits.</p></li><li><p><strong>Time variance</strong> &#8212; static pages give stable trendlines; live sites drift and break. <strong>Fix:</strong> report both; tag item-level time sensitivity and freeze a monthly static slice using <a href="https://arxiv.org/abs/2507.00938">WebArXiv</a>.</p></li><li><p><strong>Scoring blind spots</strong> &#8212; rule-only checks miss genuine successes; LLM judges vary by domain/prompt. <strong>Fix:</strong> run a rule+LLM ensemble, calibrate on a human-rated slice, and publish disagreement rates, following <a href="https://arxiv.org/abs/2504.08942">AgentRewardBench</a>.</p></li></ul>
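<p>A minimal sketch of &#8220;pin the judge, fail closed&#8221;. The snapshot strings and config keys are illustrative, not a real provider API; the point is that a run refuses to score when the active judge drifts from the pinned one:</p><pre><code># Sketch: pinned judge identity with a fail-closed check.
PINNED_JUDGE = {
    "family": "gpt-4.1",                      # hypothetical judge family
    "snapshot": "2025-04-14",                 # hypothetical snapshot string
    "temperature": 0.0,
    "prompt_template_sha": "9f2c1ab407e3d512",
}

def assert_judge_pinned(active_config):
    for key, want in PINNED_JUDGE.items():
        got = active_config.get(key)
        if got != want:
            raise RuntimeError(f"judge drift on {key!r}: {got!r} vs {want!r}")

# Publish PINNED_JUDGE alongside every leaderboard number.
</code></pre>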
<h3>Evidence</h3><p>I&#8217;m grounding this piece in a few anchors: <strong><a href="https://github.com/lm-sys/arena-hard-auto">Arena-Hard-Auto</a></strong> for judge-split reporting and style control, <strong><a href="https://arxiv.org/abs/2504.08942">AgentRewardBench</a></strong> for how LLM judges agree (or don&#8217;t) on web-agent trajectories, and <strong><a href="https://github.com/WebChoreArena/WebChoreArena">WebChoreArena</a></strong> for long-horizon, reproducible chores; I use <strong><a href="https://arxiv.org/abs/2507.00938">WebArXiv</a></strong> as the static check. See full detail below.</p><h4><strong>Arena-Hard-Auto (v2 preview)</strong></h4><p><em>Judge-split reporting and style control to make evaluator effects visible; good for pinning judges and prompt hygiene.</em></p><ul><li><p><strong>Sourcing &amp; access</strong>: Hugging Face collection, <a href="https://github.com/lm-sys/arena-hard-auto">GitHub</a>. Access: open.</p></li><li><p><strong>What it is</strong>: Hard, free-form prompts with automated grading.</p></li><li><p><strong>Size &amp; splits</strong>: v2 preview; separate leaderboards by judge family.</p></li><li><p><strong>Provenance</strong>: Curated &#8220;hard&#8221; prompts; baselines from named APIs.</p></li><li><p><strong>License &amp; use rights</strong>: Apache-2.0 &#8212; redistribution permitted with notice.</p></li><li><p><strong>Safety &amp; compliance</strong>: Style control mitigates verbosity/formatting bias.</p></li><li><p><strong>Contamination risk</strong>: Public-like prompts; pin baseline versions.</p></li><li><p><strong>Baselines/benchmarks</strong>: Dual judge leaderboards.</p></li><li><p><strong>Operational notes</strong>: Publish the judge ID with scores.</p></li><li><p><strong>What to watch next</strong>: Judge ensembles; harder safety subsets.</p></li></ul>
<h4><strong>AgentRewardBench (2025)</strong></h4><p><em>Expert-labeled trajectories to quantify judge agreement/disagreement; useful for calibrating a rule+LLM ensemble.</em></p><ul><li><p><strong>Sourcing &amp; access</strong>: <a href="https://arxiv.org/abs/2504.08942">arXiv paper</a>, <a href="https://huggingface.co/datasets/McGill-NLP/agent-reward-bench">HF dataset</a>, <a href="https://github.com/McGill-NLP/agent-reward-bench">GitHub</a>.</p></li><li><p><strong>What it is</strong>: 1,302 web-agent trajectories across multiple web-agent benchmarks.</p></li><li><p><strong>Size &amp; splits</strong>: ~1.3k labeled items; documented splits on HF.</p></li><li><p><strong>Provenance</strong>: Trajectories sourced from WebArena / VisualWebArena / AssistantBench / WorkArena(++) and reviewed by experts.</p></li><li><p><strong>License &amp; use rights</strong>: No explicit SPDX license on HF; treat as restricted.</p></li><li><p><strong>Safety &amp; compliance</strong>: May contain screenshots of websites/apps.</p></li><li><p><strong>Contamination risk</strong>: Low for pretraining (trajectories), but judge-training leakage is possible&#8212;pin versions.</p></li><li><p><strong>Baselines/benchmarks</strong>: Authors compare multiple LLM judges; no single judge dominates across sets (see the sketch after this list).</p></li><li><p><strong>Operational notes</strong>: Parquet on HF; repo includes scoring scripts and submission guidance.</p></li><li><p><strong>What to watch next</strong>: Learned / finetuned judges (e.g., WebJudge), per-domain PRMs.</p></li></ul>
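<p>In the AgentRewardBench spirit, a hedged sketch of ensemble reporting: pair a deterministic rule check with an LLM judge per trajectory and publish the disagreement rate, not just a blended score. The input format is our own convention:</p><pre><code># Sketch: rule + LLM-judge ensemble with published disagreement.
def ensemble_report(items):
    """items: list of (rule_pass, judge_pass) booleans per trajectory."""
    n = len(items)
    return {
        "rule_pass_rate": sum(r for r, _ in items) / n,
        "judge_pass_rate": sum(j for _, j in items) / n,
        "strict_pass_rate": sum(1 for r, j in items if r and j) / n,
        "disagreement_rate": sum(1 for r, j in items if r != j) / n,
    }

print(ensemble_report([(True, True), (True, False), (False, True), (False, False)]))
</code></pre><p>Calibrate the judge on a human-rated slice first; if the disagreement rate moves month to month, suspect judge drift before model regression.</p>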
<h4><strong>WebChoreArena (2025)</strong></h4><p><em>Long-horizon, reproducible chores that cut live-web drift; anchors a time-invariant trendline.</em></p><ul><li><p><strong>Sourcing &amp; access</strong>: <a href="https://github.com/WebChoreArena/WebChoreArena">GitHub</a>; leaderboard / configs in-repo.</p></li><li><p><strong>What it is</strong>: 532 tedious, multi-step <em>chore</em> tasks on simulated sites (built on WebArena) stressing memory, calculation, and long-horizon context.</p></li><li><p><strong>Size &amp; splits</strong>: Full set + smaller subset for cost-controlled runs.</p></li><li><p><strong>Provenance</strong>: Extends WebArena/VisualWebArena; reproducible self-hosted sites with reset scripts; leaderboard added August 2025.</p></li><li><p><strong>License &amp; use rights</strong>: Apache-2.0 &#8212; redistribution permitted with notice.</p></li><li><p><strong>Safety &amp; compliance</strong>: Simulated content reduces PD exposure vs. live web.</p></li><li><p><strong>Contamination risk</strong>: Low for pretraining; avoid mixing generated traces back into training without URL-level logs.</p></li><li><p><strong>Baselines/benchmarks</strong>: Reported gaps across agents and judge models; model updates can regress.</p></li><li><p><strong>Operational notes</strong>: Runs on BrowserGym/AgentOccam; full-set costs can be high.</p></li><li><p><strong>What to watch next</strong>: Cross-site tasks; ensemble judge reporting.</p></li></ul><h3>Contamination, personal data, and redistribution</h3><p>Live-web agent sets implicate site terms of use and can capture incidental PII in screenshots and logs. Static sets such as <a href="https://arxiv.org/abs/2507.00938">WebArXiv</a> reduce this, but embedded assets may carry third-party rights. Treat <a href="https://huggingface.co/datasets/McGill-NLP/agent-reward-bench">AgentRewardBench</a> as non-redistributable unless clarified; &#8220;research-only&#8221; or fair-use language is jurisdiction-dependent. Lean towards simulated environments like <a href="https://github.com/WebChoreArena/WebChoreArena">WebChoreArena</a> (and WebArena variants) when you need clean redistribution, and keep URL-level logs for any real-web captures.</p><h3>Operational guidance</h3><ul><li><p><strong>Pin the judge</strong>: record model family/name, snapshot string, and prompt-template hash; publish with scores. See <a href="https://github.com/lm-sys/arena-hard-auto">Arena-Hard-Auto</a> for judge-split reporting.</p></li><li><p><strong>Use style-controlled judging</strong>: adopt style-control prompts; keep a content-only ablation for audits.</p></li><li><p><strong>Report rule + LLM ensemble</strong>: pair a task rule with an LLM judge; calibrate on a human-rated slice and publish disagreement rates, per <a href="https://arxiv.org/abs/2504.08942">AgentRewardBench</a>.</p></li><li><p><strong>Blend static and live</strong>: include a static anchor (e.g., <a href="https://arxiv.org/abs/2507.00938">WebArXiv</a>) and one live suite; tag item-level time sensitivity.</p></li><li><p><strong>Version and route datasets</strong>: treat terms-of-use-gated assets as non-redistributable by default; for Apache include NOTICE and link <a href="https://www.apache.org/licenses/LICENSE-2.0.txt">Apache-2.0</a>.</p></li><li><p><strong>Add freshness controls</strong>: rerun monthly on the same judge and a swapped judge; include a paraphrase or time-shifted slice (see the sketch after this list).</p></li><li><p><strong>Red-team the judge</strong>: vary templates and attack prompts; compare deltas to a fixed, human-rated subset before shipping a score.</p></li></ul>
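<p>A sketch of the monthly freshness control: same task set, the pinned judge, plus one swapped judge, with the delta published next to the headline number. <code>run_eval</code> is a placeholder for whatever harness you use:</p><pre><code># Sketch: monthly rerun with a judge swap as a robustness probe.
def freshness_report(run_eval, tasks, pinned_judge, swap_judge):
    pinned = run_eval(tasks, judge=pinned_judge)    # headline score
    swapped = run_eval(tasks, judge=swap_judge)     # robustness probe
    return {
        "pinned_judge_score": pinned,
        "swapped_judge_score": swapped,
        "judge_swap_delta": swapped - pinned,       # large deltas = judge-sensitive
    }
</code></pre>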
<h3>What this means for teams</h3><p>Judge swaps are now part of the evaluation surface. Split reporting (e.g. <a href="https://github.com/lm-sys/arena-hard-auto">Arena-Hard-Auto</a>) makes that dependence visible; learned judges and ensembles will likely normalize multi-judge scorecards rather than replace them. Static anchors and reproducible chores provide the longitudinal backbone; plan for <em>eval portfolios</em>, not single numbers.</p><p><strong>Open problem:</strong> A vendor-neutral <em>judge ID</em> schema (family, snapshot, template, temperature) with signed metadata so scores are portable and auditable across orgs.</p><p><strong>Actionable takeaway:</strong> Starting now, report <strong>(a)</strong> dual-judge results, <strong>(b)</strong> a style-controlled judge, and <strong>(c)</strong> a static-set trend line. Pin everything (judge + tasks) and ship URL-level logs with your scorecard.</p><div><hr></div><h2>Related releases</h2><ul><li><p><strong><a href="https://arxiv.org/abs/2506.21506">Mind2Web 2</a></strong> &#8212; 130 long-horizon, live-web tasks with an <em>agent-as-a-judge</em> rubric; useful patterns for judge design and source-attribution scoring. <a href="https://arxiv.org/abs/2506.21506">arXiv</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2406.12373">WebCanvas / Mind2Web-Live</a></strong> &#8212; online eval framework + live tasks with key-node checks; good reference for handling UI/content drift. <a href="https://arxiv.org/abs/2406.12373">arXiv</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2504.18575">WASP</a></strong> &#8212; security benchmark for prompt-injection against web agents; a practical pre-deployment gate. <a href="https://arxiv.org/abs/2504.18575">arXiv</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2504.12516">BrowseComp</a></strong> &#8212; &#8220;browsing competitions&#8221; that measure persistence/creativity with easily verifiable answers; handy smoke tests. <a href="https://arxiv.org/abs/2504.12516">arXiv</a></p></li><li><p><strong><a href="https://arxiv.org/html/2508.07999v1">WideSearch</a></strong> &#8212; broad information-seeking benchmark (new); complements task-oriented suites with open-ended search. <a href="https://arxiv.org/html/2508.07999v1">arXiv</a></p></li></ul><div><hr></div><h2>Benchmarks &amp; evals</h2><ul><li><p><strong>Online-Mind2Web leaderboard</strong> &#8212; live-site tasks + public board; sanity-check agent+judge stacks under real drift. <a href="https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard">Hugging Face</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2412.05467">BrowserGym ecosystem</a></strong> and <strong><a href="https://github.com/ServiceNow/BrowserGym">framework repo</a></strong> &#8212; unified runners/envs across suites; use for deterministic replays and comparable logs. <a href="https://arxiv.org/abs/2412.05467">arXiv</a> &#183; <a href="https://github.com/ServiceNow/BrowserGym">GitHub</a></p></li><li><p><strong><a href="https://github.com/web-arena-x/visualwebarena">VisualWebArena</a></strong> &#8212; multimodal (vision+web) tasks; required if screenshots/visual grounding drive success. <a href="https://github.com/web-arena-x/visualwebarena">GitHub</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2407.15711">AssistantBench</a></strong> &#8212; realistic, time-consuming tasks; pairs well with a static slice to separate model progress from web drift. <a href="https://arxiv.org/abs/2407.15711">arXiv</a></p></li><li><p><strong>Mind2Web 2 site</strong> &#8212; task descriptions, rubric details, and assets beyond the paper; useful when implementing judge templates. <a href="https://osu-nlp-group.github.io/Mind2Web-2/">osu-nlp-group.github.io</a></p></li></ul><div><hr></div><h2>Requests for datasets</h2><p>We (<a href="https://www.brickroadapp.com">Brickroad</a>) pay a commission on the first closed deal when a referral leads to a purchase or placement. 
Details and terms &#8594; <a href="http://www.dataset.news/pitch">dataset.news/pitch</a>.</p><ul><li><p><strong>Fantasy &amp; Sci-Fi Imagery (CGI-enhanced, 4K+)</strong></p><p><em>Specs</em>: Fully rendered stills of space battles, alien worlds, mythical creatures, futuristic cityscapes, magical realms, dystopian landscapes, and character-driven action scenes. 4K+ resolution preferred; consistent aesthetic; annotations for scene description, character presence, lighting cues, etc. Sources may include 3D modeling, photorealistic rendering (Unreal/Blender/Maya), or high-end compositing. AI-generated images considered with strong fidelity and clear rights.</p><p><em>Rights</em>: Licensor must own or control all rights; commercial ML training/redistribution permitted.</p><p><em>Target price</em>: $0.10 per image</p><p><em>Use cases</em>: image-to-image, text-to-image</p></li><li><p><strong>Uncompressed Raw Video from Professional Cinema Cameras</strong></p><p><em>Specs</em>: Original camera files (uncompressed or lightly compressed RAW/ProRes) from RED, Blackmagic, ARRI Alexa, Sony Venice, Canon C-series, etc. Include metadata (resolution, frame rate, bit depth, lens, shooting conditions). Diverse environments (indoor/outdoor, low-light, high-motion); mix of handheld, gimbal, drone, tripod. Preference for consistent framing, color accuracy, and sensor-level detail. Minimum volume: &#8805;5 TB.</p><p><em>Rights</em>: Submitters must confirm data ownership and grant commercial ML training rights; publisher indemnity preferred.</p><p><em>Target price</em>: $400&#8211;$1,000 per TB (resolution, diversity, and metadata quality dependent)</p><p><em>Use cases</em>: fine-tuning generative video models; post-production automation</p></li><li><p><strong>Drone Videos of Cities (NA &amp; EU)</strong></p><p><em>Specs</em>: High-quality drone footage of North American and European cities. Include in-camera and on-drone metadata: GPS, accelerometer, gimbal data. Variety of routes, altitudes, and lighting conditions encouraged.</p><p><em>Rights</em>: Licensed for commercial AI/ML training; submitters attest to ownership/control of rights.</p><p><em>Target price</em>: $5.00 per minute</p><p><em>Use cases</em>: object detection, navigation, scene understanding</p></li></ul><p><em>We don&#8217;t feature or broker datasets with unclear rights or unsafe personal data. Always confirm licences &amp; provenance before commercial use.</em></p><p>If you need an NDA or have questions, email me at <a href="mailto:michael@brickroadapp.com">michael@brickroadapp.com</a>.</p><div><hr></div><p>What did this piece get right&#8212;or miss&#8212;about judge drift and eval ops? Got a judge-swap story or a fix that saved a rollout? 
Leave a comment with one thing that broke and one change that worked.</p><p>If you buy data, the Brickroad marketplace is open: filter, preview, and license only the slices you need with standardized terms. <a href="https://www.brickroadapp.com">Try the marketplace today</a>.</p><p>If you supply data and want to list inventory, <a href="https://www.brickroadapp.com">upload directly</a> or email <a href="mailto:michael@brickroadapp.com">michael@brickroadapp.com</a>.</p><p>&#8212; Michael Gordon, cofounder &amp; CTO at Brickroad</p>]]></content:encoded></item></channel></rss>