Judge swaps, drifting agent evals
Leaderboards move when the judge changes. Here’s how to keep web-agent evaluations stable enough to buy, benchmark, and ship against.
Agent results have been whipsawing as evaluators—the judges—change under the hood. A single score now hides a system: task design, environment realism, and the judge’s prompt and snapshot. Style sensitivity (markdown, verbosity) and time sensitivity (live web vs. static pages) further nudge rankings. The through-line: evaluation is becoming a portfolio, not a number.
Why this matters
If your procurement or model-gating depends on a score, a silent judge swap can reshuffle the leaderboard and your roadmap with it. Teams are increasingly asked to justify upgrades (or rollbacks) when a new general-purpose model lands. Without pinned judges, style controls, and time-invariant tasks, you risk evaluating judge drift more than model progress.
Gotchas and fixes
Judge drift — a silent judge swap can reorder rankings. Fix: pin judge ID (family, snapshot, temperature) and publish dual-judge deltas; fail closed on unpinned judges. See Arena-Hard-Auto for judge-split reporting.
Style leakage — verbosity/markdown inflate scores without controls. Fix: switch to style-controlled judge prompts; keep a content-only ablation for audits.
Time variance — static pages give stable trendlines; live sites drift and break. Fix: report both; tag item-level time sensitivity and freeze a monthly static slice using WebArXiv.
Scoring blind spots — rule-only checks miss genuine successes; LLM judges vary by domain/prompt. Fix: run a rule+LLM ensemble, calibrate on a human-rated slice, and publish disagreement rates, following AgentRewardBench (a minimal sketch follows this list).
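To make that last fix concrete, here is a minimal sketch, assuming you already have a deterministic task rule, an LLM judge verdict, and human labels on a calibration slice. This is not the AgentRewardBench scoring code; Verdict and its fields are hypothetical stand-ins for your own checks.

```python
# Minimal sketch of a rule + LLM ensemble verdict with a published
# disagreement rate. All field names are illustrative.
from dataclasses import dataclass

@dataclass
class Verdict:
    task_id: str
    rule_pass: bool                  # deterministic task rule (URL reached, DOM assertion, etc.)
    llm_pass: bool                   # LLM judge verdict on the trajectory
    human_pass: bool | None = None   # filled in only for the calibration slice

def ensemble_report(verdicts: list[Verdict]) -> dict:
    n = len(verdicts)
    agree = sum(v.rule_pass == v.llm_pass for v in verdicts)
    # Calibrate the LLM judge against the human-rated slice only.
    rated = [v for v in verdicts if v.human_pass is not None]
    llm_acc = (
        sum(v.llm_pass == v.human_pass for v in rated) / len(rated)
        if rated else None
    )
    return {
        "n": n,
        "rule_pass_rate": sum(v.rule_pass for v in verdicts) / n,
        "llm_pass_rate": sum(v.llm_pass for v in verdicts) / n,
        "rule_llm_disagreement": 1 - agree / n,   # publish this alongside scores
        "llm_vs_human_accuracy": llm_acc,         # from the calibration slice
    }
```

The point is the output shape: pass rates, a rule-vs-LLM disagreement rate, and judge accuracy on the human slice all ship together, not just a single success number.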
Evidence
I’m grounding this piece in a few anchors: Arena-Hard-Auto for judge-split reporting and style control, AgentRewardBench for how LLM judges agree (or don’t) on web-agent trajectories, and WebChoreArena for long-horizon, reproducible chores; I use WebArXiv as the static check. See full detail below.
Arena-Hard-Auto (v2 preview)
Judge-split reporting and style control to make evaluator effects visible; good for pinning judges and prompt hygiene.
Sourcing & access: Hugging Face collection, GitHub. Access: open.
What it is: Hard, free-form prompts with automated grading.
Size & splits: v2 preview; separate leaderboards by judge family.
Provenance: Curated “hard” prompts; baselines from named APIs.
License & use rights: Apache-2.0 — redistribution permitted with notice.
Safety & compliance: Style control mitigates verbosity/formatting bias.
Contamination risk: Public-like prompts; pin baseline versions.
Baselines/benchmarks: Dual judge leaderboards.
Operational notes: Publish judge ID with scores.
What to watch next: Judge ensembles; harder safety subsets.
AgentRewardBench (2025)
Expert-labeled trajectories to quantify judge agreement/disagreement; useful for calibrating a rule+LLM ensemble.
Sourcing & access: arXiv paper, HF dataset, GitHub.
What it is: 1,302 web-agent trajectories drawn from multiple benchmarks.
Size & splits: ~1.3k labeled items; documented splits on HF.
Provenance: Trajectories sourced from WebArena / VisualWebArena / AssistantBench / WorkArena(++) and reviewed by experts.
License & use rights: No explicit SPDX license on HF; treat as restricted.
Safety & compliance: May contain screenshots of websites/apps.
Contamination risk: Low for pretraining (trajectories), but judge-training leakage is possible—pin versions.
Baselines/benchmarks: Authors compare multiple LLM judges; no single judge dominates across sets.
Operational notes: Parquet on HF; repo includes scoring scripts and submission guidance.
What to watch next: Learned / finetuned judges (e.g., WebJudge), per-domain PRMs.
WebChoreArena (2025)
Long-horizon, reproducible chores that cut live-web drift; anchors a time-invariant trendline.
Sourcing & access: GitHub; leaderboard / configs in-repo.
What it is: 532 tedious, multi-step chore tasks on simulated sites (built on WebArena) stressing memory, calculation, and long-horizon context.
Size & splits: Full set + smaller subset for cost-controlled runs.
Provenance: Extends WebArena/VisualWebArena; reproducible self-hosted sites with reset scripts; leaderboard added August 2025.
License & use rights: Apache-2.0 — redistribution permitted with notice.
Safety & compliance: Simulated content reduces personal-data exposure vs. live web.
Contamination risk: Low for pretraining; avoid mixing generated traces back into training without URL-level logs.
Baselines/benchmarks: Reported gaps across agents and judge models; model updates can regress.
Operational notes: Runs on BrowserGym/AgentOccam; full-set costs can be high.
What to watch next: Cross-site tasks; ensemble judge reporting.
Contamination, personal data, and redistribution
Live-web agent sets implicate site terms of use and can capture incidental PII in screenshots and logs. Static sets such as WebArXiv reduce this, but embedded assets may carry third-party rights. Treat AgentRewardBench as non-redistributable unless clarified; “research-only” or fair-use language is jurisdiction-dependent. Lean towards simulated environments like WebChoreArena (and WebArena variants) when you need clean redistribution, and keep URL-level logs for any real-web captures.
Operational guidance
Pin the judge: record model family/name, snapshot string, and prompt-template hash; publish them with every score (a minimal sketch follows this list). See Arena-Hard-Auto for judge-split reporting.
Use style-controlled judging: adopt style-control prompts; keep a content-only ablation for audits.
Report rule + LLM ensemble: pair a task rule with an LLM judge; calibrate on a human-rated slice and publish disagreement rates, per AgentRewardBench.
Blend static and live: include a static anchor (e.g., WebArXiv) and one live suite; tag item-level time sensitivity.
Version and route datasets: treat terms-of-use-gated assets as non-redistributable by default; for Apache-2.0 assets, include the NOTICE file and link the license.
Add freshness controls: rerun monthly on the same judge and a swapped judge; include a paraphrase or time-shifted slice.
Red-team the judge: vary templates and attack prompts; compare deltas to a fixed, human-rated subset before shipping a score.
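As promised above, here is a minimal sketch of the judge pin and the fail-closed check, assuming you emit one JSON scorecard per run. Field names are illustrative, not a standard schema.

```python
# Minimal sketch of the judge pin published with every scorecard.
import hashlib
import json
import datetime

def judge_pin(family: str, snapshot: str, temperature: float, prompt_template: str) -> dict:
    return {
        "judge_family": family,        # model family/name
        "judge_snapshot": snapshot,    # exact snapshot string from the provider
        "temperature": temperature,
        "prompt_template_sha256": hashlib.sha256(prompt_template.encode()).hexdigest(),
        "pinned_at": datetime.date.today().isoformat(),
    }

def emit_scorecard(scores: dict, pin: dict, static_slice_tag: str) -> str:
    # Fail closed: refuse to publish scores without a pinned judge snapshot.
    if not pin.get("judge_snapshot"):
        raise ValueError("unpinned judge: refusing to emit scores")
    return json.dumps(
        {"judge": pin, "static_slice": static_slice_tag, "scores": scores},
        indent=2,
    )
```

Downstream consumers (dashboards, procurement docs) can then key on judge_snapshot and prompt_template_sha256 instead of a bare model name.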
What this means for teams
Judge swaps are now part of the evaluation surface. Split reporting (e.g., Arena-Hard-Auto) makes that dependence visible; learned judges and ensembles will likely normalize multi-judge scorecards rather than replace them. Static anchors and reproducible chores provide the longitudinal backbone; plan for eval portfolios, not single numbers.
Open problem: A vendor-neutral judge ID schema (family, snapshot, template, temperature) with signed metadata so scores are portable and auditable across orgs.
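No such schema exists yet. As a rough thought experiment, signed judge metadata could look something like the sketch below, with HMAC standing in for whatever key infrastructure a real vendor-neutral schema would specify; everything here is an assumption, not a proposal anyone has shipped.

```python
# Hypothetical sketch of a signed judge-ID record. HMAC is a placeholder
# for a real signing scheme; field handling is illustrative only.
import hashlib
import hmac
import json

def sign_judge_id(record: dict, signing_key: bytes) -> dict:
    # Canonicalize so two orgs serializing the same record get the same bytes.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    signature = hmac.new(signing_key, canonical, hashlib.sha256).hexdigest()
    return {**record, "signature": signature}

def verify_judge_id(signed: dict, signing_key: bytes) -> bool:
    record = {k: v for k, v in signed.items() if k != "signature"}
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(signing_key, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])
```

The hard part is canonicalization: both orgs must serialize the record identically before signing or verification breaks, which is exactly why this needs a shared schema rather than ad hoc JSON.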
Actionable takeaway: Starting now, report (a) dual-judge results, (b) a style-controlled judge, and (c) a static-set trend line. Pin everything (judge + tasks) and ship URL-level logs with your scorecard.
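For the dual-judge piece, the report can be as small as per-model deltas plus a count of pairwise rank flips between two pinned judges. This sketch assumes you have a {model: score} dict per judge; nothing here mirrors any specific leaderboard's code.

```python
# Minimal sketch of dual-judge reporting: per-model score deltas plus the
# number of pairwise rank flips between judge A and judge B.
def dual_judge_report(scores_a: dict[str, float], scores_b: dict[str, float]) -> dict:
    models = sorted(set(scores_a) & set(scores_b))
    deltas = {m: round(scores_b[m] - scores_a[m], 2) for m in models}
    rank_a = {m: r for r, m in enumerate(sorted(models, key=scores_a.get, reverse=True))}
    rank_b = {m: r for r, m in enumerate(sorted(models, key=scores_b.get, reverse=True))}
    # A "flip" is any pair of models whose relative order differs between judges.
    flips = sum(
        (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n]) < 0
        for i, m in enumerate(models) for n in models[i + 1:]
    )
    return {"per_model_delta": deltas, "pairwise_rank_flips": flips}
```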
Related releases
Mind2Web 2 — 130 long-horizon, live-web tasks with an agent-as-a-judge rubric; useful patterns for judge design and source-attribution scoring. arXiv
WebCanvas / Mind2Web-Live — online eval framework + live tasks with key-node checks; good reference for handling UI/content drift. arXiv
WASP — security benchmark for prompt-injection against web agents; a practical pre-deployment gate. arXiv
BrowseComp — “browsing competitions” that measure persistence/creativity with easily verifiable answers; handy smoke tests. arXiv
WideSearch — broad information-seeking benchmark (new); complements task-oriented suites with open-ended search. arXiv
Benchmarks & evals
Online-Mind2Web leaderboard — live-site tasks + public board; sanity-check agent+judge stacks under real drift. Hugging Face
BrowserGym ecosystem and framework repo — unified runners/envs across suites; use for deterministic replays and comparable logs. arXiv, GitHub
VisualWebArena — multimodal (vision+web) tasks; required if screenshots/visual grounding drive success. GitHub
AssistantBench — realistic, time-consuming tasks; pairs well with a static slice to separate model progress from web drift. arXiv
Mind2Web 2 site — task descriptions, rubric details, and assets beyond the paper; useful when implementing judge templates. osu-nlp-group.github.io
Requests for datasets
We (Brickroad) pay a commission on the first closed deal when a referral leads to a purchase or placement. Details and terms → dataset.news/pitch.
Fantasy & Sci-Fi Imagery (CGI-enhanced, 4K+)
Specs: Fully rendered stills of space battles, alien worlds, mythical creatures, futuristic cityscapes, magical realms, dystopian landscapes, and character-driven action scenes. 4K+ resolution preferred; consistent aesthetic; annotations for scene description, character presence, lighting cues, etc. Sources may include 3D modeling, photorealistic rendering (Unreal/Blender/Maya), or high-end compositing. AI-generated images considered with strong fidelity and clear rights.
Rights: Licensor must own or control all rights; commercial ML training/redistribution permitted.
Target price: $0.10 per image
Use cases: image-to-image, text-to-image
Uncompressed Raw Video from Professional Cinema Cameras
Specs: Original camera files (uncompressed or lightly compressed RAW/ProRes) from RED, Blackmagic, ARRI Alexa, Sony Venice, Canon C-series, etc. Include metadata (resolution, frame rate, bit depth, lens, shooting conditions). Diverse environments (indoor/outdoor, low-light, high-motion); mix of handheld, gimbal, drone, tripod. Preference for consistent framing, color accuracy, and sensor-level detail. Minimum volume: ≥5 TB.
Rights: Submitters must confirm data ownership and grant commercial ML training rights; publisher indemnity preferred.
Target price: $400–$1,000 per TB (resolution, diversity, and metadata quality dependent)
Use cases: fine-tuning generative video models; post-production automation
Drone Videos of Cities (NA & EU)
Specs: High-quality drone footage of North American and European cities. Include in-camera and on-drone metadata: GPS, accelerometer, gimbal data. Variety of routes, altitudes, and lighting conditions encouraged.
Rights: Licensed for commercial AI/ML training; submitters attest to ownership/control of rights.
Target price: $5.00 per minute
Use cases: object detection, navigation, scene understanding
We don’t feature or broker datasets with unclear rights or unsafe personal data. Always confirm licenses & provenance before commercial use.
If you need an NDA or have questions, email me at michael@brickroadapp.com.
What did this piece get right—or miss—about judge drift and eval ops? Got a judge-swap story or a fix that saved a rollout? Leave a comment with one thing that broke and one change that worked.
If you buy data, the Brickroad marketplace is open: filter, preview, and license only the slices you need with standardized terms. Try the marketplace today.
If you supply data and want to list inventory, upload directly or email michael@brickroadapp.com.
— Michael Gordon, cofounder & CTO at Brickroad