Dataset News

Refreshing general-knowledge benchmarks without leakage
Static QA tests are showing their age; how to refresh without rewarding recollection.
Aug 21, 2025 • Michael Gordon
Judge swaps, drifting agent evals
Leaderboards move when the judge changes. Here’s how to keep web-agent evaluations stable enough to buy, benchmark, and ship against.
Aug 14, 2025 • Michael Gordon
© 2026 Michael Gordon