Dataset News

Refreshing general-knowledge benchmarks without leakage
Static QA tests are showing their age. Here’s how to refresh them without rewarding recollection.
Aug 21 • Michael Gordon
Judge swaps, drifting agent evals
Leaderboards move when the judge changes. Here’s how to keep web-agent evaluations stable enough to buy, benchmark, and ship against.
Aug 14 • Michael Gordon
© 2025 Michael Gordon