Suggestions


Benchmark health dashboard

A view for benchmark authors showing how their benchmark is actually used in the wild: how many models report it, which splits and metric variants people use, how its metadata completeness compares with the corpus median, and where reported scores diverge. This would show eval developers what to standardize or document next.
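To make the request concrete, here is a rough sketch of the per-benchmark summary such a view could compute. Everything in it is an assumption for illustration: the function name `benchmark_health`, the report keys (`model`, `split`, `metric`, `score`), the idea that metadata completeness is the fraction of non-empty fields, and the choice to measure divergence as the max-minus-min spread among scores reported for the same model, split, and metric. The real data model may look quite different.

```python
from collections import Counter, defaultdict
from statistics import median


def benchmark_health(reports, metadata, corpus_completeness):
    """Summarize how one benchmark is reported (hypothetical schema).

    reports: list of dicts with keys model, split, metric, score
    metadata: dict of field name -> value (None or "" counts as missing)
    corpus_completeness: completeness fractions of other benchmarks
    """
    models = {r["model"] for r in reports}
    splits = Counter(r["split"] for r in reports)
    metrics = Counter(r["metric"] for r in reports)

    # Completeness = share of metadata fields that are filled in (assumed definition).
    filled = sum(1 for v in metadata.values() if v not in (None, ""))
    completeness = filled / len(metadata) if metadata else 0.0

    # Group scores that should be comparable: same model, split, and metric.
    groups = defaultdict(list)
    for r in reports:
        groups[(r["model"], r["split"], r["metric"])].append(r["score"])
    # Divergence = spread between the highest and lowest reported score.
    divergence = {k: max(v) - min(v) for k, v in groups.items() if len(v) > 1}

    return {
        "num_models": len(models),
        "split_usage": dict(splits),
        "metric_usage": dict(metrics),
        "completeness": completeness,
        "corpus_median_completeness": (
            median(corpus_completeness) if corpus_completeness else None
        ),
        "divergence": divergence,
    }


# Made-up example data, only to show the shape of the output.
reports = [
    {"model": "model-a", "split": "test", "metric": "accuracy", "score": 71.0},
    {"model": "model-a", "split": "test", "metric": "accuracy", "score": 68.5},
    {"model": "model-b", "split": "dev", "metric": "f1", "score": 64.0},
]
metadata = {"license": "MIT", "languages": None, "task": "qa", "citation": ""}
summary = benchmark_health(reports, metadata, [0.4, 0.75, 0.9])
```

In this toy case the summary would flag that model-a has two different test accuracy numbers in circulation, that only half of the metadata fields are filled in against a corpus median of 0.75, and that reports are split between the test and dev sets.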

1 vote

Tagged as New feature

Created Wednesday by AK