Planned

Setup comparability checker

Let users select any two evaluation results and get a field-by-field comparison of their setups: temperature, max tokens, harness, split, metric variant, and evaluator. The comparison should end with a clear verdict, either “directly comparable” or a list of the fields that differ (for example, “differs in: harness, split”). Right now the comparability flag only tells you that two scores diverge; this would let anyone check why before quoting a comparison.
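To make the request concrete, here is a rough sketch of the comparison logic in Python. It is only an illustration: the `EvalSetup` class, its field names and types, and the `compare_setups` function are hypothetical and assume each result stores its setup as flat values. The real data model may look different.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalSetup:
    # Hypothetical schema: one attribute per setup field named in this request.
    temperature: float
    max_tokens: int
    harness: str
    split: str
    metric_variant: str
    evaluator: str


def compare_setups(a: EvalSetup, b: EvalSetup):
    """Return (rows, verdict) for a field-by-field comparison of two setups."""
    # One row per field: (field name, value in a, value in b, values match?)
    rows = [
        (f.name, getattr(a, f.name), getattr(b, f.name),
         getattr(a, f.name) == getattr(b, f.name))
        for f in fields(EvalSetup)
    ]
    differing = [name for name, _, _, same in rows if not same]
    if differing:
        verdict = "differs in: " + ", ".join(differing)
    else:
        verdict = "directly comparable"
    return rows, verdict


# Example values are made up for illustration.
run_a = EvalSetup(0.0, 1024, "harness-a", "test", "exact_match", "rule-based")
run_b = EvalSetup(0.0, 1024, "harness-b", "validation", "exact_match", "rule-based")

rows, verdict = compare_setups(run_a, run_b)
for name, left, right, same in rows:
    print(f"{name:15} {left!s:12} {right!s:12} {'same' if same else 'DIFFERENT'}")
print(verdict)
```

Exact equality is the simplest rule; a real checker might want tolerances for numeric fields like temperature, or a way to mark some fields as not affecting comparability.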

1 vote

Tagged as New feature

Created Wednesday by AK