Planned
:bulb:
Setup comparability checker
Let users select any two evaluation results and get a field-by-field comparison of their setups: temperature, max tokens, harness, split, metric variant, and evaluator. The comparison would end with a clear verdict: “directly comparable” or, for example, “differs in: harness, split”. Right now the comparability flag only tells you that two scores diverge; this would let anyone check why before quoting a comparison.
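
As a rough illustration of the idea, the check could be sketched as below. This is only a sketch under assumptions: the `EvalSetup` record, its field names, the `compare_setups` function, and the example values are all hypothetical and do not reflect an existing schema or API. Only the six compared fields and the two verdict strings come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalSetup:
    """Hypothetical record of the setup behind one evaluation result."""
    temperature: float
    max_tokens: int
    harness: str
    split: str
    metric_variant: str
    evaluator: str


def compare_setups(a: EvalSetup, b: EvalSetup) -> tuple[dict[str, tuple[object, object, bool]], str]:
    """Compare two setups field by field and summarise the result.

    Returns a mapping of field name -> (value in a, value in b, values match)
    together with a one-line verdict.
    """
    rows: dict[str, tuple[object, object, bool]] = {}
    for f in fields(EvalSetup):
        left, right = getattr(a, f.name), getattr(b, f.name)
        rows[f.name] = (left, right, left == right)

    # Collect the names of fields whose values differ, in declaration order.
    differing = [name for name, (_, _, same) in rows.items() if not same]
    if differing:
        verdict = "differs in: " + ", ".join(differing)
    else:
        verdict = "directly comparable"
    return rows, verdict


# Example with made-up values: the two runs differ only in harness and split.
run_a = EvalSetup(0.0, 1024, "harness-a", "test", "exact_match", "evaluator-x")
run_b = EvalSetup(0.0, 1024, "harness-b", "validation", "exact_match", "evaluator-x")

rows, verdict = compare_setups(run_a, run_b)
for name, (left, right, same) in rows.items():
    print(f"{name:15} {left!s:12} {right!s:12} {'same' if same else 'DIFFERENT'}")
print(verdict)  # differs in: harness, split
```

Exact equality is the simplest possible rule here. A real checker might want tolerances (for example, treating temperatures of 0 and 0.0 as equal already works, but near-equal floats would not) or a notion of which fields matter for a given benchmark.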