Ever since MAEGE (aclanthology.org/P18-1127/), I have had a soft spot for evaluation of evaluation (EoE), especially of automatic metrics, though non-automatic ones are fine too.
Capturing formality - XLM-R with regression, not classification
Preservation - chrF, not BLEU
Fluency - XLM-R, but there is room for improvement
System ranking - XLM-R and chrF
Cross-lingual transfer - rely on zero-shot, not machine translation
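
For a feel of what the preservation metric in the list measures, here is a minimal sketch of a character n-gram F-score in the spirit of chrF. It is a simplification and not the reference implementation: it strips whitespace, uses character n-grams only, and returns a value in [0, 1]. The names `char_ngrams` and `simple_chrf` and the defaults `max_n=6`, `beta=2.0` are illustrative assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    # Simplifying assumption: remove all whitespace before extracting n-grams.
    s = "".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score (hypothetical helper, range 0..1)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # One side is too short to have n-grams of this order; skip it.
            continue
        # Clipped overlap: each n-gram counts at most as often as in both texts.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    # F-beta: with beta = 2, recall is weighted more heavily than precision.
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # A formal rewrite that keeps most of the content shares many character n-grams.
    print(simple_chrf("could you please help me", "can you help me"))
```

Because the score works on characters rather than whole words, partial overlaps such as shared stems or inflected forms still earn credit, which is one plausible reason it suits multilingual comparisons better than word-level matching.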