The dataset filters tons of open-source projects down to only those with high-quality committing habits
(e.g. large, active projects whose commit messages are of significant length, etc.); a rough sketch of that kind of filtering is below.
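Here is a minimal sketch of what such repository filtering could look like. The field names and the thresholds (stars, commit count, message length) are illustrative assumptions on my part, not the dataset's actual criteria.

```python
# Illustrative filter: keep only repos with "high quality committing habits".
# All thresholds below are hypothetical, not the paper's exact values.
MIN_STARS = 50        # proxy for "large" projects
MIN_COMMITS = 1000    # proxy for "active" projects
MIN_MSG_WORDS = 5     # commits "of significant length"

def keep_repo(repo: dict) -> bool:
    """Return True if the repository looks like it has good committing habits."""
    messages = repo["commit_messages"]
    substantive = [m for m in messages if len(m.split()) >= MIN_MSG_WORDS]
    return (
        repo["stars"] >= MIN_STARS
        and len(messages) >= MIN_COMMITS
        and len(substantive) / len(messages) >= 0.5  # mostly real messages, not "wip"
    )

repos = [
    {"name": "org/big-project", "stars": 1200,
     "commit_messages": ["Fix race condition in the scheduler queue"] * 2000},
    {"name": "me/toy-repo", "stars": 3,
     "commit_messages": ["wip", "fix", "asdf"] * 10},
]
clean = [r for r in repos if keep_repo(r)]
print([r["name"] for r in clean])  # ['org/big-project']
```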
We present some ways to evaluate whether the meaning was preserved while summarizing, so you can go beyond ROUGE.
We provide a strict split that keeps some repositories (a thousand, give or take) entirely out of training, so you can evaluate in-domain vs. out-of-domain, or just be sure your results are clean; a sketch of a repository-level split of this kind follows.
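A small sketch of what a strict, repository-level split means in practice: every example from a held-out repository goes to the out-of-domain set, so no test repository is ever seen in training. The deterministic hashing trick and the held-out fraction are my own illustrative choices, not the paper's procedure.

```python
import hashlib

def is_held_out(repo_name: str, held_out_fraction: float = 0.05) -> bool:
    """Deterministically assign whole repositories to the held-out pool."""
    digest = int(hashlib.md5(repo_name.encode("utf-8")).hexdigest(), 16)
    return (digest % 10_000) / 10_000 < held_out_fraction

examples = [
    {"repo": "org/big-project", "diff": "...", "message": "Fix race condition"},
    {"repo": "someone/smaller-repo", "diff": "...", "message": "Add retry logic"},
]
# Splitting by repository, never by individual example:
train = [e for e in examples if not is_held_out(e["repo"])]
ood_test = [e for e in examples if is_held_out(e["repo"])]
```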
If you ever want an even larger dataset, follow the same procedure with more repositories (we took only ones active in 2020; you could add ones that are no longer active, or ones that only became active since).
Ever since MAEGE (aclanthology.org/P18-1127/) I have had a soft spot for evaluation of evaluation = EoE (especially when it is automatic, but manual is still ok).
Capturing formality - XLM-R with regression, not classification (see the sketch after this list)
Preservation - chrF, not BLEU (also sketched below)
Fluency - XLM-R, but there is room for improvement
System ranking - XLM-R and chrF
Cross-lingual transfer - rely on zero-shot, not machine translation
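A hedged sketch of the first point: scoring formality with XLM-R as a regressor (a single output unit) rather than a binary formal/informal classifier. The model name and the idea of fine-tuning on human formality ratings are my assumptions about the setup; the untrained head below returns arbitrary scores until it is fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=1,               # single scalar output -> regression head
    problem_type="regression",  # MSE loss if you fine-tune on human ratings
)

def formality_score(sentence: str) -> float:
    """Return a scalar formality score for one sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

print(formality_score("hey, wanna grab lunch?"))
print(formality_score("I would be delighted to join you for lunch."))
```

Regression fits the task better than classification because formality is a matter of degree, and a scalar score can be correlated directly with human ratings.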
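And for the preservation point, a small example of computing chrF (via sacrebleu) instead of BLEU. The example sentences are invented, and pairing each rewrite with its reference is my assumption about how the comparison is set up.

```python
from sacrebleu.metrics import CHRF

chrf = CHRF()

references = ["I apologize, I am going to be late."]
system_outputs = ["Sorry, I will be arriving late."]

# Corpus-level score across all pairs
print(chrf.corpus_score(system_outputs, [references]).score)

# Per-sentence scores are handy for flagging individual meaning-changing rewrites
print(chrf.sentence_score(system_outputs[0], [references[0]]).score)
```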