Tired of beam search and all the heuristics needed to make it work well in MT?
In our work accepted at #NAACL2022 (co-lead @tozefarinhas) we explore an alternative decoding method that leverages neural metrics to produce better translations!
The most common method to obtain translations from a trained MT model is to approximately compute the *maximum-a-posteriori* (MAP) translation with algorithms like beam search.
However, many works have questioned the utility of likelihood as a proxy for translation quality.
2/14
In parallel, methods for Quality Estimation (QE) and evaluation of translated sentences have improved significantly thanks to pretrained LMs, with metrics such as BLEURT or COMET(-QE) achieving high correlations with human judgments of quality.
3/14
In this work, we leverage these advances and propose *quality-aware decoding*. The gist is to first extract candidate translations stochastically or deterministically from your model and *rank* them according to one or more QE and/or reference-based neural metrics.
4/14
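To make the "generate, then rank" idea concrete, here is a minimal sketch of N-best reranking with a reference-free scorer. `rerank_nbest` and `toy_qe_score` are hypothetical names, and the toy scorer is only a stand-in for a neural QE metric like COMET-QE (it just compares lengths), so treat this as an illustration rather than our actual pipeline.

```python
from typing import Callable, List, Sequence


def rerank_nbest(
    source: str,
    candidates: Sequence[str],
    qe_score: Callable[[str, str], float],
) -> List[str]:
    """Sort candidates from best to worst by a reference-free quality score."""
    return sorted(candidates, key=lambda hyp: qe_score(source, hyp), reverse=True)


def toy_qe_score(source: str, hypothesis: str) -> float:
    # Hypothetical stand-in for a neural QE metric (assumption, not a real metric):
    # penalise the difference in word count between source and hypothesis.
    return -abs(len(source.split()) - len(hypothesis.split()))


source = "das ist ein kleiner Test"
candidates = ["this is a test", "this is a small test", "test"]

ranked = rerank_nbest(source, candidates, toy_qe_score)
best = ranked[0]  # the candidate the scorer likes most
```

Swapping `toy_qe_score` for a real QE model only changes the scoring function; the ranking step stays the same.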
We explore using beam search, vanilla and nucleus sampling for generating candidates and two core ranking methods: N-best list reranking for QE metrics and Minimum Bayes Risk (MBR) decoding for reference-based metrics. We explore variations of them and even combine both!
5/14
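For reference-based metrics there is no reference at decoding time, so MBR uses the other candidates as pseudo-references and picks the hypothesis with the highest average utility. A minimal sketch, assuming a toy token-level F1 as the utility (a stand-in for COMET or BLEURT) and excluding each hypothesis from its own pseudo-references; both choices are assumptions made for illustration, and `mbr_decode` and `token_f1` are hypothetical names.

```python
from typing import Callable, Sequence


def mbr_decode(
    candidates: Sequence[str],
    utility: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest average utility against the others."""

    def expected_utility(i: int) -> float:
        hyp = candidates[i]
        # Every other candidate acts as a pseudo-reference for this hypothesis.
        others = [ref for j, ref in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(hyp, ref) for ref in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


def token_f1(hypothesis: str, reference: str) -> float:
    # Toy utility (assumption): F1 over whitespace tokens, counting repeats.
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum(min(hyp.count(tok), ref.count(tok)) for tok in set(hyp))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat is on the mat",
    "dogs bark loudly",
]
consensus = mbr_decode(samples, token_f1)
```

Note the cost: scoring every pair makes this quadratic in the number of candidates, which matters when the utility is a neural metric.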
Crucial for this method to work is the use of good metrics for the ranking. We explore various QE (e.g. COMET-QE and TransQuest) and reference-based metrics (e.g. COMET and BLEURT), many of them top submissions to their respective WMT shared tasks!
6/14
We experiment with quality-aware decoding across two model sizes and four datasets, comparing to beam search (BS) baselines on multiple metrics. We study the impact of the candidate generation method, the number of candidates and the ranking method/metrics used.
7/14
We found quality-aware decoding scales well with the number of candidates, especially when these are generated with stochastic methods.
Using ancestral sampling underperforms the BS baseline, but we can mitigate this with biased sampling techniques such as nucleus sampling!
8/14
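The difference between ancestral and nucleus sampling fits in a few lines. A minimal sketch over a single next-token distribution, assuming a plain dict of probabilities instead of real model logits; `nucleus_filter` and `sample_token` are hypothetical helper names. With `top_p=1.0` it reduces to ancestral sampling.

```python
import random
from typing import Dict, Optional


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalise."""
    kept: Dict[str, float] = {}
    total = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        total += p
        if total >= top_p:
            break  # the low-probability tail is dropped
    return {token: p / total for token, p in kept.items()}


def sample_token(
    probs: Dict[str, float],
    top_p: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Draw one token from the (optionally truncated) distribution."""
    rng = rng or random.Random()
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (made-up numbers for illustration).
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "xylophone": 0.05}
nucleus = nucleus_filter(next_token_probs, top_p=0.9)
```

Here the nucleus keeps "cat", "dog" and "car" and never samples the unlikely tail token, which is the bias that helps candidate quality.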
We also found that quality-aware decoding with neural-based metrics improves the quality of translations according to the same & other neural-based metrics!
9/14
However, this comes at a cost in lexical metrics. To check whether we are *overfitting* the neural metrics (reducing their correlation with humans), we perform a human evaluation.
While there is some overfitting, quality-aware decoding still outperforms BS in human-judged quality!
10/14
Our work is very similar to the concurrent work by @markuseful et al. (arxiv.org/abs/2111.09388). We extensively explore the impact of metric and ranking procedures, while they focus more on MBR with BLEURT, diving deeper into how translations differ from beam search.
11/14
The overfitting problem we identified was also explored in more depth concurrently by @chantalamrhein (arxiv.org/abs/2202.05148) using MBR with COMET.
Trying to interpret your neural networks but out-of-the-box methods aren't working?
In our new preprint, we propose a framework for automatically learning to explain NN decisions!
While many works propose methods for extracting explanations from neural networks, the interpretability community is still trying to figure out what explanations are supposed to *achieve* and how to *evaluate* them.
2/14
Recently, some have argued for the use of *simulability*: how much explanations help humans or other models predict the decisions of the model being explained.
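One simple way to read simulability as a number: compare how often a student matches the explained model's decisions with and without access to explanations. A minimal sketch under that assumption; `agreement` and `simulability_gain` are hypothetical names, the labels are made up, and this is not necessarily the exact measure used in the preprint.

```python
from typing import Sequence


def agreement(student_preds: Sequence[str], teacher_preds: Sequence[str]) -> float:
    """Fraction of examples where the student predicts the teacher's decision."""
    if len(student_preds) != len(teacher_preds):
        raise ValueError("prediction lists must have the same length")
    matches = sum(s == t for s, t in zip(student_preds, teacher_preds))
    return matches / len(teacher_preds)


def simulability_gain(
    teacher_preds: Sequence[str],
    student_without: Sequence[str],
    student_with: Sequence[str],
) -> float:
    """How much explanations improve the student's agreement with the teacher."""
    return agreement(student_with, teacher_preds) - agreement(student_without, teacher_preds)


# Made-up decisions for illustration only.
teacher = ["pos", "neg", "pos", "neg"]            # model being explained
without_expl = ["pos", "pos", "neg", "neg"]       # student without explanations
with_expl = ["pos", "neg", "pos", "pos"]          # student shown explanations

gain = simulability_gain(teacher, without_expl, with_expl)
```

A positive gain means the explanations made the model's decisions easier to predict.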