Tired of beam search and all the heuristics needed to make it work well in MT?
In our work accepted at #NAACL2022 (co-lead @tozefarinhas) we explore an alternative decoding method that leverages neural metrics to produce better translations!


The most common method to obtain translations from a trained MT model is to approximately compute the *maximum-a-posteriori* (MAP) translation with algorithms like beam search.

However, many works have questioned the utility of likelihood as a proxy for translation quality.

In parallel, significant progress has been made recently in improving methods for Quality Estimation (QE) and evaluation of translated sentences by using pretrained LMs, with metrics such as BLEURT or COMET(-QE) achieving high correlations with human judgments of quality.

In this work, we leverage these advances and propose *quality-aware decoding*. The gist is to first extract candidate translations stochastically or deterministically from your model and then *rank* them according to one or more QE and/or reference-based neural metrics.

We explore using beam search, vanilla and nucleus sampling for generating candidates and two core ranking methods: N-best list reranking for QE metrics and Minimum Bayes Risk (MBR) decoding for reference-based metrics. We explore variations of them and even combine both!
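To make the two ranking methods concrete, here is a minimal sketch, not the code from our package. It assumes a toy unigram-overlap F1 (`overlap_f1`) standing in for a reference-based neural metric like COMET or BLEURT, and an arbitrary `qe` callable standing in for a QE metric like COMET-QE. All function names are hypothetical. In this sketch, MBR scores each candidate against the *other* candidates used as pseudo-references.

```python
from collections import Counter
from typing import Callable, List


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility: unigram-overlap F1 (an assumed stand-in for a neural metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def rerank(source: str, candidates: List[str],
           qe: Callable[[str, str], float]) -> str:
    """N-best reranking: return the candidate with the highest QE score."""
    return max(candidates, key=lambda cand: qe(source, cand))


def mbr_decode(candidates: List[str],
               utility: Callable[[str, str], float]) -> str:
    """MBR decoding: return the candidate with the highest average utility
    when the remaining candidates are treated as pseudo-references."""
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    cands = ["the cat sat on the mat", "the cat sat on a mat", "a dog ran away"]
    print(mbr_decode(cands, overlap_f1))
```

Note the cost difference: reranking needs one metric call per candidate, while MBR needs one call per candidate pair, which is why the number of candidates matters so much for MBR.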

Crucial for this method to work is the use of good metrics for the ranking. We explore various QE (e.g. COMET-QE and TransQuest) and reference-based metrics (e.g. COMET and BLEURT), many of them top submissions to their respective WMT shared tasks!

We experimented with quality-aware decoding across two model sizes and four datasets, comparing to beam search (BS) baselines in multiple metrics. We explore the impact of the candidate generation method, the number of candidates and the ranking method/metrics used.

We found that quality-aware decoding scales well with the number of candidates, especially when these are generated with stochastic methods.

Using ancestral sampling underperforms the BS baseline, but we can mitigate this with biased sampling techniques such as nucleus sampling!
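For readers unfamiliar with the difference, here is a rough sketch of a single nucleus (top-p) sampling step over a toy next-token distribution. The function names and the dictionary representation are assumptions for illustration, not how an MT toolkit implements it. Setting `top_p=1.0` keeps the full distribution, which corresponds to ancestral sampling.

```python
import random
from typing import Dict, Optional


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, then renormalise the kept probabilities."""
    kept: Dict[str, float] = {}
    mass = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept.items()}


def sample_token(probs: Dict[str, float], top_p: float = 1.0,
                 rng: Optional[random.Random] = None) -> str:
    """Draw one token from the truncated, renormalised distribution."""
    rng = rng or random.Random()
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    next_token = {"cat": 0.5, "dog": 0.3, "emu": 0.15, "zzz": 0.05}
    print(nucleus_filter(next_token, top_p=0.7))
    print(sample_token(next_token, top_p=0.7, rng=random.Random(0)))
```

Cutting off the low-probability tail is what biases the samples toward likelier tokens, so the candidate pool stays diverse without filling up with very unlikely translations.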

We also found that quality-aware decoding with neural-based metrics improves the quality of translations according to the same & other neural-based metrics!

However, this comes at a cost in lexical metrics. To investigate whether we are *overfitting* the neural metrics (reducing their correlation with humans), we perform a human evaluation.

While there is some overfitting, quality-aware decoding still outperforms BS in human-judged quality!

Our work is very similar to the concurrent work by @markuseful et al. (arxiv.org/abs/2111.09388), although we extensively explore the impact of metrics and ranking procedures, while they focus more on MBR with BLEURT, diving deeper into how translations differ from beam search.

The overfitting problem identified was also something that was explored more in-depth concurrently by @chantalamrhein (arxiv.org/abs/2202.05148) using MBR with COMET.

All our code is available at github.com/deep-spin/qawa…

We made a simple Python package that should allow you to run quality-aware decoding (both reranking and MBR) in a few lines!

I would like to thank this super team (@tozefarinhas @RicardoRei7 @accezz @a_ogayo @gneubig @andre_t_martins) for all their help!

This was work done within the scope of @CMUPortugal's MAIA project.

