Michele Bevilacqua
Researcher. ex @SapienzaNLP/@MetaAI. Natural Language Processing. He/him.

Apr 25, 2022, 11 tweets

New work on autoregressive language models for retrieval!

We train our model, SEAL (Search Engines with Autoregressive LMs), to produce text snippets that occur verbatim in relevant documents in the corpus. The generated ngrams are then used to rank documents in the corpus.

Thread 👇

This is the result of my internship @MetaAI. Joint work with @ot_y @PSH_Lewis @scottyih @riedelcastro @Fabio_Petroni. I couldn't have asked for a better team 💕
Code & checkpoints: github.com/facebookresear…
Paper: arxiv.org/abs/2204.10628

2/n

Why is retrieval with autoregressive language models a good idea to begin with? They scale up remarkably well, are easy to train, and you don’t need to maintain a large database of document embeddings.

3/n

Yet, it is non-trivial to use them to search for documents in very large corpora. The recent (very cool!) Differentiable Search Index (DSI) paper from @YiTayML et al. (arxiv.org/abs/2202.06991) shows that large LMs can retrieve by decoding document ids directly.

4/n

In our work, we show that LMs can also be used in a more natural way: generating bits of text from the evidence document(s) themselves. The challenge is that NLG is hard to control: the LM can generate not just ungrounded text (= not in the corpus), but also non-factual/abusive text.

5/n

Our secret weapon is pruning the output space with an indexing structure called the FM-index (en.wikipedia.org/wiki/FM-index). During constrained decoding, we use it to find all continuations that occur in the corpus, so every decoded sequence is grounded in the document(s) it appears in.
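To make the constraint concrete, here is a minimal Python sketch of corpus-constrained decoding. Everything in it (the toy corpus, `allowed_next_tokens`, `constrained_greedy_decode`, the dummy LM scorer) is a hypothetical stand-in: the real system uses FM-index backward search rather than a naive substring scan, and an actual autoregressive LM rather than a fixed score table.

```python
from typing import List, Set

# Toy corpus standing in for the retrieval collection.
corpus = [
    "carbon dioxide is a greenhouse gas",
    "the fm index is a compressed full text index",
]

def allowed_next_tokens(prefix: List[str]) -> Set[str]:
    """Tokens that can extend `prefix` into an ngram attested in the corpus.

    A real system would get this from FM-index backward search; here we
    just scan every document (naive stand-in)."""
    allowed = set()
    for doc in corpus:
        words = doc.split()
        for i in range(len(words) - len(prefix)):
            if words[i:i + len(prefix)] == prefix:
                allowed.add(words[i + len(prefix)])
    return allowed

def constrained_greedy_decode(lm_next_logprobs, max_len: int = 4) -> List[str]:
    """Greedy decoding where every step is restricted to corpus-attested tokens.

    `lm_next_logprobs(prefix)` is assumed to return a dict token -> log-prob."""
    prefix: List[str] = []
    for _ in range(max_len):
        candidates = allowed_next_tokens(prefix)
        if not candidates:
            break
        scores = lm_next_logprobs(prefix)
        prefix.append(max(candidates, key=lambda t: scores.get(t, float("-inf"))))
    return prefix

# Dummy "LM": any decoded ngram is still guaranteed to occur in the corpus.
print(constrained_greedy_decode(lambda p: {"is": -0.5, "a": -1.0, "the": -0.2}))
```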

6/n

The FM-index is light (smaller than plain text!), doesn't use VRAM, can reconstruct the original text, and is independent of LM embedding size, potentially enabling the use of much larger corpora. It can be used *without substantial modifications* with larger LMs.

7/n

In SEAL, ngrams are scored by combining the (conditional) probability according to the LM and the normalized frequency in the FM-index. Documents are scored by aggregating the scores of the ngrams that occur in them.
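As a rough illustration (not the exact formula from the paper), one way to combine the two signals and aggregate them over documents could look like the sketch below; `ngram_score`, `rank_documents`, and the weighting are all hypothetical.

```python
import math
from collections import defaultdict

def ngram_score(lm_logprob: float, ngram_count: int, corpus_size: int) -> float:
    """Mix the LM's conditional log-prob with a rarity term derived from
    the ngram's corpus frequency (hypothetical weighting)."""
    rarity = -math.log((ngram_count + 1) / (corpus_size + 1))
    return lm_logprob + rarity

def rank_documents(generated, fm_count, docs_containing, corpus_size):
    """Aggregate ngram scores into document scores.

    generated:              list of (ngram, lm_logprob) pairs from decoding
    fm_count(ngram):        corpus frequency, as the FM-index would report it
    docs_containing(ngram): ids of documents in which the ngram occurs
    """
    doc_scores = defaultdict(float)
    for ngram, lm_logprob in generated:
        s = ngram_score(lm_logprob, fm_count(ngram), corpus_size)
        for doc_id in docs_containing(ngram):
            doc_scores[doc_id] += s
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: two generated ngrams over a 1000-token toy corpus.
ranking = rank_documents(
    [("greenhouse gas", -1.2), ("carbon dioxide", -0.8)],
    fm_count=lambda ng: 3,
    docs_containing=lambda ng: ["doc_42"],
    corpus_size=1000,
)
```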

8/n

But why do we need SEAL if we can generate doc ids directly, without a separate index at all? In our replication, DSI performs well with a small document collection, but underperforms BM25 on a full Wikipedia benchmark. SEAL works well with both small and large corpora.

9/n

Results are promising even when compared against conventional retrieval systems: on Natural Questions we match or outperform DPR (acc@100 & EM) while using a much smaller index. On KILT, we improve passage-level SotA by 10 points on average.

10/n

We believe NLG methods for retrieval are super exciting, and could change IR in the near future, making web-scale retrieval more approachable. We really hope to see more work in this direction!

last/n
