New work on autoregressive language models for retrieval!

We train our model, SEAL (Search Engines with Autoregressive LMs), to produce text snippets that occur somewhere in the corpus, specifically in relevant documents. The generated ngrams are then used to rank the documents in the corpus.

Thread 👇
This is the result of my internship @MetaAI. Joint work with @ot_y @PSH_Lewis @scottyih @riedelcastro @Fabio_Petroni. I couldn't have asked for a better team 💕
Code & checkpoints: github.com/facebookresear…
Paper: arxiv.org/abs/2204.10628

2/n
Why is retrieval with autoregressive language models a good idea to begin with? They have incredible scaling capabilities, are easy to train, and you don’t need to maintain a large database of document embeddings.

3/n
Yet, it is non-trivial to use them to search for documents in very large corpora. The recent, very cool Differentiable Search Index (DSI) paper from @YiTayML et al. (arxiv.org/abs/2202.06991) shows that large LMs can retrieve by decoding document ids directly.

4/n
In our work, we show that LMs can also be used in a more natural way: generating bits of text from the evidence document(s). The challenge here is that NLG is hard to control: the LM can generate not only ungrounded text (= not in the corpus), but also non-factual or abusive text.

5/n [Figure: constrained decoding example for a sample query]
Our secret weapon is to prune the output space using an indexing structure called an FM-index (en.wikipedia.org/wiki/FM-index). Using it, we find all valid continuations in the corpus at each step of constrained decoding. Every decoded sequence is then grounded in the document(s) it appears in.
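To make the mechanics concrete, here is a minimal, hypothetical sketch of corpus-constrained decoding using Hugging Face's `prefix_allowed_tokens_fn` hook, with an untuned BART and a brute-force token lookup standing in for the FM-index (SEAL itself fine-tunes BART-large and uses a compressed FM-index; the corpus, query, and names below are illustrative only):

```python
# Toy sketch of corpus-constrained decoding (illustrative only; SEAL uses
# a compressed FM-index and a fine-tuned BART-large, not this brute-force lookup).
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

corpus = [  # hypothetical mini-corpus
    "The aurora borealis is caused by charged particles hitting the atmosphere.",
    "The FM-index is a compressed substring index based on the Burrows-Wheeler transform.",
]
# Tokenize every passage once; an FM-index answers the same
# "which tokens can follow this ngram?" query efficiently at each step.
corpus_tokens = [tokenizer.encode(p, add_special_tokens=False) for p in corpus]

def allowed_tokens(batch_id, input_ids):
    # Keep only the ngram generated so far (drop decoder-start / BOS tokens).
    special = {model.config.decoder_start_token_id, tokenizer.bos_token_id}
    prefix = [t for t in input_ids.tolist() if t not in special]
    allowed = {tokenizer.eos_token_id}  # always allow the ngram to terminate
    for doc in corpus_tokens:
        for i in range(len(doc) - len(prefix)):
            if doc[i:i + len(prefix)] == prefix:
                allowed.add(doc[i + len(prefix)])  # token that continues the match
    return sorted(allowed)

query = "what causes the aurora borealis?"
inputs = tokenizer(query, return_tensors="pt")
out = model.generate(**inputs, num_beams=5, max_new_tokens=10,
                     prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # an ngram from the corpus
```

Because every beam can only extend strings that actually continue somewhere in the corpus, every generated ngram is guaranteed to be grounded in at least one document.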

6/n [Figure: FM-index overview]
The FM-index is light (smaller than plain text!), doesn't use VRAM, can reconstruct the original text, and is independent of LM embedding size, potentially enabling the use of much larger corpora. It can be used *without substantial modifications* with larger LMs.
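As a rough mental model of the interface SEAL needs from the index, here is a toy stand-in of my own (assumed class and method names); a real FM-index answers the same queries via backward search over the Burrows-Wheeler transform in compressed space, rather than with the naive suffix list used here:

```python
# Toy substring index exposing the operations SEAL relies on:
# count an ngram, list its allowed next tokens, and find the documents containing it.
import bisect

class ToySubstringIndex:
    def __init__(self, token_docs):
        # token_docs: list of token-id lists, one per document.
        self.docs = token_docs
        # All (doc_id, start) suffixes, sorted by the token sequence they spell.
        self.suffixes = sorted(
            ((d, i) for d, doc in enumerate(token_docs) for i in range(len(doc))),
            key=lambda s: self.docs[s[0]][s[1]:],
        )

    def _range(self, ngram):
        # Suffixes are sorted, so their len(ngram)-prefixes are sorted too.
        keys = [self.docs[d][i:i + len(ngram)] for d, i in self.suffixes]
        return bisect.bisect_left(keys, ngram), bisect.bisect_right(keys, ngram)

    def count(self, ngram):
        lo, hi = self._range(ngram)
        return hi - lo                      # corpus frequency of the ngram

    def continuations(self, ngram):
        lo, hi = self._range(ngram)
        out = set()
        for d, i in self.suffixes[lo:hi]:
            j = i + len(ngram)
            if j < len(self.docs[d]):
                out.add(self.docs[d][j])    # allowed next token ids
        return out

    def matching_docs(self, ngram):
        lo, hi = self._range(ngram)
        return {d for d, _ in self.suffixes[lo:hi]}   # grounding: docs containing it
```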

7/n [Figure: index statistics on NQ (around 21M passages)]
In SEAL, ngrams are scored by combining their (conditional) probability under the LM with their normalized frequency in the FM-index. Documents are scored by aggregating the scores of the ngrams that occur in them.
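A hedged sketch of the shape of this computation, reusing the toy index above (the exact weighting SEAL uses is spelled out in the paper; the combination and the `alpha` knob below are only illustrative):

```python
# Illustrative document scoring: reward ngrams the LM likes but that are rare
# in the corpus (discriminative), then sum per document. Not SEAL's exact formula.
import math
from collections import defaultdict

def score_documents(generated_ngrams, index, total_tokens, alpha=1.0):
    """generated_ngrams: list of (ngram_token_ids, lm_logprob) from constrained beam search.
    total_tokens: total number of tokens in the corpus, used to normalize frequencies."""
    doc_scores = defaultdict(float)
    for ngram, lm_logprob in generated_ngrams:
        freq = index.count(ngram)
        if freq == 0:
            continue                      # cannot happen with constrained decoding
        # log(freq / total_tokens) <= 0, so rarer ngrams get a larger bonus.
        ngram_score = lm_logprob - alpha * math.log(freq / total_tokens)
        for doc_id in index.matching_docs(ngram):
            doc_scores[doc_id] += max(0.0, ngram_score)
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
```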

8/n
But why do we need SEAL if we can generate doc ids directly, without a separate index at all? In our replication, DSI performs well on a small document collection, but underperforms BM25 on a full Wikipedia benchmark. SEAL works well with both small and large corpora.

9/n [Table: retrieval on NQ320k (~180k documents; BM25: 44.5 hits@10) and full NQ (~21M documents; BM25: 78.1 acc@100)]
Results are promising even when comparing against conventional retrieval systems: on Natural Questions we match or outperform DPR (acc@100 & EM) while using a much smaller index. On KILT, we improve the passage-level SotA by 10 points on average.

10/n [Table: retrieval on KILT dev (~36M passages), reporting R-precision]
We believe NLG methods for retrieval are super exciting, and could change IR in the near future, making web-scale retrieval more approachable. We really hope to see more work in this direction!

last/n
