@luke_wood_ml A generated pic just for fun: "Photorealistic representation of a muscular barbarian wearing bunny ears wielding the sword Excalibur"
@luke_wood_ml And another one where Chewbacca the rabbit makes a guest appearance 🤣.
"High-quality photorealistic representation of a muscular barbarian wearing bunny ears and high-tech headphones. Millenium Falcon, death star in the background."
How can you probe what a language model knows? If you ask it directly, it might lie (for example because you prefixed your question with untruths, or for many other reasons).
Contrast-Consistent Search (CCS) gives a way: openreview.net/pdf?id=ETKGuby…
It takes advantage of a nice property of True/False statements: they cannot be True and False at the same time.
Take any statement, e.g. "The Eiffel Tower is a crab", append "True" to one copy and "False" to another, and you have two mutually exclusive statements.
You can then feed these into a language model and train a small neural network to classify them as True/False using only some of the model's internal activations. The language model itself stays frozen.
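Here is a minimal sketch of how such a probe could be trained, based on my reading of the paper (the activation tensors acts_true / acts_false are assumed to have been extracted beforehand; normalization details are omitted). The loss pushes the two probabilities to sum to one while discouraging the trivial 0.5/0.5 answer:

```python
# Minimal sketch of a CCS-style probe (my reading of the paper, not the authors' code).
# Assumes you already extracted hidden activations for each statement pair:
#   acts_true[i]  = activations of "<statement i> True"
#   acts_false[i] = activations of "<statement i> False"
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Tiny probe mapping a frozen LM's activations to P(statement is true)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_true, p_false):
    # Consistency: the two probabilities should sum to 1 (mutually exclusive statements).
    consistency = (p_true - (1.0 - p_false)) ** 2
    # Confidence: discourage the degenerate answer p_true = p_false = 0.5.
    confidence = torch.minimum(p_true, p_false) ** 2
    return (consistency + confidence).mean()

def train_probe(acts_true, acts_false, epochs=100, lr=1e-3):
    probe = CCSProbe(acts_true.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_true), probe(acts_false))
        loss.backward()
        opt.step()
    return probe
```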
Thought-provoking new paper from @geoffreyhinton: what if we could replace backpropagation with something better?
@geoffreyhinton It seems very unlikely that the human brain uses backpropagation to learn. There is little evidence of backprop mechanics in biological brains (no error derivatives propagating backwards, no storage of neuron activations for use in a backward pass, ...).
@geoffreyhinton Also, the brain can learn from a continuous stream of incoming data and does not need to stop to run a backprop pass. Yes, sleep is beneficial for learning somehow, but we can learn while awake too.
Contrastive Search is the new kid on the block for text generation from language models. Better than greedy or beam search, top-k, nucleus sampling, etc.
It can continue text from a prefix with quality indistinguishable from a human's, as judged by human raters.
paper: arxiv.org/abs/2210.14140
In the experimental results above, the model continues a given text and human raters evaluate the result.
The raters preferred text generated by contrastive search 60-70% of the time (green box). When comparing it to human-written continuations, they were undecided (red ellipses).
Intuitively, contrastive search encourages the model to generate the most likely sequence of words without repeating itself. The decoding objective maximizes likelihood while minimizing the cosine similarity with already generated tokens (the "degeneration penalty").
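A hedged sketch of a single decoding step along these lines (illustrative only, not the paper's reference implementation; it assumes a Hugging Face-style causal LM that can return hidden states):

```python
# Illustrative single step of contrastive search (not the paper's reference code).
# Assumes a Hugging Face-style causal LM that accepts output_hidden_states=True.
import torch
import torch.nn.functional as F

def last_hidden(model, ids):
    out = model(ids, output_hidden_states=True)
    return out.hidden_states[-1][:, -1, :]           # hidden state of the last token

def contrastive_search_step(model, input_ids, k=5, alpha=0.6):
    """Pick the next token: high model probability, low similarity to past tokens."""
    out = model(input_ids, output_hidden_states=True)
    probs = F.softmax(out.logits[:, -1, :], dim=-1)
    past_hidden = out.hidden_states[-1]              # (1, seq_len, dim)
    top_p, top_ids = probs.topk(k, dim=-1)

    best_score, best_tok = -float("inf"), None
    for p, tok in zip(top_p[0], top_ids[0]):
        candidate = torch.cat([input_ids, tok.view(1, 1)], dim=-1)
        h = last_hidden(model, candidate)            # (1, dim)
        # Degeneration penalty: max cosine similarity with previous tokens' states.
        penalty = F.cosine_similarity(h.unsqueeze(1), past_hidden, dim=-1).max()
        score = (1 - alpha) * p - alpha * penalty
        if score > best_score:
            best_score, best_tok = score, tok
    return best_tok
```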
Google's LaMDA paper arxiv.org/abs/2201.08239 shows yet another information retrieval strategy: the model has been taught to query a search engine or a calculator.
The first answer, "It [Eiffel Tower] was constructed in 1887", is generated directly, but is also recognized as containing a factual statement. This sends the whole context to LaMDA-Research, which is trained to generate search queries, here "TS, Eiffel Tower, construction date".
"TS" means "ToolSet", i.e. the generated text is meant for a tool, the search engine, not the user.
Info from the search, i.e. "Eiffel Tower / construction started: 28 Jan 1887", is appended to the context, which is sent to LaMDA-Research again.
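Rough pseudo-code of this loop as described above (base_model, research_model and toolset are hypothetical placeholders, not the actual LaMDA API):

```python
# Rough sketch of the LaMDA "toolset" loop described above. All helper objects
# (base_model, research_model, toolset) are hypothetical stand-ins, not LaMDA's API.

def answer_with_toolset(base_model, research_model, toolset, context, max_hops=4):
    draft = base_model.generate(context)             # e.g. "It was constructed in 1887."
    for _ in range(max_hops):
        out = research_model.generate(context + "\n" + draft)
        if out.startswith("TS,"):
            # Output addressed to the ToolSet, e.g. "TS, Eiffel Tower, construction date".
            query = out[len("TS,"):].strip()
            fact = toolset.lookup(query)             # search engine / calculator result
            context = context + "\n" + fact          # e.g. "Eiffel Tower / construction started: 28 Jan 1887"
            draft = research_model.generate(context)
        else:
            return out                               # output addressed to the user
    return draft
```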
This is sweet 🥧! arxiv.org/abs/2202.01197
Finally, a solid way of teaching a neural network to know what it does not know.
(OOD = Out Of Distribution, i.e. not one of the classes in the training data.) Congrats @SharonYixuanLin @xuefeng_du @MuCai7
The nice part is that it's a purely architectural change to the detection network, with a new contrastive loss that does not introduce additional hyper-parameters. No additional data required!
The results are competitive with training on a larger dataset manually extended with outliers: "Our method achieves OOD detection performance on COCO (AUROC: 88.66%) that favorably matches outlier exposure (AUROC: 90.18%), and does not require external data."
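As a generic illustration of the "know what you don't know" idea (this is not the paper's loss, just the standard energy-score recipe from the OOD literature for flagging unfamiliar inputs at test time):

```python
# Generic illustration of test-time OOD detection via an energy score
# (a standard recipe from the OOD literature, not this paper's specific method).
import torch

def energy_score(logits: torch.Tensor) -> torch.Tensor:
    # Higher energy = the model is less familiar with the input.
    return -torch.logsumexp(logits, dim=-1)

def predict_or_abstain(logits: torch.Tensor, threshold: float):
    """Return the predicted class, or None when the input looks out-of-distribution."""
    if energy_score(logits).item() > threshold:
        return None                                  # "I don't know"
    return int(logits.argmax(dim=-1))
```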
I like the "database layer" developed by DeepMind in their RETRO architecture: deepmind.com/blog/article/l…
It teaches the model to retrieve text chunks from a vast textual database (by nearest-neighbour matching of their BERT-generated embeddings) and to use them when generating text.
It's a bit different from the "memory layer" I tweeted about previously, which provides a large learnable memory, without increasing the number of learnable weights. (for ref: arxiv.org/pdf/1907.05242…)
This time, the model learns the trick of retrieving relevant pieces of knowledge from a large corpus of text.
The end result is similar: an NLP model that can do what the big guns (Gopher, Jurassic-1, GPT-3) can, with a tenth of their learnable weights.
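A rough sketch of the retrieval step described above, with brute-force cosine similarity as a stand-in for the approximate nearest-neighbour index used at scale (embed is an assumed frozen BERT-style encoder returning one vector per chunk):

```python
# Rough sketch of RETRO-style chunk retrieval: brute-force nearest neighbours
# over pre-computed embeddings. `embed` is an assumed frozen BERT-style encoder;
# the real system uses an approximate nearest-neighbour index over a far larger database.
import numpy as np

def build_index(chunks, embed):
    vecs = np.stack([embed(c) for c in chunks])          # one embedding per text chunk
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(query, chunks, index, embed, k=2):
    q = embed(query)
    q = q / np.linalg.norm(q)
    sims = index @ q                                     # cosine similarity to every chunk
    nearest = np.argsort(-sims)[:k]                      # indices of the k closest chunks
    return [chunks[i] for i in nearest]
```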