Product Manager for Keras and Tensorflow high-level APIs. Previously worked on Cloud TPUs (Tensor Processing Units). Passionate about democratizing ML.
Jan 16 • 19 tweets • 5 min read
The "Self-Extend" paper promises magic for your LLMs: extending the context window beyond what they were trained on. You can take an LLM trained on 2000-token sequences, feed it 5000 tokens, and expect it to work. Thread 🧵
(SWA below=sliding window attn.) arxiv.org/abs/2401.01325
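The core trick, as I read the paper, is position remapping rather than retraining: keys near the query keep their exact relative positions, while distant keys are bucketed into groups so that no relative distance ever exceeds the training range. A rough sketch of the bucketing (window and group sizes are illustrative, not the paper's defaults):

```python
def self_extend_positions(query_pos, key_positions, neighbor_window=512, group_size=4):
    """Map key positions to the relative positions Self-Extend would use.

    Keys within `neighbor_window` of the query keep exact relative
    positions; more distant keys get a coarse, floor-divided distance,
    shifted so it starts where the neighbor window ends. The model
    therefore never sees a relative distance larger than it was trained on.
    """
    rel = []
    for k in key_positions:
        d = query_pos - k
        if d <= neighbor_window:
            rel.append(d)  # normal attention: exact distance
        else:
            # grouped attention: bucketed distance, offset to line up
            # with the end of the neighbor window
            rel.append(d // group_size + neighbor_window - neighbor_window // group_size)
    return rel
```

With a 512-token window and groups of 4, a key 1000 tokens away is seen at relative position 634, well inside a 1000-token training range.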
To be fair, some LLMs can already do that, if they are trained with a specific positional encoding like ALiBi (arxiv.org/abs/2108.12409). And before LLMs, Recurrent Neural Networks (RNNs) could do this trick as well, but the ability was lost in Transformers.
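For reference, ALiBi's mechanism fits in a few lines: no positional embeddings at all, just a linear penalty on attention logits proportional to query-key distance. In the paper the slope m is a per-head constant; a single illustrative value here:

```python
import numpy as np

def alibi_bias(n, m=0.5):
    """(n, n) bias added to one head's attention logits: query position i
    penalizes key position j by m * (i - j). The causal mask for j > i is
    applied separately, so future keys simply get zero bias here."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return -m * np.maximum(i - j, 0).astype(float)
```

Because the penalty is a function of distance only, it extrapolates naturally to sequence lengths never seen in training.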
Dec 29, 2022 • 11 tweets • 3 min read
Large Language Models are getting good at formal logic: arxiv.org/abs/2212.13894 LAMBADA: Backward Chaining for Automated Reasoning.
This paper is, in part, a traditional algorithm, a "depth-first search algorithm over the facts and the rules", starting from the desired conclusion and trying to logically reach the premises (facts and rules).
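LAMBADA implements each reasoning step with an LM module, but the skeleton it mirrors is classical depth-first backward chaining over symbolic rules. A toy sketch of that skeleton (not the paper's LM-based version):

```python
def backward_chain(goal, facts, rules, depth=0, max_depth=5):
    """Depth-first backward chaining: try to prove `goal` from `facts`
    using `rules`, a list of (premises, conclusion) pairs.
    Starts from the conclusion and recurses toward the premises."""
    if goal in facts:
        return True  # goal is a known fact
    if depth >= max_depth:
        return False  # give up on overly deep proof branches
    for premises, conclusion in rules:
        if conclusion == goal:
            # a rule concludes the goal: recursively prove its premises
            if all(backward_chain(p, facts, rules, depth + 1, max_depth)
                   for p in premises):
                return True
    return False
```

E.g. with facts = {"rain"} and rules "rain ⇒ wet", "wet ⇒ slippery", the goal "slippery" is proved by chaining backwards through both rules.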
Dec 8, 2022 • 14 tweets • 4 min read
How can you probe what a language model knows? If you ask it directly, it might lie (for example because you prefixed your question with untruths, or for many other reasons).
Contrast-Consistent Search (CCS) gives a way: openreview.net/pdf?id=ETKGuby…
It takes advantage of a nice property of True/False statements: they cannot be True and False at the same time.
Take any statement, "The Eiffel tower is a crab", append True/False to the end and you have two mutually exclusive statements.
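CCS turns that property into a training objective for a probe on the model's activations: the probabilities the probe assigns to the "True" and "False" versions of a statement should be consistent (sum to 1) and confident (not both 0.5). A minimal sketch of the per-pair loss, in the spirit of the paper:

```python
def ccs_loss(p_true, p_false):
    """CCS objective for one statement pair.

    p_true / p_false: probe outputs for the 'X. True' and 'X. False'
    versions of the same statement.
    consistency: the two probabilities should sum to 1
    confidence:  penalize the degenerate solution p_true = p_false = 0.5
    """
    consistency = (p_true - (1.0 - p_false)) ** 2
    confidence = min(p_true, p_false) ** 2
    return consistency + confidence
```

A confident, consistent probe (0.9 / 0.1) scores near zero; the lazy "always 0.5" probe is penalized by the confidence term. The probe is trained unsupervised, with no truth labels at all.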
Thought-provoking new paper from @geoffreyhinton: what if we could replace backpropagation with something better? It seems very unlikely that the human brain uses backpropagation to learn. There is little evidence of backprop mechanics in biological brains (no error derivatives propagating backwards, no storage of neuron activations to use in a backprop pass, ...).
Nov 9, 2022 • 6 tweets • 2 min read
Contrastive Search is the new kid on the block for text generation from language models. Better than greedy search, beam search, top-k, or nucleus sampling.
It can continue text from a prefix with quality indistinguishable from a human's, as judged by human raters.
paper: arxiv.org/abs/2210.14140
In the experiment results above, the model continues a given text and human raters evaluate the result.
The raters preferred text generated by contrastive search 60-70% of the time (green box). When comparing to human output, they were undecided (red ellipses).
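The decoding rule balances model confidence against a degeneration penalty: a candidate token scores well if the model likes it AND its representation is dissimilar from everything already in the context. A sketch with precomputed similarities (alpha is the method's trade-off hyper-parameter; 0.6 here is illustrative):

```python
def contrastive_search_step(candidates, alpha=0.6):
    """Pick the next token under contrastive search.

    candidates: list of (token, model_prob, max_cosine_sim), where
    max_cosine_sim is the candidate representation's highest cosine
    similarity to any token already in the context (the degeneration
    penalty that discourages repetitive loops)."""
    def score(c):
        _, prob, max_sim = c
        return (1 - alpha) * prob - alpha * max_sim
    return max(candidates, key=score)[0]
```

A token the model prefers (prob 0.5) but that closely repeats the context (similarity 0.9) loses to a less likely but fresher token, which is exactly how the method avoids the repetition loops of greedy decoding.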
Feb 15, 2022 • 8 tweets • 2 min read
Google's LaMDA paper arxiv.org/abs/2201.08239 shows yet another information retrieval strategy: the model has been taught to ask a search engine, or a calculator.
The first answer, "It [Eiffel Tower] was constructed in 1887", is generated directly, but also recognized as containing a factual statement. This sends the whole context to LaMDA-Research, which is trained to generate search queries, here "TS, Eiffel Tower, construction date".
Feb 4, 2022 • 6 tweets • 4 min read
This is sweet 🥧 ! arxiv.org/abs/2202.01197
Finally a solid way of teaching a neural network to know what it does not know.
(OOD = Out Of Domain, i.e. not one of the classes in the training data.) Congrats @SharonYixuanLin @xuefeng_du @MuCai7
The nice part is that it's a purely architectural change of the detection network, with a new contrastive loss which does not introduce additional hyper-parameters. No additional data required !
Feb 4, 2022 • 8 tweets • 2 min read
I like the "database layer" developed by DeepMind in their RETRO architecture: deepmind.com/blog/article/l…
It teaches the model to retrieve text chunks from a vast textual database (by their nearest neighbour match of their BERT-generated embeddings) and use them when generating text
It's a bit different from the "memory layer" I tweeted about previously, which provides a large learnable memory, without increasing the number of learnable weights. (for ref: arxiv.org/pdf/1907.05242…)
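Stripped of the transformer plumbing, RETRO's retrieval step is a nearest-neighbour lookup over precomputed chunk embeddings (frozen BERT embeddings in the paper; tiny toy vectors here):

```python
import numpy as np

def retrieve_chunks(query_emb, db_embs, db_chunks, k=2):
    """Nearest-neighbour retrieval over a chunk database, RETRO-style:
    compare the query embedding against precomputed chunk embeddings
    and return the k closest chunks by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity to every chunk
    top = np.argsort(-sims)[:k]       # indices of the k best matches
    return [db_chunks[i] for i in top]
```

In the real system the database holds trillions of tokens and the search uses an approximate-nearest-neighbour index rather than this brute-force scan, but the interface is the same: embeddings in, text chunks out.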
Feb 2, 2022 • 7 tweets • 2 min read
I'm humbled by the recent advances in NLP. I was testing this Keras model on @huggingface (huggingface.co/keras-io/trans…) using the abstract of a random (but good) ML article: arxiv.org/pdf/2002.09405…
Q: "Which examples of simulated environments are given in the text?"
A: "fluids, rigid solids, and deformable materials"
👍 spot on
Aug 2, 2021 • 5 tweets • 2 min read
Here is Mask R-CNN, the most popular architecture used for object detection and segmentation.
The conceptual principle of the R-CNN family is to use a two-step process for object detection: 1) a Region Proposal Network (RPN) identifies regions of interest (ROIs) 2) the ROIs are cut from the image and fed through a classifier.
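Schematically, that two-step pipeline fits in a dozen lines. Here `propose_regions` and `classify` are placeholder callables standing in for the RPN and the second-stage head (the real Mask R-CNN also does ROI-Align, box refinement, and mask prediction):

```python
import numpy as np

def detect(image, propose_regions, classify):
    """Two-step R-CNN-style detection, schematic version:
    1) a region proposer suggests candidate boxes
    2) each ROI is cut out and sent through a classifier."""
    detections = []
    for (y0, x0, y1, x1) in propose_regions(image):
        roi = image[y0:y1, x0:x1]        # cut the ROI from the image
        label, score = classify(roi)     # second-stage classification
        if score > 0.5:                  # keep confident detections only
            detections.append(((y0, x0, y1, x1), label, score))
    return detections
```

The key design point: the expensive classifier only runs on a few hundred proposals instead of every possible window in the image.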
Jul 19, 2021 • 4 tweets • 2 min read
The MobileNet family of convolutional architectures uses depth-wise convolutions where the channels of the input are convolved independently.
Their basic building block is called the "Inverted Residual Bottleneck", compared here with the basic blocks in ResNet and Xception (dw-conv for depth-wise convolution).
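"Convolved independently" is the whole point: in a depthwise convolution each channel gets its own kernel and channels are never mixed (the mixing happens in the following 1x1 pointwise conv). A naive numpy version makes that explicit:

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Depthwise convolution, valid padding, stride 1.
    x: (H, W, C) input, kernels: (kh, kw, C), one kernel per channel.
    Each output channel depends only on the matching input channel."""
    H, W, C = x.shape
    kh, kw, _ = kernels.shape
    out = np.zeros((H - kh + 1, W - kw + 1, C))
    for c in range(C):  # channels processed independently, never mixed
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, c] = np.sum(x[i:i+kh, j:j+kw, c] * kernels[:, :, c])
    return out
```

Compared with a regular conv, this costs C kernels of kh*kw weights instead of C_out kernels of kh*kw*C_in, which is where MobileNet's efficiency comes from.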
Jun 28, 2021 • 4 tweets • 2 min read
I made a ton of ML architecture illustrations for an upcoming book. Starting with good old AlexNet.
Now reading the ARC paper by @fchollet. arxiv.org/abs/1911.01547 “On the measure of intelligence” where he proposes a new benchmark for “intelligence” called the “Abstraction and Reasoning corpus”.
Highlights below -> Chess was considered the pinnacle of human intelligence, … until Deep Blue beat Garry Kasparov in 1997. Today, it is hard to argue that a min-max algorithm with optimizations represents "intelligence".
Sep 13, 2018 • 7 tweets • 3 min read
Google Cloud Platform now has preconfigured deep learning images with Tensorflow, PyTorch, Jupyter, CUDA and cuDNN already installed. It took me some time to figure out how to start Jupyter on such an instance. Turns out it's a one-liner:
Detailed instructions: 1) Go to cloud.google.com/console and create an instance (pick the Tensorflow deep learning image and a powerful GPU)
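The usual pattern, for what it's worth, is an SSH tunnel. The instance name below is hypothetical, and the deep learning images have historically served JupyterLab on port 8080, but verify both against the current GCP docs:

```shell
# forward the VM's Jupyter port to your laptop
# (everything after -- is passed through to ssh)
gcloud compute ssh my-dl-instance -- -L 8080:localhost:8080
# then open http://localhost:8080 in a local browser
```

The tunnel avoids opening Jupyter to the public internet, which you should never do.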
Jan 19, 2017 • 8 tweets • 5 min read
I believe a dev can get up to speed on neural networks in 3h and then keep learning on their own. Ready for a crash course? /1
Got 3 more hours? The "Tensorflow without a PhD" series continues, with a deep dive into modern convolutional architectures.