1) Sliding window attention (e.g., as popularized by Mistral): This technique uses a fixed-size attention window so that the current token attends to only a fixed number of previous tokens instead of all previous tokens, as illustrated in the figure below.
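Below is a minimal sketch of how such a causal sliding-window mask can be constructed in PyTorch; the tiny sequence length and window size are illustrative choices, not Mistral's actual values (Mistral uses a 4096-token window):

```python
import torch

seq_len, window = 6, 3
i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)

# A token at position i may attend to positions j with i - window < j <= i,
# i.e., itself plus the (window - 1) preceding tokens
mask = (j <= i) & (j > i - window)
print(mask.int())
```

Each row of the resulting mask has at most `window` True entries, whereas a regular causal mask would let later rows attend to all previous positions.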
Essentially, you just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for you. Then, feed that instruction back to the LLM, and it will generate a response for you.
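Here's a rough sketch of that two-step process using Hugging Face transformers; the exact pre-query string (the Llama 3 chat-format header up to the start of the user turn) and generation settings are assumptions based on the description above, and the model is gated, so access needs to be requested first:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Step 1: the prompt ends right where a user query would normally begin,
# so the model "fills in" a plausible instruction
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
instruction = tokenizer.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Step 2: feed the generated instruction back as a regular user message
# to obtain the corresponding response
messages = [{"role": "user", "content": instruction}]
chat_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(chat_ids, max_new_tokens=256)
response = tokenizer.decode(out[0][chat_ids.shape[1]:], skip_special_tokens=True)
```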
Why another LoRA alternative, now with high ranks? LoRA (I'd say that's by design) updates the original weights in a relatively limited way, which is great for tasks like instruction finetuning but relatively ineffective for continued pretraining, which requires more capacity. 2/7
Rotary positional embeddings (aka RoPE) have become a cornerstone of modern LLM implementations since they support flexible sequence lengths. In this paper, researchers propose Position Interpolation to increase RoPE-based context window sizes to 32,768 tokens.
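A minimal sketch of the core idea: instead of extrapolating RoPE to unseen positions, the positions are rescaled so the longer sequence is squeezed into the original training range. The angle computation below is the standard RoPE formulation, and the 2,048-token training length is an illustrative assumption:

```python
import torch

def rope_angles(positions, dim=64, base=10000.0):
    # standard RoPE: one frequency per pair of embedding dimensions
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    return torch.outer(positions, inv_freq)  # shape (seq_len, dim // 2)

train_len, target_len = 2048, 32768
positions = torch.arange(target_len).float()

# Position Interpolation: rescale positions into [0, train_len)
interpolated = positions * (train_len / target_len)
angles = rope_angles(interpolated)
# The cos/sin of these angles would then rotate the query/key vectors as usual.
```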
In short, it works by using a new loss function during training, such that each node in the output layer represents the probability that the target label exceeds a given rank. These probabilities can then be converted into the predicted rank label.
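For illustration, here's a hypothetical decoding step in PyTorch, assuming K = 5 rank labels and CORAL-style output nodes where node k predicts P(true rank > k):

```python
import torch

logits = torch.tensor([2.1, 1.3, -0.4, -2.0])  # K - 1 = 4 example outputs
probas = torch.sigmoid(logits)                 # P(rank > 0), P(rank > 1), ...
predicted_rank = (probas > 0.5).sum().item()   # count exceeded thresholds -> 2
```

Counting how many of the K - 1 probabilities exceed 0.5 yields a rank in {0, ..., K - 1}, and the monotonic structure of the loss keeps the per-node probabilities consistent with each other.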
First, why would we want to consider training LLMs for multiple epochs given the enormous amounts of data? Turns out that high-quality text data on the internet grows more slowly than required. Also, if copyrighted material is removed in the future, the available data might shrink even further.
Self-Instruct is one (almost annotation-free) way to align pretrained LLMs with instructions as illustrated in the figure above.
In-context learning (= providing examples of the task in the input) is actually super useful when labeled data is scarce or inaccessible. And it's handy when we don't have direct access to the LLM, i.e., when interacting with the LLM via a UI or API.
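A toy example of what such an in-context (few-shot) prompt might look like; the "training examples" live entirely in the prompt, so no parameters are updated:

```python
prompt = """Classify the sentiment of the movie review as positive or negative.

Review: The plot was predictable and the acting was wooden.
Sentiment: negative

Review: A stunning film with a career-best performance.
Sentiment: positive

Review: I couldn't stop checking my watch.
Sentiment:"""
# Sending this prompt to an LLM (via UI or API) should yield "negative".
```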
In my experience, it makes total sense if we want to apply LLMs to novel data sources (e.g., protein amino acid sequences as ProtBERT demonstrated).
https://twitter.com/rasbt/status/1642161887889567745
Prefix finetuning falls into the "soft prompt" category. In regular hard prompt tuning, we optimize the choice of input tokens to get the desired response.
Btw, I will release the code under the Apache 2.0 license. I am not planning to enforce it; I filed it to prevent big tech or patent trolls from enforcing it.
https://twitter.com/rasbt/status/1641801360462041089
The intuition is that a proper context can steer the LLM toward performing a desired task without updating the LLM's parameters. We learn a set of tokens called a "prefix" that, when the model is conditioned on it, guides the model's output toward the desired behavior.
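Here is a simplified PyTorch sketch of the soft-prompt idea. For brevity, it prepends the learned vectors at the embedding level (strictly speaking, that's prompt tuning; prefix tuning proper injects learned key/value vectors into every transformer layer), and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

emb_dim, prefix_len, vocab_size = 768, 10, 50257

token_embedding = nn.Embedding(vocab_size, emb_dim)
token_embedding.requires_grad_(False)  # pretrained weights stay frozen

# the only trainable parameters: a small matrix of "soft" prefix vectors
prefix = nn.Parameter(torch.randn(prefix_len, emb_dim) * 0.02)

input_ids = torch.tensor([[101, 2009, 2003]])      # some tokenized input
tok_emb = token_embedding(input_ids)               # (1, 3, emb_dim)
prefix_batch = prefix.unsqueeze(0).expand(input_ids.shape[0], -1, -1)
hidden = torch.cat([prefix_batch, tok_emb], dim=1)  # (1, 13, emb_dim)
# `hidden` is then fed through the frozen transformer; during finetuning,
# only `prefix` receives gradient updates.
```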
Before going into low-rank adaptation (LoRA) finetuning & prefix finetuning next, let's take a step back and briefly go over the classic 3 approaches (using a classification context as an example):
In contrast to LLaMA-Alpaca, it's not finetuning the whole model end-to-end. Instead, the adapter approach adds a small number of parameters (1.2M) on top of a pretrained, frozen 7B LLaMA model (as shown in the figure above).
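To get a feel for the parameter budget, here's a generic bottleneck-adapter sketch in PyTorch; note that LLaMA-Adapter itself uses learnable adaption prompts with zero-initialized attention rather than this classic bottleneck design, so the sketch only conveys the "tiny trainable module on a frozen model" idea:

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_dim=4096, bottleneck_dim=16):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x):
        # residual connection so the adapter starts close to identity
        return x + self.up(self.act(self.down(x)))

adapter = BottleneckAdapter()
n_params = sum(p.numel() for p in adapter.parameters())
print(f"{n_params:,} trainable parameters")  # 135,184 per module, vs. 7B frozen
```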
https://twitter.com/rasbt/status/1637803700944093184
The next trend will likely be extending the capabilities with vision, other modalities, and multitask training.
https://twitter.com/rasbt/status/1639625228622917632
BERTScore can be used for translations and summaries, and it captures the semantic similarity better than traditional metrics like BLEU and ROUGE. In particular, it's more robust to paraphrasing.
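A quick usage sketch, assuming the third-party bert-score package (pip install bert-score):

```python
from bert_score import score

candidates = ["the weather is cold today"]
references = ["it is freezing today"]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
# Unlike BLEU/ROUGE, the score is based on contextual embedding similarity,
# so paraphrases like the pair above still score reasonably high.
```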
https://twitter.com/rasbt/status/1639271663735828483
Where BLEU is commonly used for translation tasks, ROUGE is a popular metric for scoring text summaries. Similar to BLEU, it's usually applied to n-grams, but for simplicity, we will focus on 1-grams (single words). There are quite a few similarities between BLEU and ROUGE.
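A minimal from-scratch sketch of 1-gram ROUGE (recall over the reference, precision over the candidate, plus their F1):

```python
from collections import Counter

def rouge_1(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped 1-gram matches
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
# (0.833..., 0.833..., 0.833...)
```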
https://twitter.com/rasbt/status/1638895926399107073
BLEU was originally developed to capture or automate the essence of human evaluation of translated text.
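A quick usage sketch with NLTK's BLEU implementation (smoothing added because short sentences often have zero higher-order n-gram overlap):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of references
candidate = ["the", "cat", "is", "on", "the", "mat"]

smoothie = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smoothie))
```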
Instead of using human-generated instruction-output pairs, they retrieve the data by querying the GPT-3-based text-davinci-003 model. So, Alpaca essentially uses a form of weakly supervised or knowledge-distillation-flavored finetuning.