Is Llama 2 special or just a better iteration of Llama 1? 🤔 Over the weekend, I had time to read the paper Meta released. 📖
Below are some of my findings, which you might have missed. 📝
🧵 1/6
🧠 A 34B version may come later after more testing
⚖️ The 7B model used a ~285:1 token-to-parameter ratio, with loss still decreasing.
💰 Training the 7B would cost ~$1M in AWS compute ($5 per A100-hour on-demand)
🛫 Llama Chat was started before Llama 2 finished training
🧵2/6
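A quick back-of-the-envelope check on the ratio and cost figures above (the GPU-hour count is the one reported in the Llama 2 paper; the $5/hour A100 rate is the on-demand assumption):

```python
# Back-of-the-envelope check of the Llama 2 7B training figures.
params = 7e9
tokens = 2e12          # Llama 2 pre-training corpus (~2T tokens)
ratio = tokens / params
print(f"tokens per parameter: {ratio:.0f}")   # ~286x

gpu_hours = 184_320    # A100 GPU-hours reported for the 7B model
usd_per_hour = 5.0     # assumed AWS on-demand A100 rate
cost = gpu_hours * usd_per_hour
print(f"estimated cost: ${cost / 1e6:.2f}M")  # ~$0.92M
```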
◼️ User prompts were masked/zeroed in SFT & RLHF training
👑 Reward Model (RM) accuracy is one of the most important proxies for Chat model performance
🚀 Collecting data in batches helped improve the overall model, since the RM and LLM were iteratively re-trained.
🧵3/6
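The prompt-masking bullet above boils down to a token-level loss mask: user tokens get weight 0, assistant tokens weight 1, so gradients only flow through the model's own responses. A minimal sketch (function and data layout are hypothetical, not the paper's code):

```python
def build_loss_mask(turns):
    """Zero out loss on user tokens; train only on assistant tokens.

    `turns` is a list of (role, token_ids) pairs; the returned mask
    aligns 1:1 with the concatenated token sequence.
    """
    mask = []
    for role, token_ids in turns:
        weight = 0 if role == "user" else 1
        mask.extend([weight] * len(token_ids))
    return mask

dialog = [("user", [101, 102, 103]), ("assistant", [201, 202])]
print(build_loss_mask(dialog))  # [0, 0, 0, 1, 1]
```

In practice the same effect is usually achieved by setting masked labels to an ignore index before computing cross-entropy.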
🔢 Used Rejection Sampling (RS) to distill knowledge from 70B for a better SFT dataset
🤔 Only used RS for the first 3 versions, then extended to RS + PPO
🆕 Proposed GAtt, inspired by Context Distillation, to augment fine-tuning data for better multi-turn conversations
🧵4/6
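The rejection-sampling step above is essentially best-of-k selection under the reward model: sample several candidates, keep the highest-scoring one, and use the winners as new SFT data. A toy sketch (`generate` and `reward_model` are hypothetical stand-ins):

```python
def rejection_sample(prompt, generate, reward_model, k=4):
    # Draw k candidate responses and keep the one the reward model
    # scores highest; the winners become new SFT training examples.
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda r: reward_model(prompt, r))

# Toy demo: the "reward" is just response length.
responses = iter(["ok", "a longer answer", "hi", "mid answer"])
best = rejection_sample(
    "Explain RLHF",
    generate=lambda p: next(responses),
    reward_model=lambda p, r: len(r),
)
print(best)  # "a longer answer"
```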
💡 RS + RM can boost performance by 10% compared to SFT
🛠 Chat model learned to use tools.
Meta says, “…reinforcement learning proved highly effective, particularly given its cost and time effectiveness. Our findings underscore that the crucial determinant of RLHF’s success lies in the synergy it fosters between humans and LLMs throughout the annotation process.”
OpenLLaMA 13B was released and is competitive with its original counterpart from Meta AI. 🚀🎉 Two months ago, the OpenLM research initiative started to create a permissively licensed open-source reproduction of Meta AI’s LLaMA! 🛫
Last week the team released the 13B weights under Apache 2.0 with evaluations on the lm-evaluation-harness by EleutherAI🔓
OpenLLaMA matches @Meta LLaMA with an avg score of 0.57, making it a drop-in replacement for your commercial use cases🥊
OpenLLaMA is developed by @younggeng and @haoliuhl from Berkeley AI Research.
Thank you for this massive contribution to the open-source and science community!👏🏻🤗
Finally had the time to read the "The False Promise of Imitating Proprietary LLMs" paper in detail. 📚✨ Below are some of my key takeaways: 📝
🔍 Objective:
- The paper aimed to evaluate the effectiveness of models trained on GPT outputs.
🧵 1/4
💻Implementation
- Collected datasets imitating ChatGPT, either for specific tasks or broadly imitating its behavior (0.3M–150M tokens)
- Fine-tuned base LLMs (GPT-2 and LLaMA) on them
- Evaluated with humans and GPT-4 (blind pairwise comparisons against ChatGPT) and on canonical NLP benchmarks
🧵 2/4
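The blind pairwise evaluation in the setup above reduces to a win-rate computation over judgments. A sketch (counting ties as half a win is one common convention, not necessarily the paper's):

```python
def win_rate(judgments):
    # judgments: list of "win" / "tie" / "loss" outcomes vs. ChatGPT
    # from blinded human or GPT-4 pairwise comparisons.
    score = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0
                for j in judgments)
    return score / len(judgments)

print(win_rate(["win", "tie", "loss", "win"]))  # 0.625
```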
💡 Learnings:
- Imitation models learn style, not knowledge
- Improving base LLMs has the highest impact
- Imitating is feasible for distilling a specific behavior for a certain task or use case, as opposed to broadly matching ChatGPT's capabilities
🧵 3/4
StarChat can help you:
🙋🏻‍♂️ Answer coding questions in over 80 languages, including Python, Java, C++, and more!
🧠 Explain concepts and help debug your code
📊 Generate sample code for data visualizations and plots in Python
💬 Iterate together to solve your coding errors
🧵2/4
We fine-tuned StarChat Beta on the new StarCoderPlus (15B) ⭐️, a further-trained version of StarCoder on 600B tokens from the English web dataset RefinedWeb (the Falcon dataset 🦅) 🔥
StarChat and StarCoder are open and can be used for commercial use cases 🤑