The recently-proposed YOLO-NAS object detector achieves a mean average precision (mAP) of 52 on Microsoft COCO with <5 ms of latency, while YOLOv1 requires 30-50 ms of latency to achieve comparable mAP. Here are the three key ideas that make YOLO-NAS fast and performant… 🧵 [1/8]
1. Neural Architecture Search (NAS)
The authors of YOLO-NAS define a search space of 10^14 possible architecture configurations inspired by YOLOv6/YOLOv8, then use a hardware-aware NAS algorithm called AutoNAC to discover a suite of models with different performance/latency tradeoffs. [2/8]
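AutoNAC itself isn't fully described publicly, but here's a minimal sketch of the general idea behind hardware-aware NAS: sample candidate configurations, reject those that blow the latency budget on the target hardware, and keep the most accurate of the rest. The search space, accuracy proxy, and latency model below are all made up for illustration.

```python
import random

# Hypothetical search space loosely inspired by YOLO-style detectors (illustrative only).
SEARCH_SPACE = {
    "depth_per_stage": [2, 3, 4, 6],
    "width_multiplier": [0.5, 0.75, 1.0, 1.25],
    "block_type": ["rep_conv", "csp", "elan"],
}

def sample_config():
    """Sample one architecture configuration from the search space."""
    return {key: random.choice(options) for key, options in SEARCH_SPACE.items()}

def accuracy_proxy(config):
    """Stand-in for a cheap accuracy estimate (e.g., a zero-cost proxy or short training run)."""
    return 0.4 + 0.05 * config["width_multiplier"] + 0.01 * config["depth_per_stage"]

def measure_latency_ms(config):
    """Stand-in for profiling the candidate on the target hardware."""
    return 1.0 + 0.8 * config["width_multiplier"] * config["depth_per_stage"]

def hardware_aware_search(num_candidates=1000, latency_budget_ms=5.0):
    """Keep the most accurate sampled candidate that fits the latency budget."""
    best, best_acc = None, -1.0
    for _ in range(num_candidates):
        cfg = sample_config()
        if measure_latency_ms(cfg) > latency_budget_ms:
            continue  # hardware-aware: reject configs that exceed the budget
        acc = accuracy_proxy(cfg)
        if acc > best_acc:
            best, best_acc = cfg, acc
    return best, best_acc

print(hardware_aware_search())
```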
2. Model Quantization
YOLO-NAS adopts a multi-part quantization strategy to minimize the latency of the resulting model. First, it uses a quantization-aware structure within each of its layers and leverages quantization-aware training. [3/8]
YOLO-NAS also adopts a hybrid/dynamic quantization strategy that avoids quantizing layers that damage performance. As a result, YOLO-NAS loses only ~0.5 mAP during post-training quantization, whereas most other models see a 1-2 point degradation in mAP. [4/8]
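To make the idea of hybrid/selective quantization concrete, here's a minimal sketch using PyTorch's dynamic post-training quantization, where only layers believed to tolerate INT8 get quantized. The toy model, layer names, and sensitivity assumptions are mine for illustration; YOLO-NAS's actual quantization pipeline is more involved than this.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic, default_dynamic_qconfig

# Toy stand-in for part of a detection model (layer names and "sensitivity" are
# assumptions for illustration, not actual YOLO-NAS layers).
class TinyHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(64, 64)     # assume: tolerant to INT8 quantization
        self.cls_out = nn.Linear(64, 80)  # assume: sensitive, keep in FP32

    def forward(self, x):
        return self.cls_out(torch.relu(self.proj(x)))

model = TinyHead().eval()

# Selective/hybrid post-training quantization: only quantize the layer we believe
# is robust to INT8, and leave the "sensitive" layer in full precision.
quantized = quantize_dynamic(
    model,
    qconfig_spec={"proj": default_dynamic_qconfig},
    dtype=torch.qint8,
)
print(quantized)  # `proj` becomes a dynamically quantized Linear; `cls_out` stays FP32
```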
For an in-depth, recent analysis of post-training quantization, I would recommend the following paper. It provides a ton of useful, accessible, and up-to-date context on the topic. [5/8]
3. Training Strategy
YOLO-NAS is trained in three parts: pre-training over the Objects365 dataset, training over pseudo/weakly-labeled data, and knowledge distillation from a larger model. These components allow the lightweight YOLO-NAS model to punch above its weight in performance. [6/8]
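The knowledge distillation piece can be summarized with a generic distillation loss: blend the usual hard-label loss with a term that pushes the student toward the teacher's softened outputs. Here's a minimal sketch; the temperature, weighting, and classification-style loss are illustrative stand-ins, not YOLO-NAS's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Generic knowledge-distillation loss: mix the usual hard-label loss with a
    soft-label term that matches the (temperature-softened) teacher distribution.
    The hyperparameters here are illustrative, not YOLO-NAS's actual settings."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Dummy usage:
student = torch.randn(8, 80)          # student logits for a batch
teacher = torch.randn(8, 80)          # logits from the larger, frozen teacher
labels = torch.randint(0, 80, (8,))   # ground-truth classes
print(distillation_loss(student, teacher, labels))
```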
TL;DR: YOLO-NAS achieves impressive latency (200 FPS even with the largest model!) and performs well relative to other lightweight object detectors. Thanks to my friend @DataScienceHarp for sharing this interesting research with me! [7/8]
YOLO-NAS is completely open-source, so you can access the model, look at training or fine-tuning examples, and read more about how it works on the associated GitHub page. [8/8]
We can use few-shot learning and instruction prompting to solve many problems with large language models (LLMs), but what should we do when these techniques fall short? Here are three advanced prompting approaches that can be used to solve complex problems with LLMs… 🧵 [1/8]
1. Chain of Thought (CoT) Prompting
LLMs are typically poor at solving multi-step or reasoning-based problems. CoT prompting mitigates this problem by encouraging LLMs via few-shot learning to generate a step-by-step problem-solving rationale along with their final answer. [2/8]
Prior work has shown that teaching language models (e.g., via fine-tuning) to generate problem-solving rationales improves their reasoning capabilities. CoT prompting drastically improves LLM performance on commonsense, arithmetic, and symbolic reasoning benchmarks. [3/8]
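Here's a minimal sketch of what CoT prompting looks like in practice: we insert a few exemplars that contain a step-by-step rationale, then append the new question. The exemplar wording and the (commented-out) complete() call are placeholders for your own data and LLM client.

```python
# Hypothetical few-shot CoT prompt construction; the exemplar and `complete()` are
# placeholders for your own data and LLM client.
COT_EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        "rationale": "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(question):
    """Insert step-by-step rationales into the prompt so the LLM imitates them."""
    parts = []
    for ex in COT_EXEMPLARS:
        parts.append(f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_cot_prompt("A farm has 15 cows and sells 6 of them. How many cows remain?")
print(prompt)
# answer = complete(prompt)  # call whatever LLM API you use here
```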
Prompt engineering for language models usually involves tweaking the wording or structure of a prompt. But, recent research has explored automated prompt engineering via continuous updates (e.g., via SGD) to a prompt’s embedding. Here’s how these techniques work… 🧵 [1/8]
First, what is a prompt embedding? Given some raw text, a language model will tokenize the text (creating a list of words/sub-words) then look up each token’s associated embedding. The resulting list of token embeddings can also be referred to as a prompt embedding. [2/8]
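Here's a minimal sketch of that lookup using Hugging Face transformers; GPT-2 is just an illustrative choice of model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# GPT-2 is just an illustrative choice; any language model works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

text = "Translate this sentence to French:"
token_ids = tokenizer(text, return_tensors="pt").input_ids       # text -> token ids
with torch.no_grad():
    prompt_embedding = model.get_input_embeddings()(token_ids)   # ids -> token embeddings

print(token_ids.shape)          # (1, num_tokens)
print(prompt_embedding.shape)   # (1, num_tokens, hidden_size)
```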
AutoPrompt combines the original prompt input with a set of shared “trigger token” embeddings that are selected/trained to improve model performance by using gradient descent. Trigger tokens are shared across all inputs provided to the language model. [3/8]
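AutoPrompt's search is over discrete tokens, but it is still gradient-guided: candidate replacements for a trigger token are ranked by a first-order (HotFlip-style) approximation of how much they would reduce the loss. Here's a rough sketch of that scoring step with dummy tensors; a real implementation obtains the gradient by running the full model and backpropagating the task loss.

```python
import torch

# Dummy tensors; a real implementation gets `grad_wrt_trigger` by backpropagating the
# task loss through the model to the embedding of one trigger-token position.
vocab_size, hidden_size = 50_000, 768
embedding_matrix = torch.randn(vocab_size, hidden_size)   # the model's token embedding table
grad_wrt_trigger = torch.randn(hidden_size)                # dLoss/d(trigger embedding)

# First-order estimate of how the loss changes if we swap the trigger token for each
# vocabulary entry; keep the candidates predicted to lower the loss the most.
approx_loss_change = embedding_matrix @ grad_wrt_trigger
candidate_token_ids = torch.topk(-approx_loss_change, k=10).indices
print(candidate_token_ids)
```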
I’m currently writing a survey/overview of important prompt engineering tactics for my newsletter. Here are my top-5 findings so far… 🧵 [1/7]
1. Start simple
Prompt engineering is an empirical science. We need to start with a simple baseline, then slowly add complexity. Starting with a long, complex prompt wastes tokens and might perform worse than something simple. [2/7]
2. Prompt tracking and versioning
We need a mechanism to track different versions of a prompt over time. We can do this by combining git with tools like Prompt Templates in LangChain. [3/7]
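For example (a rough sketch — note that import paths vary across LangChain versions, and the template text here is made up), keeping prompts as PromptTemplate objects in source files means every wording change shows up as an ordinary git diff:

```python
from langchain.prompts import PromptTemplate

# Hypothetical prompt kept in a version-controlled source file; the template text
# and variables are made up for illustration.
summarize_v2 = PromptTemplate.from_template(
    "You are a helpful assistant.\n"
    "Summarize the following article in {num_sentences} sentences:\n\n{article}"
)

print(summarize_v2.format(num_sentences=3, article="Large language models are..."))
```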
Many different (text-based) transformer architectures exist, but when and where should we use them? Here’s a quick list of four important transformer variants and the best applications to use them for… 🧵 [1/7]
To gain a better understanding of these architectures, please check out the tweet below! In this thread, we will focus on the tasks for which each architecture is most appropriate, rather than the architectures themselves. [2/7]
Decoder-only transformers are used by (large) language models. Why? Because their use of masked self-attention makes them highly compatible with next-token prediction. Each token only pays attention to prior tokens in the sequence! [3/7]
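Here's a tiny illustration of that masking with dummy attention scores: after applying a lower-triangular (causal) mask, each row of the attention matrix only places weight on the current and earlier positions.

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                  # dummy raw attention scores
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # lower-triangular mask

# Positions may not attend to later tokens, so those scores become -inf
# and receive zero weight after the softmax.
scores = scores.masked_fill(causal_mask == 0, float("-inf"))
attention_weights = torch.softmax(scores, dim=-1)
print(attention_weights)  # upper triangle is all zeros
```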
Large Language Models (LLMs) are notoriously bad at solving reasoning-based tasks. However, we can drastically improve their reasoning performance using simple techniques that require no fine-tuning or task-specific verifiers. Here’s how… 🧵 [1/7]
The technique is called chain-of-thought (CoT) prompting. It improves the reasoning abilities of LLMs using few-shot learning. In particular, CoT prompting inserts several examples of “chains of thought” for solving a reasoning problem into the LLM’s prompt. [2/7]
Here, a chain of thought is defined as “a coherent series of intermediate reasoning steps that lead to the final answer for a problem”. A CoT mimics how we solve reasoning problems as humans -- by breaking the problem down into intermediate steps that are easier to solve. [3/7]
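Concretely, here is the difference between a standard few-shot exemplar and a CoT exemplar; the question and rationale wording below are illustrative.

```python
# Illustrative wording: the same question, with and without a chain of thought.
STANDARD_EXEMPLAR = (
    "Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A: The answer is 9."
)

COT_EXEMPLAR = (
    "Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A: The cafeteria started with 23 apples. After using 20, 23 - 20 = 3 remain. "
    "Buying 6 more gives 3 + 6 = 9. The answer is 9."
)

print(COT_EXEMPLAR)
```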
Can large language models (LLMs) train themselves? Recent research indicates that the answer might be yes… 🧵 [1/7]
But, what exactly do we mean by this? One notable method of using LLMs to train other LLMs involves using these models to generate data for instruction tuning. Typically, a larger, more powerful model is used for generation. [2/7]
This technique was pioneered by the self-instruct framework. Beginning with a small set of initial tasks (including one instruction and one input-output example per task), self-instruct uses LLMs to generate more data for instruction tuning. [3/7]
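Here's a minimal sketch of that loop: seed tasks get formatted into a meta-prompt, a larger teacher LLM proposes new instruction/input/output triples, and the (filtered) results grow the pool. The seed task, complete() call, and filter_low_quality() helper are all hypothetical placeholders.

```python
import json
import random

# Hypothetical seed pool; `complete()` and `filter_low_quality()` are placeholders
# for your LLM client and quality/novelty filters.
seed_tasks = [
    {
        "instruction": "Classify the sentiment of the given review as positive or negative.",
        "input": "The battery dies within an hour.",
        "output": "negative",
    },
]

def build_meta_prompt(tasks, num_new=3):
    """Show a few existing tasks, then ask the teacher LLM for new ones in the same format."""
    shown = "\n\n".join(json.dumps(t, indent=2) for t in random.sample(tasks, k=min(3, len(tasks))))
    return (
        "Here are example tasks, each with an instruction, an input, and an output:\n\n"
        f"{shown}\n\n"
        f"Write {num_new} new, diverse tasks in the same JSON format."
    )

meta_prompt = build_meta_prompt(seed_tasks)
print(meta_prompt)
# new_tasks = json.loads(complete(meta_prompt))        # generate with the larger teacher model
# seed_tasks.extend(filter_low_quality(new_tasks))     # filter, then grow the task pool
```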