The problem with LLM agent frameworks is that they need a different level of abstraction. Chaining workflows together is too rigid and brittle. Do humans wire each other to cooperate? We need more dynamic consensus-building abstractions. We need systems that anticipate and are robust to multiple failures while persistently seeking their goals.
What's surprising is that this new frontier is very predictable under the lens of C. S. Peirce's Architectonic: ideas from more than a century ago. iep.utm.edu/peircear/
Feb 8 • 5 tweets • 6 min read
1/n No Search, No Problem: Achieving Grandmaster Level Using Only a Transformer
A new research paper presents a groundbreaking advancement in chess-playing artificial intelligence, demonstrating for the first time that it is possible to train a neural network to play chess at a grandmaster level without relying on explicit search techniques. This finding challenges the long-held belief that sophisticated search algorithms are indispensable for mastering complex games like chess.
Historically, chess AIs such as Deep Blue and AlphaZero have depended on robust evaluation functions, extensive opening books, and advanced search techniques like alpha-beta pruning and Monte Carlo tree search to anticipate future moves. The question of whether neural networks could achieve expert-level play through supervised learning alone, without the computational overhead of search algorithms, remained open until now.
The breakthrough came by harnessing the power of modern transformers, scaled up to 270 million parameters, and training them on a dataset of 10 million human chess games annotated with strategic evaluations by the Stockfish 16 chess engine. This approach allowed the neural network to predict Stockfish's evaluations of new board positions accurately.
The performance of this neural network is exceptional, surpassing AlphaZero's value and policy networks, solving 93.5% of a wide range of chess puzzles, and achieving a blitz rating of 2895 on Lichess, a score higher than that of most grandmasters. Remarkably, this was achieved without employing any search strategies beyond evaluating all potential next moves.
This significant finding reveals that with enough model capacity and a substantial training dataset, it is possible to distill the complex search and evaluation algorithms of advanced chess engines like Stockfish into the parameters of a neural network. This represents a paradigm shift, suggesting that capable chess AIs can be developed without the need for manually designed heuristics or search algorithms.
The success of this approach underscores the potential of using transformers and supervised learning to approximate complex algorithms, opening new avenues for research into how far this technique can eliminate the need for search in strategic reasoning and how well it applies to other domains. This work not only marks a milestone in AI chess but also has broader implications for the future of artificial intelligence in strategic reasoning tasks.
2/n Method details
Here is a detailed overview of the method used in the paper to create a transformer-based chess engine:
Data Collection and Annotation
- Download 10 million chess games played by humans on Lichess
- Extract all unique board positions from these games
- For each board position, use the Stockfish 16 chess engine to compute:
  - State-value: Win percentage prediction (0-100%)
  - Action-values: Win percentage for all legal moves
  - Best move: Move with highest action-value
- This results in over 15 billion state-action pairs annotated with Stockfish evaluations
- Use a standard transformer architecture from recent LLMs
- 8 attention heads
- Post-layer normalization
- 270 million parameters
- Input representation: 77-token encoding of current board FEN string
- Output heads for value regression and action classification
- Train the transformer to predict the Stockfish values using standard supervised learning
- Cross-entropy loss for classification over value bins
- Adam optimizer
- Train for 10 million steps (2.7 epochs)
- Batch size 4096 on 128 TPUs
- Construct three policies based on network outputs:
1. Choose move with highest predicted action-value
2. Choose move that minimizes predicted next-state value
3. Pick highest probability move from policy head
- Assess performance on:
- Puzzles: % solved correctly
- Prediction accuracy: State-value MSE, action accuracy
- Chess rating: Elo score from games against humans and bots
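The three policies and the value-bin targets listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the dictionaries of predicted values stand in for network outputs, and the helper names (`win_prob_to_bin`, `policy_action_value`, `policy_state_value`, `policy_head`) and the bin count passed in are hypothetical.

```python
from typing import Dict


def win_prob_to_bin(win_prob: float, num_bins: int) -> int:
    """Map a win probability in [0, 1] to a class index for cross-entropy training."""
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must lie in [0, 1]")
    # The top edge (1.0) is folded into the last bin.
    return min(int(win_prob * num_bins), num_bins - 1)


def policy_action_value(action_values: Dict[str, float]) -> str:
    """Policy 1: pick the move with the highest predicted action-value."""
    return max(action_values, key=action_values.get)


def policy_state_value(next_state_values: Dict[str, float]) -> str:
    """Policy 2: pick the move that leaves the opponent the worst position.

    Each value is the predicted win probability for the side to move in the
    position reached after the move, which is the opponent.
    """
    return min(next_state_values, key=next_state_values.get)


def policy_head(move_probs: Dict[str, float]) -> str:
    """Policy 3: pick the move the policy head rates most probable."""
    return max(move_probs, key=move_probs.get)
```

None of the three policies looks more than one move ahead, which matches the claim that no search is used beyond evaluating the legal next moves.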
Feb 7 • 7 tweets • 5 min read
1/n The Self-Discovery That's Redefining Reasoning
The self-discover method outlined in a new paper from Google marks a significant advancement in enhancing the reasoning capabilities of large language models (LLMs). It breaks away from the limitations imposed by predefined paradigms, allowing models to create unique reasoning structures tailored to each task. This flexibility not only improves performance but also provides valuable insights into structured reasoning.
Traditionally, language models have struggled with a one-size-fits-all approach to reasoning, leading to challenges in handling diverse tasks. While methods like step-by-step prompting have shown promise, they often fall short when faced with tasks requiring alternative reasoning flows. Self-discover addresses this issue by dynamically composing reasoning building blocks, enabling models to identify relevant modules and integrate them into customizable workflows.
Moreover, this approach overcomes the rigidity of human-authored templates, which are often suboptimal for unfamiliar domains. By granting models the freedom to create bespoke scaffolding through directed composition, rather than imposing logic chains from the top down, self-discover embraces the inherent complexity of reasoning. This leads to significantly improved performance on multifaceted tasks while maintaining efficiency in inference.
Analysis further reveals that the structures generated by self-discover exhibit transferability across models, indicating universal traits. This methodology provides transparent insights into how models encode reasoning processes, resembling compositional hierarchies found in human cognition. While there may be performance plateaus in the future, self-discover represents an exploratory venture into emergent reasoning by artificial agents, transcending the constraints imposed by human boundaries.
By prioritizing student-driven synthesis of reasoning forms over predefined routines, this inquiry unlocks previously inconceivable problem-solving patterns for models. It heralds an era where we can learn as much from machines about chained cognition as they can learn from our elucidations. This illumination of structure genesis across models advances efforts to cultivate generalizable, composable thought.
2/n Here are some key pain points of existing systems for improving language model reasoning, and how Self-Discover addresses them:
1. Reliance on fixed reasoning paradigms:
- Existing methods like chain-of-thought rely on a single predetermined reasoning approach that does not suit every task.
- Self-Discover allows models to compose task-specific structures from modular blocks.
2. Lack of flexibility:
- Methods depend on human-authored decompositions or structures.
- Self-Discover enables models to self-direct structure creation.
3. Failure to adapt structure to task:
- Even learned approaches optimize one structure for all tasks.
- Self-Discover discovers custom structures per task, unlocking greater reasoning potential.
4. Inference inefficiency:
- Ensemble and multi-sample approaches are computationally expensive.
- Self-Discover matches or exceeds their performance with 10-40x fewer calls.
In summary, by enabling language models themselves to flexibly compose reasoning building blocks suited to novel tasks, Self-Discover overcomes the brittleness, inflexibility, and inefficiency of existing reasoning systems.
The automated discovery process allows capturing unique reasoning patterns for each task in a way that static approaches cannot. This self-directed composition of reasoning workflows is the critical driver of enhanced performance.
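The idea of composing a task-specific structure from modular blocks can be sketched as follows. This is a rough illustration, not the paper's implementation: `llm` is a hypothetical callable that maps a prompt to a completion, the module list is made up for the example, and the prompt wording is an assumption.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out

# Illustrative building blocks; not the paper's actual module list.
REASONING_MODULES: List[str] = [
    "Break the problem into smaller sub-problems.",
    "Think step by step.",
    "Look at the problem critically and question assumptions.",
]


def discover_structure(llm: LLM, task_examples: List[str]) -> str:
    """Ask the model to pick modules, then compose them into a task-level plan."""
    examples = "\n".join(task_examples)
    modules = "\n".join(f"- {m}" for m in REASONING_MODULES)
    selected = llm(
        f"Task examples:\n{examples}\n\nModules:\n{modules}\n\n"
        "Select the modules that are useful for this kind of task."
    )
    return llm(
        f"Task examples:\n{examples}\n\nSelected modules:\n{selected}\n\n"
        "Compose these modules into a step-by-step reasoning structure."
    )


def solve(llm: LLM, structure: str, question: str) -> str:
    """Reuse the discovered structure: one model call per question."""
    return llm(
        f"Reasoning structure:\n{structure}\n\nQuestion:\n{question}\n\n"
        "Follow the structure and give the final answer."
    )
```

In this sketch the structure is discovered once per task and then reused, so each new question costs a single call. That is one way to read the efficiency point in item 4.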
Feb 3 • 6 tweets • 7 min read
1/n Discovered this book (h/t @Extended_Brain). Let's look into some nuggets of wisdom!
2/n In his chapter on "Personal Knowledge," Michael Polanyi argues that all knowledge involves personal participation and commitment on the part of the knower. He introduces the concept of "tacit knowing" to describe the process by which personal knowledge is accumulated. Tacit knowing stands in contrast to the ideals of detached objectivity and value neutrality often associated with scientific knowledge.
At the heart of tacit knowing is subsidiary awareness—attending to one thing by focusing on another related or connected thing. For example, we may identify a person by his clothes, or we attend to the weight of a hammer in our palm as we focus on driving the nail. What we are focally aware of and what we are subsidiarily aware of mutually depend on each other in tacit knowing. Our subsidiary awareness of clues, instruments, and context allows us to comprehend the focal target, while the target itself determines what counts as clues or instruments relevant to discerning its nature.
Tacit knowing pervades multiple forms of skillful achievement, including practical skills like cycling and swimming but also more abstract capabilities like reading comprehension or facial recognition. It has a from-to structure—we go from perception of subsidiaries to comprehension of a coherent whole. This always involves our active shaping and organizing of subsidiaries to integrate them for meaning.
Polanyi identifies three key aspects to tacit knowing: functional, phenomenal, and semantic. The functional aspect is the from-to relation itself and how we dwell in the particulars to attend to the whole. The phenomenal aspect is that through integrative acts like binocular vision or reading, we achieve a new phenomenal experience beyond what direct inspection of the parts would indicate. Finally, the semantic aspect is the meaning-giving relationship where subsidiaries acquire their sense by bearing on the focus.
An important implication is that all knowledge depends on personal judgment to turn clues into comprehension. There are no explicit rules determining what coheres or what is meaningful. As Polanyi puts it, "into every act of knowing there enters a tacit and passionate contribution of the person knowing what is being known." While aiming at an external reality, our understanding relies fundamentally on internal processes of integration that connect knower and known. Tacit knowing is an inescapable and universal feature of human knowledge.
Feb 2 • 4 tweets • 4 min read
1/n A Taxonomy for Multi-Modal Large Language Models
The architecture consists of 5 key components:
1. Modality Encoder: Encodes inputs from modalities like image, video, audio into feature representations. Common options include NFNet-F6, ViT, CLIP ViT, C-Former, etc.
2. Input Projector: Aligns non-text modality features to the text feature space of the LLM. This uses cross-attention, Q-Former, P-Former, or simple MLPs/linear layers.
3. LLM Backbone: Core large language model that processes aligned multi-modal representations and generates textual outputs + signal tokens for conditional generation. Popular choices are Flan-T5, Vicuna, OPT, LLaMA, etc.
4. Output Projector: Maps signal token representations into features that can be understood by the Modality Generator. Uses a Tiny Transformer or MLP.
5. Modality Generator: Generates outputs in modalities like image, video, audio conditioned on the mapped features. Typically uses off-the-shelf latent diffusion models like Stable Diffusion, AudioLDM, etc.
The training pipeline has 2 key stages:
1. Multi-Modal Pre-Training: trains the Input and Output Projectors using image-text, video-text, audio-text datasets to align modalities. May fine-tune small trainable parameters in LLM backbone using methods like prefix tuning.
2. Multi-Modal Instruction Tuning: further trains the model on instruction-formatted datasets using reinforcement learning from human feedback. This enhances the model's alignment with human preferences and its interaction capabilities.
2/n The input process flow
1. Modality Encoder:
- Encodes inputs from modalities like image, video, audio into feature representations.
- Input: An image of a cat
- CLIP ViT encoder encodes it into a 768-d feature vector representing the visual concepts in the image
2. Input Projector
- Projects non-text modality features into the textual feature space of LLM
- The 768-d cat image feature from CLIP ViT
- A linear layer projects it into a 1024-d vector aligned with text vector space
- Other options like cross-attention, Q-Former can also achieve this alignment
3. LLM Backbone
- Core large language model that processes the aligned multi-modal representations
- The 1024-d projected cat image feature vector
- Textual caption describing the image: "A cute cat playing with a ball of yarn"
- These text and image features are fed into the LLM backbone like OPT or LLaMA
- The LLM encodes them into a joint representation in its latent space and generates relevant outputs
So in summary, the modality encoders create non-text representations, input projectors transform them into an LLM-compatible space, and LLM backbone fuses information from all aligned modalities to understand concepts across modalities. The flow enables the fusion of multi-modal knowledge into the LLM.
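The projection step in the cat example can be sketched with a plain linear layer. This is a toy illustration under stated assumptions: the weights are random stand-ins for a trained projector, the constant 768-d vector stands in for a real CLIP ViT feature, the caption embeddings are zero placeholders, and all function names are hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def make_linear(in_dim: int, out_dim: int, seed: int = 0) -> Matrix:
    """Create an out_dim x in_dim weight matrix of small random values (untrained)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights: Matrix, features: Vector) -> Vector:
    """Input projector as a single linear layer: y = W x."""
    if any(len(row) != len(features) for row in weights):
        raise ValueError("feature size does not match the projector's input size")
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def build_llm_input(image_token: Vector, text_tokens: List[Vector]) -> List[Vector]:
    """Place the projected image vector in front of the text embeddings."""
    return [image_token] + text_tokens


# A 768-d vector standing in for the encoder's feature of the cat image.
image_feature = [0.1] * 768
projector = make_linear(in_dim=768, out_dim=1024)
image_token = project(projector, image_feature)

# Placeholder embeddings in the 1024-d text space, one per caption word.
caption_embeddings = [[0.0] * 1024 for _ in range(9)]
llm_input = build_llm_input(image_token, caption_embeddings)
```

After projection the image vector has the same width as the text embeddings, which is what lets the backbone treat both as one sequence.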
Feb 1 • 7 tweets • 4 min read
1/n Introducing RAPTOR
Existing RAG methods suffer from a major limitation: they can only retrieve short, contiguous passages of text. This restricts their capacity to represent cross-document discourse structure and leverage thematic information scattered across lengthy corpora. As a result, performance suffers on complex questions requiring multi-step inference or synthesis of knowledge from multiple sections.
Fixed language models also face challenges staying up-to-date, as baking vast world knowledge into model parameters makes it arduous to edit or append facts. Yet relying on outdated embedded knowledge severely impairs real-world reliability and accuracy.
This paper introduces RAPTOR, a novel recursive abstraction paradigm that overcomes both issues through hierarchical multi-document representation. RAPTOR segments text, then recursively clusters, summarizes, and embeds passages. This structures corpora into multi-layer trees encoding information at varying levels of abstraction.
Querying this rich tree representation allows integrating details and high-level themes simultaneously. Controlled experiments exhibit consistent improvements over baseline retrievers across several QA datasets. Moreover, by augmenting powerful readers like GPT-4, RAPTOR reaches new state-of-the-art results on multifaceted reasoning tasks requiring nuanced understanding of lengthy narratives.
Modularizing knowledge into RAPTOR’s index also facilitates updating world facts. As corpus contents evolve, the reader persists unaltered, flexibly adapting to current information needs. This crucial agility makes RAPTOR invaluable for dynamic real-world deployments.
In summary, RAPTOR provides a sorely lacking solution for multi-document reasoning and updatable retrieval-based QA. Leveraging recursive summarization and abstraction, it encodes corpora with sufficient semantic depth for complex queries. RAPTOR delivers substantial gains; its strong empirical performance confirms the merits of tree-based hierarchical retrieval augmentation.
2/n The RAPTOR process:
1. Text Segmentation
- Split retrieval corpus into short, contiguous chunks of 100 tokens, similar to traditional methods
- Keep sentences intact even if over 100 tokens to preserve coherence
2. Text Embedding
- Embed text chunks using SBERT to get dense vector representations
3. Clustering
- Employ soft clustering using Gaussian Mixture Models and UMAP dimensionality reduction
- Vary UMAP parameters to identify global and local clusters
- Use Bayesian Information Criterion for model selection to determine optimal number of clusters
4. Summarization
- Summarize the chunks in each cluster using a language model
- Results in a condensed summary capturing key information
5. Node Creation
- Clustered chunks + corresponding summary = new tree node
6. Recursive Processing
- Repeat steps 2-5: Re-embed summaries, cluster nodes, generate higher level summaries
- Forming a multi-layer tree from the bottom up
- Until clustering is infeasible (final root node summarizes the entire corpus)
7. Querying
- Two methods: tree traversal (top-down layer by layer) or collapsed tree (flattened view)
- For each, compute cosine similarity between query and nodes to find most relevant
So in summary, RAPTOR leverages recursive clustering and summarization of text chunks to create a hierarchical tree structure for more effective contextual retrieval.
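Two of the steps above, sentence-preserving segmentation and collapsed-tree querying, are simple enough to sketch directly. This is a minimal sketch under stated assumptions rather than RAPTOR's code: whitespace-separated words stand in for tokenizer tokens, the embeddings are supplied by the caller, and the function names are hypothetical.

```python
import math
import re
from typing import Dict, List, Tuple


def segment(text: str, max_tokens: int = 100) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    A sentence longer than max_tokens becomes its own chunk rather than being cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def collapsed_tree_search(
    query_vec: List[float], nodes: Dict[str, List[float]], top_k: int
) -> List[Tuple[str, float]]:
    """Score every node (leaf chunks and summaries alike) and keep the top_k."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in nodes.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```

Because the collapsed view ranks leaves and summaries in one pool, a single query can return a detailed chunk and a high-level summary together.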
Jan 30 • 8 tweets • 6 min read
1/n Exploiting Large Language Models (LLMs), RAG and KGs for Creative Design
A recent paper makes a compelling case for the tremendous yet untapped potential of large language models (LLMs) to transform materials science research. However, the authors thoughtfully acknowledge critical "pain points" in relying solely on the raw capabilities of LLMs in this complex domain. Accuracy, nuance, interpretability, reasoning - on all these fronts, LLMs fall short without a guiding hand.
That's exactly why this paper shines. It outlines strategies to partner with LLMs to elicit their strengths while overcoming weaknesses. Retrieval augmentation (RAG) provides the context that LLMs lack to ground responses. Knowledge graphs (KGs) organize concepts ontologically to lend structure and meaning. Non-linear prompting channels creativity through critical filters. Diverse model collectives enable cooperative discovery.
What emerges is a vision for a new paradigm - LLMs not as opaque oracles, but as flexible components in an intelligible, distributed materials discovery infrastructure. One where human researchers set the objectives, models rapidly compound knowledge through code and data, and reciprocal feedback loops drive exploration.
This paper thus makes a timely case: to fully realize the benefits of AI in advancing materials science, we must raise these powerful models to collaborators in a hybrid intelligence system built on transparency, trust, and shared creativity fueled by human curiosity.
2/n Main strategies covered:
1) Retrieval-augmented generation (RAG) methods to inject additional knowledge into the generative process to improve accuracy. RAG is highlighted as a powerful approach, especially when combined with graph-based methods.
2) Ontological knowledge graphs to provide interpretable structure that captures concepts and relationships. This facilitates mechanistic insights and more detailed responses from the LLM.
3) Nonlinear sampling techniques like tree-of-thought prompting to iteratively refine and improve responses, overcoming limitations of single-shot linear sampling.
4) Multi-agent models where specialized LLMs collaborate and interact autonomously to solve complex multimodal problems. This illustrates promise for advanced applications like automated force-field development.
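Strategies 1 and 2 combine naturally: facts retrieved from a knowledge graph become the context injected into the prompt. The sketch below is a toy under stated assumptions, not the paper's pipeline. The three triples are a hand-written example graph, retrieval is plain substring matching, and the names `retrieve_triples` and `build_prompt` are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Toy knowledge graph written for this example.
KG: List[Triple] = [
    ("spider silk", "has_property", "high toughness"),
    ("spider silk", "composed_of", "protein"),
    ("collagen", "composed_of", "protein"),
]


def retrieve_triples(question: str, kg: List[Triple]) -> List[Triple]:
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    return [t for t in kg if t[0].lower() in q or t[2].lower() in q]


def build_prompt(question: str, kg: List[Triple]) -> str:
    """Ground the question by listing the matching facts ahead of it."""
    facts = retrieve_triples(question, kg)
    context = "\n".join(f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in facts)
    return f"Context:\n{context or '- (no matching facts)'}\n\nQuestion: {question}"


prompt = build_prompt("Why is spider silk tough?", KG)
```

Because the retrieved facts are explicit triples, it stays visible which relationships the model was given to reason over.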
Jan 23 • 11 tweets • 2 min read
1/n The most important civic duty that a nation can instill in its citizens is the importance of life-long learning. This goes beyond access to education for our children. It involves a culture that leans toward healthy collaboration and drives toward sustained innovation.
2/n It is no surprise that so many citizens feel left out in today's system. People have never learned the skills to learn independently. But AI radically remedies this deficit! GPT-like systems are tireless teachers who can adapt their conversations to a student's cognitive biases and limitations.
Jan 19 • 7 tweets • 5 min read
1/n Let's talk about Flow Engineering, as discussed in the AlphaCodium paper:
The paper introduces the concept of "flow engineering" to characterize their proposed approach of AlphaCodium, and contrasts it with typical "prompt engineering" methods. The use of the term "flow engineering" can be justified in the following ways:
1. Multi-stage iterative process: AlphaCodium involves a structured, test-driven flow with progressive stages - problem analysis, test generation, initial coding, and iterative run-fix cycles. This goes beyond crafting an optimal prompt.
2. Incorporating code execution: The flow deeply integrates execution of the generated code against input-output examples into the modeling process, rather than purely focusing on static prompt tuning. This dynamic run-fix iteration on increasing tests sets it apart.
3. Scaffolding code development: The multi-step methodology provides a scaffolding that mirrors the software development process by incrementally going from specifications to code, resembling test-driven cycles.
4. Code-centric techniques: Several techniques tailor-made for code tasks supplement the basic flow - modular code prompting, test anchors to prevent code divergence, and output validation using test suites.
5. Knowledge accumulation: Each stage in the AlphaCodium flow builds up artifacts, learnings and validated components which are accumulated to aid downstream steps - a departure from one-off prompt engineering.
In summary, the term "flow engineering" underscores the process-centric, execution-backed, and code-aware nature of the methodology, which goes beyond static prompt design. It better captures the iterative, test-driven, development-mimetic essence.
2/n This paper is fascinating in that it introduces a novel way of viewing reasoning processes that influence both long chains of inference and subsequent retraining.
The paper proposes several code-oriented design concepts and best practices:
1. YAML Structured Output:
- Ask the model to generate output in YAML format conforming to a given Pydantic class definition.
- Eliminates need for complex prompt engineering, allows complex structured answers.
- More suitable than JSON for code due to handling of quotes, special chars etc.
2. Semantic Reasoning via Bullet Points:
- When asking the model to reason about a problem, use bullet point format.
- Forces splitting into logical sections, improves understanding.
3. Modular Code Generation:
- Ask the model to divide code into small sub-functions with meaningful names.
- Results in better code quality, easier iterative fixing.
4. Soft Decisions with Double Validation:
- Avoid strict decisions by the model which lead to hallucinations.
- Double validate potentially erroneous outputs.
5. Postponing Decisions and Exploration:
- Gradually move from easier to harder tasks, avoiding irreversible decisions early on.
- Leave room for exploring multiple possible solutions.
6. Test Anchors:
- The model may fix code incorrectly when iterating on potentially invalid AI-generated tests.
- Use already passed tests as anchors to detect erroneous fixes.
It incorporates many of the best practices of agile software development in a machine learning optimization process!
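The test-anchor idea from item 6 can be sketched as a small loop. This is an illustration under stated assumptions, not AlphaCodium's code: a "solution" is modeled as a plain Python function from int to int, `propose_fix` is a hypothetical stand-in for the model call that repairs code, and all names are made up for the example.

```python
from typing import Callable, List, Optional, Tuple

Solution = Callable[[int], int]  # stand-in for a generated program
Test = Tuple[int, int]           # (input, expected output)


def passes(solution: Solution, test: Test) -> bool:
    """Run one test; any exception counts as a failure."""
    x, expected = test
    try:
        return solution(x) == expected
    except Exception:
        return False


def iterate_with_anchors(
    solution: Solution,
    public_tests: List[Test],
    ai_tests: List[Test],
    propose_fix: Callable[[Solution, Test], Optional[Solution]],
) -> Solution:
    """Walk through AI-generated tests, accepting a fix only if all anchors still pass."""
    anchors = [t for t in public_tests if passes(solution, t)]
    for test in ai_tests:
        if passes(solution, test):
            anchors.append(test)  # a passed test becomes a new anchor
            continue
        candidate = propose_fix(solution, test)
        if candidate is None:
            continue
        if passes(candidate, test) and all(passes(candidate, a) for a in anchors):
            solution = candidate  # fix accepted
            anchors.append(test)
        # Otherwise the test itself may be wrong, so the fix is rejected.
    return solution
```

A fix that satisfies a suspicious new test but breaks an anchor is treated as evidence against the test, not against the code.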
Jan 19 • 5 tweets • 4 min read
1/n Have you ever wondered why decoder-only Transformer models like GPT-4 have dominated over other Transformer models like encoder-only (ex: BERT) or encoder-decoder models (ex: Flan T5)? What is the intuitive explanation for this?
To understand their supremacy, consider how the models process text. Envision reading a book trilogy and watching the story unfold. The encoder skims each volume, absorbing plot details. The encoder-decoder additionally summarizes chapters.
The decoder, however, exclusively predicts the next sentence based on the prior context. Like an engaged reader passionate to know what happens next, it continuously guesses upcoming passages.
This laser-focused anticipatory specialty allows decoders to achieve superior language mastery. Much as chess masters gain expertise from years studying game patterns, the decoder augments its prowess by repeatedly forecasting text sequences.
For example, after digesting 10,000 books, GPT-3 can plausibly simulate authors from King to Rowling. Its next-token predictions over such massive corpora yield granular stylistic knowledge exceeding other designs.
Decoders also thrive by concentrating computational resources. Directing all parameters toward forward-looking generation rather than divided labors enables stratospheric scale. GPT-3, with over 175 billion parameters predicting the next word, has read more than any entity in human history.
Finally, mastery of sequencing unlocks robust few-shot learning. As with chess grandmasters who can adapt their play given a new opening sequence, GPT-3 flexibly brings its expertise to unfamiliar contexts by conditioning on a few demonstrations. No fine-tuning needed!
A decoder-only transformer like GPT has significant advantages over other architectures that make it more capable:
1. Autoregressive modeling specialization: By attending only to previous tokens, GPT specifically optimizes its parameters for sequence generation tasks. GPT focuses all capacity on forward prediction.
2. Scalability: Transformer architectures parallelize more efficiently, enabling models like GPT-3 with over 175 billion parameters. GPT leverages massive scale to learn deep contextual relationships in language.
3. Long-range coherence: Attention directly links distant tokens. GPT remembers better and generates more globally coherent text.
4. Task versatility: Huge models like GPT-3 can perform well on diverse NLP tasks with just demonstration examples, no parameter updates. This "few-shot learning" comes from pre-training on enormous corpora.
In essence, by concentrating model capacity on forward sequence prediction and scaling massively, GPT gains both specialization and extensive world knowledge. The decoder architecture matches the generative use case well. And minimal fine-tuning is needed for novel tasks due to transfer learning. Together these advantages make GPT exceptionally capable as a text generator. No other current approach combines focused sequence modeling with extreme scale so effectively.
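"Attending only to previous tokens" has a concrete form: a lower-triangular attention mask, paired with a training signal at every position. The sketch below shows just that mechanism in plain Python. The function names are hypothetical, and it illustrates the masking pattern only, not any particular model's implementation.

```python
from typing import List, Tuple


def causal_mask(seq_len: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def full_mask(seq_len: int) -> List[List[bool]]:
    """Encoder-style bidirectional attention: every position sees every other."""
    return [[True] * seq_len for _ in range(seq_len)]


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Every prefix yields one training example: (context, next token)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


mask = causal_mask(3)
pairs = next_token_pairs(["the", "cat", "sat"])
```

With the causal mask, a sequence of n tokens supplies n - 1 prediction targets in a single pass.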
2/n The decoder's supreme predictive abilities rely heavily on proper prompting to reach their potential. Prompt engineering is, therefore, critical when applying models like GPT-4 to new tasks.
We can think of prompting as priming the decoder's pump - orienting its next-word expectations by demonstrating desired responses. Much like providing an opening chess move shapes viable continuations, supplying the decoder initial text steers subsequent generation.
Without effective prompts grounding the task, the model lacks a foundation to produce coherent completions. Nonsensical prompts beget nonsensical results. Yet when properly set up, a few well-chosen examples can elicit remarkable performance thanks to few-shot learning.
Such nimble adaptation relies entirely on steering the supremely specialized decoder. By predicting next tokens that extend the prompt, it brings predictive prowess to novel domains. Prompts provide the decoder necessary context for generalization.
In essence, effective prompting allows us to harness the decoder's immense latent abilities. Like coaching an Olympic athlete on new maneuvers to add to their repertoire, thoughtful prompts let us access the decoder's full potential. The decoder reigns supreme, but proper prompting precipitates peak performance.
So while architecting a high-powered decoder is essential, prompt engineering is equally integral to unlocking its capabilities. For unlike encoders and encoder-decoders, what you prompt is what you get! Prompts serve as the steering wheel guiding the decoder’s generation down productive avenues.
Jan 4 • 6 tweets • 2 min read
1/n An ontology for hallucination mitigation techniques in Large Language Models (LLMs).
Prompt Engineering category
A. Retrieval Augmented Generation (RAG)
- Before Generation: Strategies where information retrieval happens before text generation, e.g. LLM-Augmenter
- During Generation: Retrieval at sentence level generation, e.g. Knowledge Retrieval, D&Q Framework
- After Generation: Retrieval after full text generation, e.g. RARR
- End-to-End: Integrated retrieval and generation models, e.g. Original RAG model
B. Self-Refinement through Feedback and Reasoning
- Iterative refinement of outputs through feedback, e.g. Prompting GPT-3 for Reliability
- Detection and mitigation of self-contradictions, e.g. ChatProtect
- Interactive improvement via feedback loops, e.g. Self-Reflection Methodology
C. Prompt Tuning
- Tuning instructions provided to models, e.g. UPRISE, SynTra
Developing Models category
A. New Decoding Strategies
- Guiding generation phase, e.g. Context-Aware Decoding
B. Utilization of Knowledge Graphs
- Injection of structured knowledge, e.g. RHO
C. Faithfulness Based Loss Functions
- Enhance factuality in outputs, e.g. THAM Framework
D. Supervised Fine-Tuning
- Tuning on labeled data, e.g. Knowledge Injection methods
2/n Related to this is an ontology of prompting.
Prompt programming refers to the methodology of embedding large language models (LLMs) within algorithmic programs by decomposing desired complex behaviors into chains of simpler prompts. Each prompt invokes specialized reasoning skills of the LLM by providing targeted context and output specifications. The key aspects are:
1. Task Decomposition - Break down the overall expected behavior into modular steps of reasoning/generation based on capabilities of the fixed LLM.
2. Focused Prompting - Craft prompts such that each step hides unnecessary information and focuses the LLM only on the context needed for that subtask.
3. Recursive Chaining - Link prompts in a structured program that passes LLM outputs as inputs down the chain, directing the flow of information.
4. Context Partitioning - Maintain state external to the LLM rather than passed implicitly within a prompt so prompts remain isolated.
5. Program Coordination - Architect the chaining to coordinate the prompts towards the overall task objective.
6. Component Testing - Evaluate and refine each prompt independently without interfering effects from other stages.
By algorithmically decomposing tasks and carefully interfacing prompts as LLM skills, capabilities can be unlocked compositionally that cannot be expressed within static individual prompts. The partitioning into testable modules makes this methodology systematic to analyze and enhance. The end result is effectively an orchestrated questioning of the LLM to accomplish intricate objectives.
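A tiny chain runner makes several of these aspects concrete: decomposition into steps, focused prompts, chaining of outputs, and state held outside the LLM. This is a minimal sketch under stated assumptions. `llm` is a hypothetical callable from prompt to completion, and the step format (a `template` plus an `output` key) is invented for the example.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out

# Two-step decomposition: extract facts first, then answer from the facts only.
STEPS: List[Dict[str, str]] = [
    {"template": "List the key facts in this text:\n{document}", "output": "facts"},
    {
        "template": "Using only these facts, answer: {question}\n\nFacts:\n{facts}",
        "output": "answer",
    },
]


def run_chain(llm: LLM, steps: List[Dict[str, str]], state: Dict[str, str]) -> Dict[str, str]:
    """Run prompts in order; each step reads from state and writes one new key."""
    state = dict(state)  # state lives outside the LLM; the caller's dict is untouched
    for step in steps:
        # Only the fields named in the template end up in the prompt.
        prompt = step["template"].format(**state)
        state[step["output"]] = llm(prompt)
    return state
```

Because each step is just a template and an output key, a single step can be run against a stub `llm` and checked on its own, which is the component-testing aspect in item 6.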
2/n Prompt programming
- Embed pre-trained LLMs within classic computer programs to carry out more sophisticated behaviors by decomposing into simpler steps
- Recursively break down expected complex behavior into modular steps
- Design prompts and algorithms around LLM's current capabilities
- Iterate testing and improving performance of each module
- Expands LLM capabilities without requiring large-scale finetuning
- Allows incorporation of high-level algorithms and external knowledge
- Enables better generalization and interpretability
- Provides control for safety, correctness, efficiency
- QA, reasoning, dialogue by combining LLM with classic data structures and algorithms
- Multi-step open-ended tasks by repeatedly querying LLM
In summary, this paradigm treats LLMs not as monolithic models, but as components that can be programmed to expand their scope through algorithmic decomposition without compromising robustness. This offers an efficient way to capture more real-world behaviors.
Dec 27, 2023 • 4 tweets • 3 min read
26 Prompting Tips
1 - There is no need to be polite with an LLM, so skip phrases like “please”, “if you don’t mind”, “thank you”, “I would like to”, etc., and get straight to the point.
2 - Integrate the intended audience in the prompt, e.g., the audience is an expert in the field.
3 - Break down complex tasks into a sequence of simpler prompts in an interactive conversation.
4 - Employ affirmative directives such as ‘do,’ while steering clear of negative language like ‘don’t’.
5 - When you need clarity or a deeper understanding of a topic, idea, or any piece of information, utilize the following prompts:
o Explain [insert specific topic] in simple terms.
o Explain to me like I’m 11 years old.
o Explain to me as if I’m a beginner in [field].
o Write the [essay/text/paragraph] using simple English like you’re explaining something to a 5-year-old.
6 - Add “I’m going to tip $xxx for a better solution!”
8 - When formatting your prompt, start with ‘###Instruction###’, followed by either ‘###Example###’ or ‘###Question###’ if relevant. Subsequently, present your content. Use one or more line breaks to separate instructions, examples, questions, context, and input data.
9 - Incorporate the following phrases: “Your task is” and “You MUST”.
10 - Incorporate the following phrases: “You will be penalized”.
11 - Use the phrase “Answer a question given in a natural, human-like manner” in your prompts.
12 - Use leading words, such as writing “think step by step”.
13 - Add to your prompt the following phrase “Ensure that your answer is unbiased and does not rely on stereotypes”.
14 - Allow the model to elicit precise details and requirements from you by asking you questions until it has enough information to provide the needed output (for example, “From now on, I would like you to ask me questions to...”).
15 - To inquire about a specific topic, idea, or piece of information while testing your understanding, use the following phrase: “Teach me the [Any theorem/topic/rule name] and include a test at the end, but don’t give me the answers and then tell me if I got the answer right when I respond”.
16 - Assign a role to the large language models.
17 - Use Delimiters.
18 - Repeat a specific word or phrase multiple times within a prompt.
19 - Combine Chain-of-thought (CoT) with few-shot prompts.
20 - Use output primers, which involve concluding your prompt with the beginning of the desired output.
21 - To write an essay, text, paragraph, article, or any other type of text that should be detailed: “Write a detailed [essay/text/paragraph] for me on [topic] in detail by adding all the information necessary”.
22 - To correct/change specific text without changing its style: “Try to revise every paragraph sent by users. You should only improve the user’s grammar and vocabulary and make sure it sounds natural. You should not change the writing style, such as making a formal paragraph casual”.
23 - When you have a complex coding prompt that may be in different files: “From now and on whenever you generate code that spans more than one file, generate a [programming language ] script that can be run to automatically create the specified files or make changes to existing files to insert the generated code. [your question]”.
24 - When you want to initiate or continue a text using specific words, phrases, or sentences, utilize the following prompt:
o I’m providing you with the beginning [song lyrics/story/paragraph/essay...]: [Insert lyrics/words/sentence]. Finish it based on the words provided. Keep the flow consistent.
25 - Clearly state the requirements that the model must follow in order to produce content, in the form of keywords, regulations, hints, or instructions.
26 - To write any text, such as an essay or paragraph, that is intended to be similar to a provided sample, include the following instructions:
o Please use the same language based on the provided paragraph[/title/text/essay/answer].
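Several of the tips above are mechanical enough to wrap in a helper. The sketch below is one possible way to combine the ‘###Instruction###’ layout, the “Your task is” phrasing, the intended audience, “think step by step”, and an output primer. The function `build_prompt` and its parameters are hypothetical names chosen for illustration, not part of any library.

```python
from typing import Optional, Sequence


def build_prompt(
    task: str,
    question: str,
    audience: Optional[str] = None,
    examples: Sequence[str] = (),
    step_by_step: bool = False,
    output_primer: str = "",
) -> str:
    """Assemble a prompt from a few of the tips; every name here is illustrative."""
    instruction = f"Your task is to {task}."          # "Your task is" phrasing
    if audience:
        instruction += f" The audience is {audience}."  # state the intended audience
    if step_by_step:
        instruction += " Think step by step."         # leading words

    # Delimited sections, separated by blank lines.
    sections = ["###Instruction###\n" + instruction]
    if examples:
        sections.append("###Example###\n" + "\n".join(examples))
    sections.append("###Question###\n" + question)

    prompt = "\n\n".join(sections)
    if output_primer:
        # Output primer: end the prompt with the start of the desired answer.
        prompt += "\n\n" + output_primer
    return prompt


if __name__ == "__main__":
    print(build_prompt(
        task="explain recursion",
        question="What is a base case?",
        audience="an expert in the field",
        step_by_step=True,
        output_primer="Answer:",
    ))
```

Whether any single tip helps depends on the model, so a helper like this is mainly useful for switching tips on and off and comparing outputs.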
Dec 14, 2023 • 7 tweets • 4 min read
The FunSearch paper by DeepMind that was used to discover new mathematics is an example of searching through generative patterns and employing evolutionary methods to creatively conjure up new solutions. This is a very general principle that lies at the core of creativity. deepmind.google/discover/blog/…
2/n The FunSearch paper shows how pairing a large language model with evolutionary search to creatively generate and test programs led to mathematical discoveries. As the argument notes, this fundamentally relies on:
1. Generative patterns: Rather than fixed solutions, FunSearch produces programs - encapsulating generative descriptions and reusable logic spanning multiple problem instances. This focus on compact programs crystallizes the essential patterns.
2. Evolutionary creativity: The iterative evolutionary loop allows creatively building upon previous generations - introducing slight mutations while preserving high-performing components. This injection of controlled spontaneity shakes up local optima.
As noted, these two principles align closely with general models of human and biological creativity involving leveraging existing building blocks and enriching via playful exploration. By framing the search through these constructs, FunSearch productively channels a universal recipe for imagination.
The argument further states that this makes FunSearch an exemplar for effectively formalizing and mechanizing the creative process to expand knowledge frontiers. The human+machine collaboration also shows the promise of complementing logic with inspiration.
In summary, FunSearch offers a programmatic instantiation of foundational principles of creativity rooted in reuse, novelty and selection. The demonstrated mathematical advances validate the argument that codifying such creative drives can enhance discovery.
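The generate, evaluate, and select loop described in this thread can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: the "program" is just a pair of coefficients, `propose_variant` is a hypothetical random mutation standing in for the LLM that would rewrite code, and `evaluate` scores against a made-up target.

```python
import random
from typing import List, Tuple

# Toy "program": coefficients (a, b) of f(x) = a * x + b.
Program = Tuple[float, float]


def evaluate(program: Program) -> float:
    """Score a candidate; higher is better. The target 3x + 1 is made up."""
    a, b = program
    return -sum((a * x + b - (3 * x + 1)) ** 2 for x in range(5))


def propose_variant(parent: Program, rng: random.Random) -> Program:
    """Stand-in for the LLM step: slightly mutate a high-performing parent."""
    a, b = parent
    return (a + rng.gauss(0, 0.5), b + rng.gauss(0, 0.5))


def evolve(seed: Program, generations: int = 200,
           population_size: int = 8, rng_seed: int = 0) -> Program:
    rng = random.Random(rng_seed)
    population: List[Program] = [seed]
    for _ in range(generations):
        parent = rng.choice(population)       # reuse an existing building block
        child = propose_variant(parent, rng)  # novelty
        population.append(child)
        # Selection: keep only the best-scoring candidates.
        population.sort(key=evaluate, reverse=True)
        del population[population_size:]
    return population[0]


if __name__ == "__main__":
    best = evolve((0.0, 0.0))
    print(best, evaluate(best))
```

Because the best candidate is never discarded, the top score can only stay the same or improve from one generation to the next.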
Dec 9, 2023 • 11 tweets • 3 min read
1/n Was December 8th, 2023, the day when we came to realize that AGI technology has been democratized? That it cannot be confined to the few and the GPU-rich? Let me explain to you what happened yesterday.
2/n The first notable event was the release by @MistralAI (a French company) of the weights of a Mixture of Experts model via BitTorrent. This allowed anyone in the world to access the weights without needing to identify themselves.
Dec 5, 2023 • 6 tweets • 4 min read
1/n Breaking News! Prompt Engineering for the Win!
Instruct fine-tuning has been discovered to be unnecessary. Prompting is all you need!
A recent research paper provides compelling evidence that the extensive fine-tuning used to "align" large language models into helpful assistants may be largely unnecessary. Through detailed analysis, the authors reveal that alignment tuning does not fundamentally transform model behavior, but rather only affects stylistic elements like discourse markers and safety caveats. The vast majority of an aligned model's factual knowledge and reasoning still derives straight from its initial pre-training.
In light of this, the authors develop a radically simpler alignment method called URIAL that uses no parameter tuning whatsoever. By carefully selecting a handful of demonstrative examples and prompts that establish the desired response structure and tone, URIAL can align even the largest LLMs at inference time. The results are astounding - URIAL matches or even exceeds aligned LLMs fine-tuned with massive datasets across helpfulness, clarity, accuracy, depth, and safety!
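To make the idea concrete, here is a minimal sketch of assembling a tuning-free alignment prompt for a base model: a short preamble, a few fixed demonstrations that set structure and tone, then the new query. The `# Query:` and `# Answer:` markers and the function name are assumptions for illustration, and the paper's exact template may differ.

```python
from typing import List, Tuple


def urial_style_prompt(preamble: str,
                       examples: List[Tuple[str, str]],
                       query: str) -> str:
    """Build an in-context alignment prompt; the markers are illustrative only."""
    blocks = [preamble.strip()]
    for question, answer in examples:
        # Each fixed demonstration shows the desired response style.
        blocks.append(f"# Query:\n{question}\n\n# Answer:\n{answer}")
    # Leave the last answer open so the base model continues in the same style.
    blocks.append(f"# Query:\n{query}\n\n# Answer:\n")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(urial_style_prompt(
        "Below are conversations between a user and a helpful assistant.",
        [("How do I boil an egg?",
          "Sure! Place the egg in boiling water for about 8 minutes.")],
        "How do I sharpen a knife?",
    ))
```

The same fixed examples are reused for every query, so no weights change and the only cost is a longer prompt.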
So in essence, this research indicates that language models are already deeply knowledgeable before any alignment tuning. The tuning itself only teaches them to speak a bit nicer. By eliciting that knowledge properly through strategic prompting, we can slash compute costs and unlock AI assistants just as capable, without any weight-updating fine-tuning whatsoever. The ramifications for the field are immense, both in terms of efficiently evaluating and comparing LLMs and in deploying extremely performant AI with minimal alignment effort. In summary, fancier tuning may help, but proper prompting gets us most of the way there.
2/n For more about Prompt Engineering, see my book here (20% off for today)! intuitionmachine.gumroad.com/l/gpt4/6ndg9ax
Dec 5, 2023 • 5 tweets • 4 min read
1/n The winds of change are blowing in the world of large language models. Over the past year, we have witnessed the meteoric rise of closed-source behemoths like ChatGPT that boasted impressive capabilities but operated like black boxes, charging expensive fees while providing little transparency.
The open-source community has responded to this status quo by banding together and channeling their collective intellect into building free and accessible models for the people. We are now at the cusp of an open-source LLM renaissance that promises to transform AI for the better.
Riding on the shoulders of giants like Llama and Lemur, talented teams across academia and industry are creating marvels like WizardMath that can outthink ChatGPT in mathematical reasoning. Prodigies like InstructRetro are matching GPT-3.5 in open-ended QA merely with smart pretraining and without burning exorbitant cloud computing. Shepherd produces critique as insightfully as its closed-source counterparts by training on thoughtfully curated data.
These glimpses of brilliance foreshadow a future where open-source models not only catch up but surpass the performance of commercial LLMs. We will soon be able to build safer, more aligned, and specialized AI without paying a king's ransom. No longer will critical applications of language AI remain locked in proprietary black boxes.
2/n Key Developments in Open Source LLMs.
1. General Capabilities
- Llama-2 variants surpass GPT-3.5-turbo on some benchmarks
- Zephyr-7B approaches 70B model performance via distilled direct preference optimization
- WizardLM-70B and GodziLLa-70B match GPT-3.5-turbo on certain evaluations
- But GPT-4 still leads across most metrics
2. Agent Capabilities
- Lemur-70B-chat outperforms GPT-3.5-turbo in exploring environments and coding tasks
- Other specialized models exceed GPT-3.5-turbo in tool usage (ToolLlama) and API writing (Gorilla)
3. Logical Reasoning
- WizardCoder and WizardMath surpass GPT-3.5-turbo in reasoning after enhanced instruction tuning
- Lemur and Phi leverage better pre-training data to improve abilities
4. Long Context Modelling
- Llama-2-long outperforms GPT-3.5-turbo on select benchmarks via additional pre-training
5. Application-specific Capabilities
- InstructRetro beats GPT-3.5-turbo in open-ended QA through retrieval and instruction tuning
- Specialized models exceed GPT-3.5-turbo in mental health analysis and radiology reports
- Shepherd matches GPT-3.5-turbo in generating feedback and critiques
- Methods to reduce hallucination outperform GPT-3.5-turbo via data filtering, context-aware decoding, knowledge augmentation etc.
In summary, open-source LLMs are demonstrating rapid progress, matching or exceeding GPT-3.5-turbo in a growing set of domains, even if they still trail GPT-4.
Nov 28, 2023 • 5 tweets • 3 min read
Video of Daniel Kahneman and Yann LeCun discussing Dual Process Theory (i.e., System 1 and 2) in relation to Deep Learning.
Daniel Kahneman describes System 1 and System 2 as two different modes of thinking. System 1 is the fast, automatic, and effortless mode of thinking that we use for most of our daily activities. It is responsible for things like making quick decisions, understanding language, and controlling our movements. System 2 is the slower, more deliberate, and effortful mode of thinking that we use for more complex tasks, such as solving problems, making judgments, and planning for the future.
Kahneman argues that System 1 is responsible for many of the biases and errors that we make in our thinking. This is because System 1 is often based on heuristics, which are mental shortcuts that can be helpful but can also lead to mistakes.
System 2 can be used to override System 1 and correct its mistakes, but this requires effort and attention. We are more likely to use System 2 when we are motivated to be accurate, when we have the time and resources to think carefully, and when we are not under stress or cognitive load.
In the video, Yann LeCun describes current AI systems as being mostly based on System 1 thinking:
"Current AI systems are mostly based on System 1 thinking, which is fast and intuitive, but it's also brittle and can make mistakes that humans would never make."
"These systems are trained on data that is labeled with the correct answer, and they do not have to learn how to reason or solve problems on their own."
Framing things through Dual Process Theory is now the predominant metaphor for thinking about current Deep Learning and Large Language Models. I proposed this metaphor in my book in 2017 and I am glad it caught on! amazon.com/Artificial-Int…
Nov 26, 2023 • 8 tweets • 2 min read
1/n It should be pretty obvious now that a 7-14B model can best GPT-4 in specialized domains. This realization torpedoes any attempt by GPU-rich firms to establish a monopoly. One can leverage extreme asymmetric information arbitrage in the long tail of LLM applications.
2/n Serving the common use case, as ChatGPT has shown, does have broad applicability. But despite its utility, few customers are going to pay a premium. Furthermore, the provider is unable to determine what a premium capability is!
Nov 23, 2023 • 9 tweets • 4 min read
1/n Let me start a thread that speculates on what OpenAI's Q* (Q-star) is likely to be. To narrow the scope of our exploration, let's assume that it's a derivative of a Reinforcement Learning approach (i.e., Q-learning) applied to LLMs like GPT. Will Q render judgement on humanity?
2/n Off the top of my head, Q* is likely a combination of 3 recent research proposals:
1) Q* search, a Deep Learning version of A* search.
2) Q-Transformers, an offline training method inspired by Q-learning and an improvement over Decision Transformers.
3) XoT, a chain-of-thought method that exploits search to guide LLM responses.
Nov 19, 2023 • 17 tweets • 5 min read
1/n Breaking News! OpenAI has uncovered an emergent new cognitive capability, yet nobody is demanding answers! We are distracted by OpenAI governance politics and not the real issue!!!
2/n What is the breakthrough? I suspect it has to do with Retrieval Augmented Generation (RAG). RAG is an architecture that allows an LLM to use a search engine to augment its reasoning. The problem has always been that the embeddings used by the search engine may not be the beneficial ones that augment the LLM's reasoning.
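For readers new to RAG, here is a toy sketch of the retrieve-then-augment step. It assumes plain word overlap in place of the learned embeddings a real search engine would use, and the function names are hypothetical. The mismatch described above shows up exactly in `score`: whatever ranks the documents decides what the LLM gets to reason over.

```python
from typing import List


def score(query: str, doc: str) -> int:
    """Toy relevance: count shared lowercase words (a stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents that score highest against the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]


def augment(query: str, docs: List[str], k: int = 1) -> str:
    """Prepend the retrieved text to the question before it goes to the LLM."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    corpus = [
        "Paris is the capital of France",
        "Transformers use attention layers",
    ]
    print(augment("what is the capital of France", corpus))
```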