Imagine chatting with an AI assistant that can keep the conversation flowing for weeks or months without grinding to a halt. Or a model that can stream through reports spanning thousands of pages. StreamingLLM makes this possible by enabling language models to smoothly handle endless text without losing steam.
Current LLMs are like students cramming for an exam - they can only hold a limited context in working memory, and they break down once a conversation outgrows it. StreamingLLM lets the model keep reading and responding fluently no matter how long the stream grows.
It works by identifying and preserving the model's inherent "attention sinks" - the initial tokens that anchor its attention distribution. Combined with a rolling cache of recent tokens, StreamingLLM delivers up to 22x faster decoding than the sliding-window-with-recomputation baseline, with comparable language modeling quality.
You know that irksome feeling when a chatbot grinds to a halt or turns incoherent once a session runs long? StreamingLLM removes that failure mode: the model keeps generating coherent text over millions of tokens without missing a beat.
Monumental books, verbose contracts, drawn-out debates - StreamingLLM takes them all in its stride, with no stalls and no runaway memory use. It's like upgrading your assistant's RAM to handle heavier workloads flawlessly.
Key advantages of StreamingLLM over other approaches are:
- Enables infinite length streaming without needing to increase model capacity or fine-tune the model. Methods like expanding the attention window require substantially more compute for training larger models.
- Much more efficient than alternatives like sliding window with recomputation. StreamingLLM provides up to 22x speedup per token decoding.
- Stable performance on texts massively longer than training length. Matches performance of sliding window baseline on 4 million+ tokens.
- Simple and versatile. Easily incorporated into models with relative position encoding like RoPE or ALiBi.
- Pre-training with a dedicated sink token further boosts streaming performance; a single learnable token is enough.
- Decouples model pre-training length from actual generation length. Allows extending model use cases.
So in summary, StreamingLLM unlocks efficient and performant infinite-length streaming from pretrained LLMs without any model modification or fine-tuning, and is superior to other techniques on metrics like speed, compute overhead and length generalization.
Here's the StreamingLLM method:
Background:
- Autoregressive language models are limited to contexts of finite length due to the quadratic self-attention complexity.
- Windowed attention partially addresses this by caching only the most recent tokens.
- However, windowed attention breaks down as soon as the initial tokens are evicted from the cache: perplexity spikes even though those tokens carry little semantic content.
Key Idea:
- Initial tokens act as "attention sinks": they collect a disproportionate share of attention regardless of their meaning, because softmax forces attention weights to sum to one and these tokens are visible to every later position.
- Preserve these tokens alongside a rolling cache to stabilize attention distributions.
Approach:
- Split KV cache into two segments:
- Attention Sinks: Consists of the first 4 tokens of the sequence.
- Rolling Cache: Holds the most recent T tokens.
- For each new decoded token:
- Evict the oldest token from the rolling cache.
- Append the new token to the rolling cache.
- Compute self-attention over [Attention Sinks + Rolling Cache].
- Apply relative position encoding using positions within the current cache rather than positions in the original text (a minimal cache sketch follows this list).
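To make the mechanics concrete, here is a minimal sketch of the cache policy, assuming a per-layer KV cache stored as a list of key/value pairs. It is not the authors' implementation, and names like `num_sinks` and `window_size` are illustrative.

```python
# Minimal sketch of the StreamingLLM cache policy (illustrative, not the
# authors' code). Assumes each decoded token contributes one KV pair per layer.
from collections import deque

class StreamingKVCache:
    def __init__(self, num_sinks=4, window_size=1024):
        self.num_sinks = num_sinks      # initial tokens kept as attention sinks
        self.window_size = window_size  # size of the rolling cache of recent tokens
        self.sinks = []                 # KV pairs of the first `num_sinks` tokens
        self.window = deque()           # KV pairs of the most recent tokens

    def append(self, kv):
        """Add the KV pair of a newly decoded token."""
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)       # still filling the sink segment
            return
        if len(self.window) == self.window_size:
            self.window.popleft()       # evict the oldest non-sink token
        self.window.append(kv)

    def active_cache(self):
        """KV pairs attended over at this step: sinks + rolling window.
        Relative positions are assigned within this cache (0, 1, 2, ...),
        not according to positions in the original stream."""
        return self.sinks + list(self.window)
```

Because each decoding step attends only over `active_cache()`, compute and memory stay constant no matter how long the stream has already run.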
For too long, the capabilities of large language models have been constrained by their reliance on human-crafted prompts. SocraticAI provides a more natural paradigm for AI collaboration and reasoning.
SocraticAI simulates fluid human discussion through three distinct AI agents - Socrates, Theaetetus, and Plato. Modelled after Plato's dialogues, each agent plays a specialized role in collectively uncovering solutions. Socrates artfully poses probing questions, while Theaetetus actively engages in reasoned debate. Plato scrutinizes their logic as a meticulous proofreader.
This cooperative framework removes the need for rigid, pre-defined prompting. Instead, the AI agents organically shape their own discourse, leveraging each other's diverse viewpoints to illuminate the problem space from multiple angles. Their autonomous exchange of knowledge and ideas promotes greater creativity than any single agent could achieve alone.
SocraticAI allows AI to truly learn through dialogue - questioning, explaining, and building upon new insights as they emerge. The collaborative autonomy more closely mirrors human cognition and conversation than prompt-based approaches.
Integrated access to external resources also enriches the agents' reasoning abilities. Consult WolframAlpha to verify facts. Execute Python code to implement solutions on the fly. The framework smoothly incorporates these tools into the conversational flow.
Unlock the full potential of your AI and witness the collective intelligence that emerges through Socratic discussion. SocraticAI pioneers a new paradigm for AI collaboration that transcends reliance on human prompting. Let your models engage in organic, multi-faceted problem solving through the power of peer learning. The future of AI is social.
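As a rough sketch of how such a loop could be wired up (an illustration, not the SocraticAI codebase): `llm_reply` below is a placeholder for whatever chat-completion call you use, and the role prompts, turn order, and "FINAL ANSWER" stopping convention are assumptions made for the example.

```python
# Illustrative three-agent Socratic loop; roles follow the description above.
def llm_reply(system_prompt: str, transcript: str) -> str:
    """Placeholder: call your chat model with a role-specific system prompt
    plus the running transcript, and return its reply."""
    raise NotImplementedError

ROLES = {
    "Socrates":   "Ask probing questions that expose gaps in the current reasoning.",
    "Theaetetus": "Answer the questions, propose solutions, and defend or revise them.",
    "Plato":      "Proofread the exchange; point out logical errors or confirm the result.",
}

def socratic_dialogue(problem: str, max_rounds: int = 5) -> str:
    transcript = f"Problem: {problem}\n"
    for _ in range(max_rounds):
        for name, role in ROLES.items():
            reply = llm_reply(role, transcript)
            transcript += f"{name}: {reply}\n"
            if "FINAL ANSWER" in reply:   # assumed stopping convention
                return transcript
    return transcript
```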
SocraticAI is based on the Theory of Anamnesis, a concept proposed by the ancient Greek philosopher Plato. The main ideas behind this theory are:
- Anamnesis means "recollection" or "remembrance" in Greek. The theory states that our minds are born possessing all knowledge, and learning is the process of recollecting or remembering facts that our souls already know.
- Plato suggested that our souls exist before our birth and possess knowledge of universal truths, as they can access the eternal, unchanging world of abstract forms and ideas. However, at birth this knowledge is forgotten.
- Through questioning and critical thinking, one can rediscover the knowledge innate to their soul, and distinguish truth from falsehood. This recollected knowledge is more reliable than information perceived through the imperfect physical senses.
- Plato illustrated this idea through Socrates' interactions, like his dialogue with the uneducated slave boy in the text Meno. Socrates demonstrates how the boy, through insightful questioning, can "recollect" mathematical truths despite no prior education.
- The theory implies existence before birth and the immortality of the soul. Plato viewed philosophical enlightenment as rediscovering abstract, universal knowledge embedded in one's psyche.
- Some later philosophers like Leibniz and Jung adopted similar ideas about inherent knowledge within the subconscious that can be brought to light through reflection.
In summary, the Theory of Anamnesis proposes that learning is not acquiring new knowledge, but rediscovering latent knowledge already present within us from before birth. Questioning and critical reasoning helps extract these forgotten truths buried in our minds.
The two key points about using Socratic dialogues among LLMs:
A. Theory of anamnesis suggests LLMs contain innate knowledge that can be extracted through discussion
- The theory of anamnesis proposes that the mind already contains all knowledge, but it must be drawn out through questioning and reflection.
- Similarly, large language models are trained on massive datasets, giving them substantial innate knowledge about language and the world.
- However, LLMs struggle to properly contextualize and apply this knowledge without human prompting. Their responses often reflect dataset biases or limitations.
- Engaging multiple LLMs in Socratic dialogue is analogous to the way Socrates led his interlocutors to uncover truths through discussion.
- By exchanging perspectives and critically examining each other's logic, the LLM agents can elicit their latent knowledge.
- One LLM plays the role of Socrates, skillfully asking guiding questions to lead the others towards valid solutions and insights.
- The process of collaborative reasoning helps extract knowledge the LLMs possess but cannot articulate on their own without the right prompts.
B. Removes need for task-specific prompts, allows LLMs to engage in free-form inquiry
- Typically, humans must carefully engineer prompts for each specific task to coerce useful behaviors from LLMs.
- Socratic dialogue removes the need to create custom prompt templates for every new problem.
- Instead, the LLMs themselves construct prompts and responses aimed at collectively tackling the challenge.
- Without fixed constraints imposed by pre-defined prompts, the agents have autonomy to freely explore different lines of inquiry.
- The conversational format allows them to organically shape the direction of the discussion and build on each other's contributions.
- Their inquiry is not limited to a single static prompt, enabling more flexibility and creativity in exploring solutions.
- This freer flow of ideas can unlock problem-solving abilities that would otherwise require laborious prompt engineering.
In summary, Socratic dialogue provides an alternative paradigm that leverages LLMs' innate knowledge without relying on fixed prompting formats. The exchange of ideas and questioning allows them to demonstrate stronger reasoning and discovery.
Have you ever felt frustrated by the factual inconsistencies and falsehoods generated by large language models? As helpful as these models can be, their tendency to "hallucinate" incorrect information threatens their reliability and hinders real-world deployment.
We have exciting news - there is now a straightforward way to improve the truthfulness of LLMs without any model retraining or integration of external knowledge. Introducing DoLa, a novel decoding strategy that guides LLMs to generate content more grounded in factual knowledge they acquired during pretraining.
DoLa works by contrasting the knowledge differences between the model's own layers. It amplifies factual signals from later layers while filtering out syntactically plausible but incorrect predictions coming from earlier layers. This simple but clever approach steers the model's next-word predictions towards factual responses.
The beauty of DoLa lies in its simplicity and effectiveness. It significantly boosts truthfulness and factual accuracy across a variety of tasks with just a small inference-time computation. DoLa consistently outperformed existing methods for mitigating falsehoods in LLMs.
Crucially, DoLa provides a plug-and-play solution that requires no retraining of model parameters or integration of external knowledge sources. It delivers substantial truthfulness gains through inference-time decoding alone.
The future possibilities are also exciting. By dynamically surfacing the factual knowledge already present within LLMs, DoLa lays a strong foundation for safer and more reliable language models. Its gains could be pushed further by combining DoLa with retrieval mechanisms that ground predictions in external knowledge bases.
Key observations of how factual knowledge evolves across transformer layers in LLMs:
- Research has shown transformer LMs encode different types of information in different layers.
- Earlier layers tend to capture lower-level syntactic and linguistic patterns. They can recognize if a sequence of tokens has the right syntactic structure to form a plausible sentence, even if it does not contain factual information.
- Later layers incorporate more semantic meaning and factual knowledge that is aggregated throughout the model.
- For example, given a question asking for a factual date or name entity, the model may assign high probability to syntactically plausible but incorrect answers in earlier layers.
- As it passes through later layers, the probability of the factually accurate entity increases as additional factual knowledge gets incorporated.
- Meanwhile, the probability of the syntactically plausible but incorrect answer stays relatively constant across layers.
- This indicates the model progressively shifts from relying on linguistic patterns to factual knowledge when moving from early to later layers.
- The differences in knowledge between layers can be quantified by computing the Jensen-Shannon (JS) divergence between early-exit output distributions and the final layer's distribution.
- By contrasting an early layer that produces syntactically plausible outputs with a later layer that produces factually accurate outputs, the model can be steered towards more factual generations.
In summary, transformer LMs exhibit a hierarchical structure in which early layers focus on linguistic plausibility while later layers incorporate more factual knowledge. Contrasting these differences can guide the model to generate more truthful content.
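A rough sketch of that contrast follows, assuming a Hugging Face-style causal LM that returns per-layer hidden states and exposes an output head `lm_head`. The candidate layer indices are illustrative, and refinements of the actual method, such as the adaptive plausibility constraint and layer bucketing, are omitted.

```python
# Simplified DoLa-style contrastive decoding for a single next-token step.
import torch.nn.functional as F

def dola_next_token_logits(model, input_ids, candidate_layers=(2, 4, 6, 8)):
    out = model(input_ids, output_hidden_states=True)
    hidden = out.hidden_states                       # embeddings + one entry per layer
    final_logprobs = F.log_softmax(model.lm_head(hidden[-1][:, -1]), dim=-1)

    def js_div(p_log, q_log):
        # Jensen-Shannon divergence between two distributions given as log-probs.
        m = 0.5 * (p_log.exp() + q_log.exp())
        return 0.5 * (F.kl_div(m.log(), p_log.exp(), reduction="sum")
                      + F.kl_div(m.log(), q_log.exp(), reduction="sum"))

    # Dynamically pick the "premature" layer whose next-token distribution
    # diverges most from the final ("mature") layer's distribution.
    best_layer = max(
        candidate_layers,
        key=lambda l: js_div(F.log_softmax(model.lm_head(hidden[l][:, -1]), dim=-1),
                             final_logprobs).item())

    early_logprobs = F.log_softmax(model.lm_head(hidden[best_layer][:, -1]), dim=-1)
    # Contrast: amplify what the mature layer predicts beyond the premature layer.
    return final_logprobs - early_logprobs
```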
Related work
Hallucinations in LLMs:
- LLMs can generate false content not based on training data, referred to as "hallucinations"
- Caused by imperfect learning, decoding strategies, model overconfidence, etc.
- Mitigation strategies involve RL fine-tuning, distillation, self-consistency checks, debate models
Knowledge distribution in transformers:
- Interpretability research shows linguistic knowledge evolves across transformer layers
- Earlier layers capture syntax, later ones capture semantics and facts
- Knowledge "neurons" found concentrated in top layers of BERT
- Specific layers edit factual knowledge when directly manipulated
Contrastive Decoding:
- Decodes by contrasting expert and amateur LMs to improve fluency and coherence
- Typically contrast large and small LMs as expert/amateur
- Amateur LM fixed throughout, doesn't adapt to example difficulty
- DoLa contrasts model's own layers, dynamically selects premature layer
Promptbreeder employs large language models like GPT-3 to iteratively improve text prompts. But here's the magic - it doesn't just evolve the prompts themselves. It also evolves how the prompts are generated in the first place.
Let's break it down. Promptbreeder initializes a population of prompt variations for a task. It tests them out to see which perform best. The winners are "mutated" - modified in some way - and inserted back into the population. Rinse and repeat.
But Promptbreeder makes the mutations smarter over time. It uses the AI to generate "mutation prompts" - instructions for how to mutate and improve a prompt. And it evolves better and better mutation prompts.
So Promptbreeder is constantly getting better at getting better. It's a self-improving, self-referential loop, with natural language as the substrate. No messy neural network fine-tuning required.
The results are prompts that are specialized and highly optimized for specific applications. On math, logic, and language tasks, Promptbreeder outperforms other state-of-the-art prompting techniques.
This is a taste of the future. Soon AIs could help us find the right words, crystallize fuzzy ideas, and turn chaotic thoughts into elegant expressions. Promptbreeder demonstrates the potential for language models to be creative collaborators, not just passive tools.
Key details about the Promptbreeder approach:
Overview:
- Evolves a population of task-prompts and associated mutation-prompts using an LLM-driven evolutionary algorithm.
- Task-prompts are prompts that condition the LLM to improve its performance on a task.
- Mutation-prompts are used by the LLM to generate variations of task-prompts.
Initialization:
- Initializes population by generating varied task-prompts using mutation-prompts, "thinking styles", and the problem description.
- Creates diversity by combining different mutation-prompts and thinking styles.
- Each individual in the population has a task-prompt set and associated mutation-prompt.
Fitness Evaluation:
- Evaluates a task-prompt's fitness by its performance on a batch of training examples.
- Allows evolution to optimize prompts for the specific dataset.
Mutation Operators:
- Uses 9 operators like direct mutation, EDA mutation, hyper-mutation of mutation-prompts.
- Operators generate new prompts from scratch, modify existing prompts, or induce from examples.
- Different operators employ diverse strategies to explore the space.
Self-Referential Mechanism:
- Key innovation is co-evolving the mutation-prompts using the LLM itself.
- By mutating the mutation-prompts over generations, the system improves how it improves the task-prompts.
- This self-referential process lets the LLM recursively explore better ways to explore prompts.
So in summary, Promptbreeder co-evolves task and mutation prompts in a self-referential loop to automatically find highly optimized prompts for a specific task.
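A minimal sketch of that loop, under stated assumptions: `llm` and `evaluate` are placeholders, the binary tournament and the 10% hyper-mutation rate are illustrative, and only two of the nine operators appear.

```python
# Illustrative Promptbreeder-style outer loop (not the paper's code).
import random

def llm(prompt: str) -> str:
    """Placeholder for a call to the underlying language model."""
    raise NotImplementedError

def evaluate(task_prompt: str, train_batch) -> float:
    """Placeholder: score a task-prompt by accuracy on a batch of training examples."""
    raise NotImplementedError

def first_order_mutation(task_prompt: str, mutation_prompt: str) -> str:
    # Core operation: the LLM rewrites the task-prompt as the mutation-prompt instructs.
    return llm(f"{mutation_prompt}\nINSTRUCTION: {task_prompt}\nNew instruction:")

def hyper_mutation(mutation_prompt: str) -> str:
    # Self-referential step: the LLM also improves the mutation-prompt itself.
    return llm(f"Please improve this prompt-mutation instruction:\n{mutation_prompt}")

def promptbreeder(population, train_batches, generations=200):
    # population: list of {"task": str, "mutator": str} individuals
    for gen in range(generations):
        batch = train_batches[gen % len(train_batches)]
        # Binary tournament: the loser is overwritten by a mutated copy of the winner.
        a, b = random.sample(range(len(population)), 2)
        fa = evaluate(population[a]["task"], batch)
        fb = evaluate(population[b]["task"], batch)
        winner, loser = (a, b) if fa >= fb else (b, a)
        child = dict(population[winner])
        if random.random() < 0.1:              # occasionally mutate the mutation-prompt
            child["mutator"] = hyper_mutation(child["mutator"])
        child["task"] = first_order_mutation(child["task"], child["mutator"])
        population[loser] = child
    return population
```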
Here are the 9 mutation operators used in Promptbreeder:
1. Zero-order Prompt Generation: Generates entirely new prompt from scratch using the problem description. Provides fresh prompts not dependent on existing ones.
2. First-order Prompt Generation: Mutates existing prompt using the associated mutation-prompt. The core mutation operation.
3. Estimation of Distribution (EDA) Mutation: Generates new prompts based on patterns in a filtered list of current population. Aims to extrapolate improvements.
4. EDA Rank and Index Mutation: Same but prompts listed in fitness order to guide sampling.
5. Lineage Based Mutation: Provides LLM with chronological list of historical best prompts to guide new prompt generation.
6. Zero-order Hyper-Mutation: Generates entirely new mutation-prompt from scratch. Provides fresh starting points.
7. First-order Hyper-Mutation: Mutates existing mutation-prompt using a "hyper-mutation prompt". Allows co-evolution of mutation operators.
8. Working Out to Task-Prompt (Lamarckian): Generates new task-prompt from previous successful working out steps. Learns from past good solutions.
9. Prompt Crossover: Combines parts of two parent task-prompts.
This diverse set of operators employs different strategies: generating entirely new prompts/mutations, modifying existing ones, learning from past successes, combining information, etc. This helps thoroughly explore the space of possible prompts.
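For flavor, here are hypothetical phrasings of two of these operators. The templates are assumptions rather than the paper's exact prompts, and `llm` is again a placeholder for the model call.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to the underlying language model."""
    raise NotImplementedError

def eda_mutation(current_prompts: list[str]) -> str:
    # Estimation of Distribution: show a filtered list of the current population
    # and ask the LLM to continue it with a new, distinct prompt.
    listing = "\n".join(f"- {p}" for p in current_prompts)
    return llm(f"Here is a list of instructions for the task:\n{listing}\n"
               "Write one more instruction that differs from all of the above:")

def lamarckian_mutation(worked_example: str) -> str:
    # Working Out to Task-Prompt: induce an instruction from a correct worked solution.
    return llm("I gave a friend an instruction and they solved the problem like this:\n"
               f"{worked_example}\nWhat was the instruction? State it as a new prompt:")
```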
Umberto Eco may have been one of the most voracious readers who ever lived. He died in 2016. He was also a prolific writer. We must thus ask: what did he say that is strikingly relevant to GPT-4 and other Large Language Models (LLMs)?
Umberto Eco's ideas in semiotics and interpretation can be particularly relevant and insightful when considering the functioning and development of today's Large Language Models (LLMs) like GPT-4. Here are several ways in which Eco’s theories intersect with the principles underlying these models:
1. Openness to Interpretation:
Eco’s Concept: Eco proposed the idea of "The Open Work," suggesting that texts are open to multiple interpretations based on the reader’s background, culture, and experiences.
Relevance to LLMs: LLMs generate responses based on the input they receive, and the interpretation of this input can vary widely. The “openness” of the model’s responses to different interpretations reflects Eco’s concept, as the model tries to predict what a human might say, which can be inherently subjective and open-ended.
2. Unlimited Semiosis:
Eco’s Concept: Eco introduced the idea of “unlimited semiosis,” meaning that the interpretation of signs can lead to an infinite chain of meanings.
Relevance to LLMs: LLMs, when generating text, can exhibit a form of unlimited semiosis as they can produce an almost infinite variety of responses based on the input, context, and the vast amount of information they have been trained on.
3. Intertextuality:
Eco’s Concept: Eco’s notion of intertextuality implies that texts derive meaning from their relationship with other texts.
Relevance to LLMs: LLMs are trained on diverse and extensive corpora of text, learning relationships, references, and connections between different texts, which is a form of practical intertextuality.
4. Role of the Reader/Interpreter:
Eco’s Concept: Eco emphasized the active role of the reader in constructing meaning from a text, introducing the idea of the “Model Reader.”
Relevance to LLMs: Users of LLMs play a similar role to Eco’s “Model Reader” as they interact with the model, ask questions, and interpret the generated responses. The effectiveness of the model depends on how well it aligns with the user’s expectations and understanding.
5. Cultural Units and Codes:
Eco’s Concept: Eco talked about “cultural units” and the importance of codes in understanding and interpreting signs within a cultural context.
Relevance to LLMs: LLMs need to understand and generate text that is culturally appropriate and coherent. They learn cultural codes and units implicitly from their training data, which influences their performance and ability to generate culturally relevant responses.
Umberto Eco adapted and expanded Charles Sanders Peirce's concept of abduction in explaining the process of interpretation and meaning production. Eco used the concept of abduction to explain how readers generate meaning by inferring the intentions of the author and the conventions of the genre.
Some key points:
- Abduction is a form of logical inference that generates new knowledge and ideas. Unlike deduction or induction, abduction infers explanatory hypotheses.
- Eco evokes abduction to explain how readers produce interpretations of a text by creatively inferring meanings, authorial intent, intertextual links etc. that are not explicit.
- Abductive inferences are fallible guesses, unlike deductive proofs. But they create new semantic insights.
- Readers abduce the presence of a Model Author or intent based on textual evidence. Similarly for intertextual connections.
- Eco argues that abductive inferences rely on the reader's encyclopedic knowledge and semiotic competence. We infer based on our cultural codes.
- Abduction provides an open-ended plurality of interpretations, not definitively true but textually permissible.
- While fallible, abduction is guided by semantic and textual constraints. Radically incoherent abductions are invalid.
- Eco distinguishes between semantic and critical interpretations. The latter critically evaluate textual consistency, morality, values etc.
- For Eco, abduction operates within the limits set by authorial intent and textual manifestation. Interpretive validity depends on this.
In essence, Eco appropriated abduction as a semiotic explanatory principle regarding how readers infer textual meanings, create intertextual links, posit authorial intent etc. based on encyclopedic knowledge and cultural codes. It provides for an open plurality of meaning without lapsing into interpretive relativism.
Thought Cloning could enable a revolutionary leap in AI capabilities. For the first time, agents would not just blindly mimic human behaviors, but gain insight into the underlying thought processes behind those behaviors. Just as language transformed human cognition, teaching agents to think in natural language could vastly expand their reasoning, planning, and general intelligence.
Consider how children learn. We don't just show them what to do, we explain why and how we are doing it. This allows them to abstract principles that transfer to novel situations. Thought Cloning aims to bring this kind of apprenticeship learning to AI.
By learning from datasets where people narrate their thoughts as they act, agents can link low-level behaviors to high-level mental deliberations. This could allow integrating powerful cognitive models like long-term planning, causal reasoning, imagination, and goal-setting that are beyond today's AI.
We already know language models can perform human-level reasoning in narrow domains. Thought Cloning offers a path to grounding this reasoning in the physical world of action. The compositional structure of language would allow agents to efficiently explore massive spaces of strategies and mental models to solve problems.
Thought Cloning also provides innate transparency. We could literally peer into the mind of the AI to understand its intentions, diagnose mistakes, and correct undesirable thinking. This built-in interpretability facilitates safety and alignment.
Thinking in language refers to the process of using words and linguistic constructs to reason, make plans, explore ideas, and direct one's internal cognitive processes. Some key aspects of thinking in language include:
- Using words and sentences to articulate thoughts, goals, hypotheses, beliefs, etc. Rather than thinking purely in abstract concepts or visualizations, putting thoughts into language gives them more definition and structure.
- Combining and manipulating words and phrases to generate new ideas, make inferences, weigh alternatives, analyze probabilities, etc. The compositional nature of language allows combining known concepts in novel ways.
- Using dialog with oneself to work through problems, imagine scenarios, clarify thinking, surface assumptions, etc. Language provides a tool for internal discussion.
- Employing narrative and storytelling to link concepts, predict outcomes, establish causality, etc. Language allows building mental models to explain phenomena.
- Leveraging the efficiency and nuance of linguistic constructs like metaphors, analogies, if-then statements, negation, irony, etc. to add color, texture, and depth to thinking.
- Drawing on the vast lexicon of words to precisely capture subtle shades of meaning in one's thoughts. Having a wide vocabulary provides more expressive power.
- Using grammatical structures like conditionals ("if X then Y"), hypotheticals ("imagine if..."), questions ("what if we...?") etc. to explore possibilities and hypotheses.
- Harnessing the fluidity of language to rapidly update one's thinking in response to new inputs, perspectives, or evidence. Thinking aloud facilitates adaptability.
In summary, thinking in language facilitates sophisticated cognitive capabilities like causal reasoning, conceptual combination, long-term planning, mental simulation, deductive logic, and creative ideation that exceed what is possible with purely non-lingual cognition. It provides a uniquely potent medium for deliberation, imagination, and problem solving.
The Thought Cloning framework:
A. Dataset
- Contains trajectories aligning human thinking (in natural language) with actions
- Each trajectory has:
- A mission defined in natural language
- At each timestep:
- An observation (e.g. visual input)
- A thought (e.g. a sentence like "I should open the door")
- An action (e.g. walk forward)
- Data could come from sources like narrated gameplay videos and transcripts
B. Bi-level architecture
1. Upper level - Thought generator
- Encodes mission and previous thoughts (e.g. with LSTM)
- Generates a thought (sentence in natural language) at each timestep
- Objective is to imitate human thoughts in the dataset
2. Lower level - Action policy
- Encodes mission, observations, and thought from upper level
- Generates action at each timestep (e.g. movement in environment)
- Objective is to imitate human actions in the dataset
- Conditional on thoughts from upper level
C. Training
- Jointly train the upper and lower levels
- Upper level trained to clone human thoughts via cross-entropy loss
- Lower level trained to clone human actions via cross-entropy loss
- Actions conditioned on upper-level thoughts via teacher forcing: the ground-truth thought is fed to the policy early in training, then annealed toward the generated thought (see the sketch below)
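Here is a minimal sketch of the resulting bi-level loss, under stated assumptions: the module interfaces, tensor shapes, and teacher-forcing schedule are illustrative, not the paper's code.

```python
# Illustrative Thought Cloning training step (bi-level imitation losses).
import torch
import torch.nn as nn

class ThoughtCloningAgent(nn.Module):
    def __init__(self, thought_generator: nn.Module, action_policy: nn.Module):
        super().__init__()
        self.thought_generator = thought_generator  # upper level: (mission, obs) -> thought tokens
        self.action_policy = action_policy          # lower level: (mission, obs, thought) -> action logits

    def training_step(self, mission, obs, gt_thought_tokens, gt_action, teacher_forcing_p=1.0):
        # Upper level: imitate the human's narrated thought (token-level cross-entropy).
        # Assumed shapes: thought_logits (B, T, V), gt_thought_tokens (B, T).
        thought_logits, generated_thought = self.thought_generator(mission, obs)
        thought_loss = nn.functional.cross_entropy(
            thought_logits.flatten(0, 1), gt_thought_tokens.flatten())

        # Lower level: imitate the human's action, conditioned on a thought.
        # Early in training the ground-truth thought is fed (teacher forcing);
        # teacher_forcing_p is annealed so the policy learns to follow generated thoughts.
        use_gt = torch.rand(()).item() < teacher_forcing_p
        thought = gt_thought_tokens if use_gt else generated_thought
        action_logits = self.action_policy(mission, obs, thought)
        action_loss = nn.functional.cross_entropy(action_logits, gt_action)

        return thought_loss + action_loss
```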
Just watched a full-length movie on YouTube: "Genius." It's a true story about the novelist Thomas Wolfe and his editor. Interestingly, the editor went to great lengths to cut the book down in size, perhaps for economic reasons. But you have to ask, was this a good or a bad thing?
Kevin Kelly @kevin2kelly has a recent interview in which he describes the writing process as bottomless: he only begins a book when he sets a deadline. Writing a book is a battle against perfection and completeness.
@kevin2kelly I just finished writing a book that I began at the start of the pandemic. It was triggered by a hunch that framing the development of artificial intelligence with respect to the development of empathy would be enlightening. gum.co/empathy