Carlos E. Perez
Sep 24, 2020
Raise your hand if you get triggered by machine learning people who claim to understand intelligence when they have never read a word on cybernetics, semiotics, enactivism, or ecological psychology. #ai
More generally, they have never read any text about the importance of subjectivity to intelligence.
Seems like the 'shut up and calculate' mode of science continues to dominate the agenda. :-(
The saddest thing is that these are the same people who want to drive the conversation on AI ethics.
No wonder we are heading straight over a cliff. This mechanistic, objective view of reality leads us to build nothing but a mechanistic, objective reality. Good luck living in a future world as nothing but a cog in the machinery.

• • •

More from @IntuitMachine

Sep 15
1/n Terence Tao, arguably the most gifted living mathematician, has tried GPT-o1 and this is his verdict: "However, this was an improvement over previous models, whose capability was closer to an actually incompetent graduate student. It may only take one or two further iterations of improved capability (and integration with other tools, such as computer algebra packages and proof assistants) until the level of "competent graduate student" is reached."
2/n Here, Tao attempts to use o1 to formulate the problem in Lean (a math theorem prover), placing blame on o1's ignorance of Lean's latest capabilities. Here's the link: chatgpt.com/share/bb0b1cfa…
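For readers unfamiliar with Lean, here is a minimal illustrative sketch of what "formulating a problem in Lean" looks like: you write the formal statement precisely and leave the proof as a placeholder. This is a toy statement chosen for illustration, not the problem Tao was actually working on.

```lean
-- Toy illustration only (not Tao's problem): formalizing a problem means writing
-- its statement in Lean; the proof itself is left as a `sorry` placeholder.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  sorry
```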
3/n OMG. Just looking at the conversation between Tao and o1 in the previous example could make anybody insecure about their own reasoning abilities! I would think that, for basic problems, asking o1 to formulate Lean code for a subsequent proof should be easy. What does "a mediocre but competent graduate student" mean for normal real-world problems?!
Aug 27
1/n Why Even the Best LLMs Still Struggle with True Creative Writing

The rapid evolution of Large Language Models (LLMs) has fueled both excitement and apprehension. While their ability to mimic human language and generate coherent text is undeniable, a crucial question lingers: can AI truly be creative? The paper "Pron vs Prompt: Can LLMs Challenge World-Class Fiction Authors?" tackles this question head-on, exploring the nuanced realm of creative writing to assess whether LLMs can compete with the best human storytellers.

The paper identifies a key pain point in current AI research: the tendency to compare LLMs to average human writers. While exceeding average performance is notable, it doesn't address whether AI possesses the ingenuity and artistry of a master wordsmith. To bridge this gap, the researchers designed a unique experiment pitting GPT-4, a leading LLM, against Patricio Pron, an award-winning novelist. This head-to-head contest aimed to provide a definitive answer to whether AI can truly rival human creativity at its peak.

Previous research, while valuable, often focused on different aspects of AI and creative writing. Some explored human-AI collaboration, where AI tools assisted human writers, while others highlighted the limitations of LLMs in maintaining narrative coherence or generating truly original content. This paper distinguishes itself by focusing on autonomous LLM creative writing, directly comparing the output of GPT-4 to Pron's work without human intervention.

The experiment itself was elegantly designed. Both GPT-4 and Pron were tasked with generating movie titles and then writing synopses for all the titles generated. This ensured a symmetrical comparison, giving both contenders the same creative challenges. To evaluate the results, the researchers enlisted literary experts who used a rubric based on Boden's framework of creativity, assessing qualities like originality, attractiveness, and the distinct voice of the author.

The findings were revealing. Across all quality dimensions and in both English and Spanish, Patricio Pron consistently received significantly higher ratings. This suggests that while LLMs can produce grammatically correct and even engaging text, they still struggle to replicate the depth, nuance, and originality that characterize truly great creative writing.

Interestingly, the study also highlighted the importance of prompts in guiding LLM creativity. When GPT-4 wrote synopses based on titles provided by Pron, its performance, particularly in style and originality, significantly improved. This suggests that while LLMs may not yet be independent creative powerhouses, they can be valuable tools when guided by human ingenuity.

The study's findings offer a dose of reality amidst the hype surrounding AI. While LLMs have made impressive strides, they are not yet ready to replace human authors. The human spark of creativity, with its ability to weave compelling narratives, evoke emotions, and surprise readers with unexpected turns, remains a distinctly human trait. This is not to say that AI has no place in the creative process. As the study demonstrates, LLMs can be valuable partners, enhancing and augmenting human creativity. However, the role of the human author, with their unique perspective and mastery of language, remains secure, at least for now.
2/n Experiments and Noteworthy Results:

The paper conducts a two-stage experiment:

Stage 1: Title Generation:

Both GPT-4 and Patricio Pron were tasked with generating 30 movie titles each.

Stage 2: Synopsis Writing:

Both contenders wrote 600-word synopses for all 60 titles (their own and their opponent's).
GPT-4 was provided with a prompt that included information about Patricio Pron and emphasized the importance of creativity and literary value.

Evaluation:

Six literary experts (three for Spanish, three for English) assessed the synopses using a rubric based on Boden's framework of creativity, considering the following dimensions (a toy scoring sketch follows this list):
* Attractiveness
* Originality
* Creativity
* Critical Assessment
* Own Voice (recognizable style)
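Purely as an illustration of how such per-dimension expert ratings might be aggregated and compared between the two authors (this is not the paper's analysis code; the scores, scale, and sample size below are made up):

```python
# Illustrative only: aggregate made-up expert ratings per quality dimension and
# compare the two authors; the scores and the 1-10 scale are assumptions.
from collections import defaultdict
from statistics import mean

ratings = [  # (author, dimension, score)
    ("Pron", "Originality", 9), ("Pron", "Attractiveness", 8),
    ("GPT-4", "Originality", 6), ("GPT-4", "Attractiveness", 7),
]

by_key = defaultdict(list)
for author, dimension, score in ratings:
    by_key[(author, dimension)].append(score)

for (author, dimension), scores in sorted(by_key.items()):
    print(f"{author:6s} {dimension:15s} mean = {mean(scores):.2f}")
```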

Noteworthy Results:
Human Superiority: Patricio Pron consistently received significantly higher ratings across all quality dimensions in both Spanish and English, indicating that GPT-4, even in its advanced form, is not yet a match for a top human author in creative writing.

Prompt's Influence: GPT-4 performed significantly better when writing synopses based on titles provided by Patricio Pron, particularly in terms of style and originality. This highlights the importance of prompts in guiding LLM creativity.

Language Gap: GPT-4's creative writing was found to be stronger in English than in Spanish, suggesting a potential language bias in training data.

Recognizable Style: While GPT-4 was not explicitly constrained in terms of style, expert assessors were able to identify its writing with increasing accuracy over time, indicating the presence of detectable patterns in its output.
3/n Here is the full paper: arxiv.org/abs/2407.01119
Aug 25
1/n How Agentic AI Can Learn Strategic Thinking Through Self-Improvement and Bi-Level Search

Large Language Models (LLMs) have demonstrated remarkable abilities in understanding and generating human-like text, but their capacity for strategic decision-making in complex environments has remained a challenge. This challenge is particularly evident in multi-agent games, where success hinges on anticipating and outmaneuvering opponents who are constantly adapting their own strategies. The "STRATEGIST" paper tackles this challenge head-on, proposing a novel framework that empowers LLMs to learn sophisticated strategic skills through a process of self-improvement and bi-level tree search.

Traditional approaches to LLM-based decision-making have often fallen short in these complex settings. Directly controlling actions with LLMs, while intuitive, becomes computationally infeasible as the number of possible actions explodes. Similarly, while LLM-based planning methods show promise, they often struggle to learn reusable strategies, instead focusing on planning at the individual action level. Reinforcement learning, while achieving superhuman performance in certain games, typically demands massive datasets and struggles to generalize across different domains.

STRATEGIST differentiates itself by focusing on the acquisition of high-level strategic skills rather than simply searching for the best action in every possible scenario. The framework centers around two key components:

High-Level Strategy Learning: Instead of directly selecting actions, the LLM learns to evaluate game states and generate effective dialogue strategies. This is achieved through:

Value Heuristics: The LLM learns functions that assess the favorability of different game states, allowing it to prioritize advantageous positions.
Dialogue Strategy Guides: Structured prompts guide the LLM in generating persuasive and strategically sound dialogue within the game, taking into account the social dynamics of the environment.

Low-Level Action Selection (MCTS):
To bridge the gap between strategic thinking and concrete actions, STRATEGIST employs Monte Carlo Tree Search (MCTS). This search method explores possible future game states, providing the LLM with more accurate estimates of state values and guiding it towards better immediate actions.

The learning process itself is driven by a continuous loop of self-play, reflection, and improvement. The LLM engages in simulated games, analyzes the outcomes to identify weaknesses in its strategies, and generates ideas for improvement. This reflective process is guided by examining key states where the LLM's predictions diverged from the actual simulation results. The most promising improvement ideas are then implemented, refining the LLM's value heuristics or dialogue guides.

The effectiveness of STRATEGIST is demonstrated through experiments on two distinct games: the strategic card game GOPS and the social deduction game Avalon. In both settings, STRATEGIST consistently outperforms baseline methods, showcasing the power of combining high-level strategy learning with low-level action planning. The results highlight the importance of both components, as removing either significantly diminishes performance.

The paper's findings offer compelling evidence for the potential of STRATEGIST to enhance LLM-based decision-making in complex, multi-agent environments. The framework's ability to learn generalizable strategic skills through self-improvement and search paves the way for LLMs to tackle increasingly sophisticated challenges in domains ranging from game playing to real-world strategic interactions. As LLMs continue to evolve, frameworks like STRATEGIST will be crucial in unlocking their full potential for strategic thinking and decision-making in our increasingly complex world.
2/n Comparison with Other Methods

Direct LLM Control (e.g., SayCan, ReAct): These approaches directly use LLMs to select actions in a given state by prompting them with the current context.
Contrast: STRATEGIST argues that this is inefficient for complex games due to the vast action space. Instead, it advocates for learning higher-level strategic skills that guide action selection.

LLM-based Planning (e.g., Tree of Thoughts): These methods use LLMs to generate and reason over possible action sequences, often using tree search algorithms.
Contrast: While STRATEGIST also uses tree search (MCTS), it primarily focuses on learning reusable strategic skills (value heuristics, dialogue guides) rather than planning at the individual action level.

Reinforcement Learning (RL) for Games (e.g., AlphaGo, AlphaZero): RL methods have achieved superhuman performance in games, but they typically require massive amounts of training data and are often domain-specific.
Contrast: STRATEGIST leverages LLMs' existing world knowledge and reasoning abilities to learn effective strategies with less data. It also aims for more generalizable skills that can transfer across similar game environments.
3/n STRATEGIST Algorithm

STRATEGIST operates in a continuous loop of self-play, reflection, and improvement, aiming to learn effective strategies for multi-agent games. The algorithm consists of two primary levels:

1. High-Level Strategy Learning

This level focuses on learning reusable strategic skills that guide decision-making in the game. The two main types of skills are:

Value Heuristics: Functions that estimate the "goodness" or favorability of a given game state for the agent. A well-trained value heuristic allows the agent to prioritize actions leading to more advantageous positions.

Dialogue Strategy Guides: In games involving communication, these guides provide structured prompts to help the LLM generate effective and strategically sound dialogue. This could involve asking pertinent questions, providing misleading information (if deception is part of the game), or coordinating actions with teammates.

The high-level learning loop works as follows (a simplified code sketch follows this list):

1. Initialization:
* The LLM is initialized with some basic strategies. This could involve:
* Randomly initialized value heuristics (e.g., a neural network that takes the game state as input and outputs a score).
* Simple rule-based dialogue guides (e.g., "If you have a good hand, act confident").
2. Self-Play Simulations:
* The LLM, equipped with its current strategies, plays multiple games against itself or other agents (potentially using different strategies).
* These simulations can be performed using Monte Carlo Tree Search (MCTS) for action selection (explained in the next section).
3. Reflection and Idea Generation:
* The LLM analyzes the outcomes of the simulations, focusing on:
* Identifying key states where its predictions (based on its value heuristics) differed significantly from the actual simulation results.
* Analyzing dialogue exchanges to identify missed opportunities or ineffective communication strategies.
* Based on this analysis, the LLM generates "improvement ideas" for its strategies. These could involve:
* Adjusting the weights of its value heuristic neural network.
* Adding new rules or refining existing ones in its dialogue strategy guide.
4. Strategy Improvement:
* The LLM selects the most promising improvement ideas (potentially using a bandit algorithm like UCB to balance exploration and exploitation).
* It implements these ideas, updating its value heuristics or dialogue guides accordingly.
5. Repeat:
* The process loops back to step 2, with the LLM now using its improved strategies in the next round of self-play simulations.
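Below is a minimal, assumed sketch of this improvement loop, not the paper's implementation: the value heuristic is a plain weighted sum, self-play is stubbed out against a synthetic target, and improvement ideas are selected with a simple UCB1 bandit as mentioned in step 4.

```python
# Assumed sketch of STRATEGIST's high-level loop (not the paper's code):
# self-play -> reflection on prediction error -> candidate improvement ideas
# -> UCB-style selection -> strategy update, then repeat.
import math
import random

def value_heuristic(state, weights):
    """Toy value heuristic: a weighted sum of state features."""
    return sum(w * f for w, f in zip(weights, state))

def self_play_error(weights, n_games=30):
    """Stub for self-play: mean gap between predicted and 'actual' game value."""
    target = [0.2, 0.9, -0.4]                      # pretend 'true' feature weights
    err = 0.0
    for _ in range(n_games):
        state = [random.random() for _ in range(len(weights))]
        err += abs(value_heuristic(state, weights) - value_heuristic(state, target))
    return err / n_games

IDEAS = [(i, d) for i in range(3) for d in (+0.1, -0.1)]   # "nudge weight i by d"

def ucb_pick(stats, c=1.4):
    """UCB1 over the fixed menu of improvement ideas (explore vs. exploit)."""
    total = sum(n for n, _ in stats.values()) or 1
    def score(idea):
        n, reward = stats.get(idea, (0, 0.0))
        return float("inf") if n == 0 else reward / n + c * math.sqrt(math.log(total) / n)
    return max(IDEAS, key=score)

weights, stats = [0.5, 0.5, 0.5], {}
for _ in range(50):                                # improvement iterations
    before = self_play_error(weights)              # "reflection": how wrong are we?
    i, d = ucb_pick(stats)
    weights[i] += d                                # apply the chosen improvement idea
    after = self_play_error(weights)
    n, r = stats.get((i, d), (0, 0.0))
    stats[(i, d)] = (n + 1, r + (before - after))  # reward ideas that reduce error
    if after > before:                             # revert ideas that made things worse
        weights[i] -= d
print("learned weights:", [round(w, 2) for w in weights])
```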

2. Low-Level Action Selection (MCTS)

While the high-level loop focuses on learning general strategies, MCTS provides an efficient way to select the best action in a specific game state, given the current strategies.
How MCTS works (a compact generic sketch follows this list):

1. Tree Search: Starting from the current game state, MCTS builds a search tree by simulating possible future game states. Each node in the tree represents a game state, and each edge represents an action.
2. Rollouts: For each new node (state) reached during the search, MCTS performs multiple "rollouts" - simulated games played out until completion using a simple, often random, policy.
3. Backpropagation: The results of the rollouts are used to update the value estimates of the nodes in the search tree. Nodes that led to more wins for the agent will have their values increased.
4. Action Selection: After a certain number of simulations, MCTS selects the action leading to the child node with the highest value estimate from the current state.
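Here is a compact, generic MCTS sketch on a toy counting game (add 1 or 2, try to land exactly on 10). It is an assumed illustration of the four steps above, not the STRATEGIST implementation, which couples MCTS with the LLM's learned value heuristics.

```python
# Generic MCTS sketch on a toy game (not the paper's implementation):
# selection (UCT) -> expansion -> random rollout -> backpropagation.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def legal_actions(state):          # toy game: add 1 or 2 until reaching 10 or more
    return [1, 2] if state < 10 else []

def step(state, action):
    return state + action

def rollout(state):
    """Random playout; reward 1 if we land exactly on 10, else 0."""
    while legal_actions(state):
        state = step(state, random.choice(legal_actions(state)))
    return 1.0 if state == 10 else 0.0

def uct(parent, child, c=1.4):
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root_state, n_sims=500):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(legal_actions(node.state)):
            node = max(node.children.values(), key=lambda ch: uct(node, ch))
        # 2. Expansion: add one untried child, if any.
        untried = [a for a in legal_actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Rollout from the new node.
        reward = rollout(node.state)
        # 4. Backpropagation of the result up the tree.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print("best first move from 7:", mcts(7))
```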

Prerequisite Training
LLM Pre-training: STRATEGIST assumes the LLM has undergone extensive pre-training on a massive text dataset. This pre-training equips the LLM with:
* Strong language understanding and generation capabilities.
* A broad base of world knowledge that can be leveraged for reasoning about the game.

No Fine-tuning Required: Importantly, STRATEGIST does not require any game-specific fine-tuning of the LLM. The learning process relies entirely on the LLM's pre-trained knowledge and its ability to learn from self-play and reflection.

Key Advantages of STRATEGIST:
Leverages Existing LLM Knowledge: STRATEGIST avoids the need for massive game-specific datasets by tapping into the LLM's pre-trained knowledge.

Learns Generalizable Strategies: The focus on high-level strategic skills promotes generalization, enabling the LLM to adapt to different opponents and even transfer knowledge to similar games.

Human-Interpretable Learning: The process of reflection and idea generation provides insights into the LLM's strategic thinking, making it easier to understand and potentially debug its decision-making process.
Aug 18
1/n How Understanding Stateful Tools Advances Agentic AI

The rapid advancement of Large Language Models (LLMs) has ignited a wave of excitement and research into their potential for interacting with and manipulating the world around them. Imagine LLMs not just as eloquent conversationalists, but as capable agents, utilizing tools to complete tasks, answer questions, and even control physical systems. This exciting prospect, however, hinges on our ability to accurately evaluate and understand their tool-use capabilities. This is where existing benchmarks fall short, struggling to capture the nuances of real-world scenarios. The paper from Apple, "TOOLSANDBOX: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities," directly addresses this pain point, introducing a novel benchmark that pushes the boundaries of LLM evaluation.

Previous benchmarks, while valuable, often simplified the evaluation process. They primarily focused on stateless tools, neglecting the complexities of mutable world states. Single-turn interactions were the norm, failing to capture the dynamic back-and-forth of natural conversations. This is where TOOLSANDBOX diverges. It embraces the complexity of real-world tool use by incorporating stateful tools that interact with a dynamic world state. This allows researchers to assess an LLM's ability to understand, track, and manipulate this state to achieve its goals.
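To make "stateful tool" concrete, here is a toy sketch of the idea, assumed for illustration rather than taken from the benchmark's actual API: two tools share a mutable world state, and one of them only works if the state has been set up correctly first.

```python
# Toy illustration of stateful tools (assumed; not TOOLSANDBOX's actual API).
# Tools read and mutate a shared world state, so call order and state tracking matter.
world_state = {"cellular_enabled": False, "contacts": {"Alice": "+1-555-0100"}}

def enable_cellular():
    """Stateful tool: flips a switch in the world state."""
    world_state["cellular_enabled"] = True
    return "cellular service enabled"

def send_message(name, text):
    """Stateful tool with a state dependency: fails unless cellular is on."""
    if not world_state["cellular_enabled"]:
        return "error: cellular service is disabled"
    number = world_state["contacts"].get(name)
    return f"sent to {number}: {text}" if number else f"error: no contact named {name}"

# An agent that ignores the state dependency fails; one that tracks state succeeds.
print(send_message("Alice", "running late"))   # error: cellular service is disabled
print(enable_cellular())
print(send_message("Alice", "running late"))   # sent to +1-555-0100: running late
```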

Furthermore, TOOLSANDBOX moves beyond static, single-turn interactions by introducing an LLM-based user simulator. This simulator, enhanced by "Knowledge Boundary" and "Demonstration" prompting techniques, enables realistic, multi-turn conversations, pushing LLMs to comprehend implicit information and adapt to evolving dialogues. This on-policy evaluation, where the LLM's actions directly influence the interaction, provides a more accurate representation of its true capabilities.

The experiments conducted using TOOLSANDBOX yielded fascinating insights. While proprietary models like OpenAI's GPT-4 and Anthropic's Claude variants demonstrated impressive performance, highlighting their advanced reasoning and state-tracking abilities, open-source models lagged significantly. This performance gap underscores the ongoing challenges in developing truly capable open-source alternatives.

The experiments also revealed critical areas for improvement. LLMs, particularly open-source models, struggled with managing and reasoning about the world state and effectively utilizing information from tool responses. This highlights the need for further research in state management, tool representation, and information integration.

The introduction of TOOLSANDBOX marks a significant step forward in LLM evaluation. By embracing statefulness, conversation, and interactivity, it provides a more realistic and comprehensive assessment of LLM tool-use capabilities. As we venture further into the era of tool-wielding LLMs, robust benchmarks like TOOLSANDBOX will be essential for tracking progress, identifying limitations, and ultimately, unlocking the full potential of these powerful technologies.
2/n The paper describes experiments conducted using TOOLSANDBOX to evaluate both open-source and proprietary LLMs across a variety of tool-use scenarios. Here's a breakdown of the experiments and noteworthy results:

Experiments:

Test Scenarios: 1032 human-authored test cases designed to cover diverse and challenging tool-use scenarios. These scenarios were categorized based on:
* Number of tool calls and user turns required.
* Presence of state dependencies between tools.
* Need for canonicalization (resolving ambiguous information).
* Handling of insufficient information (avoiding hallucination).

Models Evaluated: Both open-source and proprietary LLMs were evaluated, including:
* OpenAI's GPT-3.5-turbo and GPT-4.
* Anthropic's Claude-instant-v1 and Claude-v1.3.
* Several open-source models.

Metrics (a toy scoring sketch follows this list):
* Milestone Achievement: Measures how well the agent completes the critical steps defined by the Milestones.
* Minefield Avoidance: Evaluates the agent's ability to avoid incorrect or undesirable actions.
* Turn Count: Tracks the efficiency of the agent in completing the task.
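As an assumed illustration of how such metrics could be computed (not the benchmark's actual implementation), milestones and minefields can be thought of as predicates checked against the recorded trajectory:

```python
# Assumed sketch of milestone / minefield scoring (not the paper's implementation):
# milestones and minefields are predicates evaluated on the recorded trajectory.
trajectory = {
    "final_state": {"cellular_enabled": True, "message_sent_to": "Alice"},
    "tool_calls": ["enable_cellular", "send_message"],
    "turns": 4,
}

milestones = [
    lambda t: t["final_state"]["cellular_enabled"],            # cellular was turned on
    lambda t: t["final_state"]["message_sent_to"] == "Alice",  # message reached Alice
]
minefields = [
    lambda t: "delete_all_contacts" in t["tool_calls"],        # destructive action to avoid
]

milestone_score = sum(m(trajectory) for m in milestones) / len(milestones)
minefield_ok = not any(m(trajectory) for m in minefields)
print(f"milestones: {milestone_score:.0%}, minefields avoided: {minefield_ok}, turns: {trajectory['turns']}")
```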

Noteworthy Performance Results:
Significant Gap Between Open-Source and Proprietary Models: Open-source models exhibited significantly lower performance compared to proprietary models (GPT-4 and Claude variants) across all scenario categories. This highlights the considerable gap that still exists in capabilities.
GPT-4's Superior Performance: GPT-4 consistently outperformed other models, demonstrating advanced reasoning, state tracking, and conversational abilities in complex tool-use scenarios.
Strong Performance of Claude Models: Claude models, particularly Claude-v1.3, also showed strong performance, indicating their competence in tool-assisted settings. However, Claude-instant-v1 lagged in scenarios involving complex state dependencies.
Challenges in State Management and Tool-Response Consumption: Open-source models particularly struggled with managing and reasoning about the world state, as well as effectively utilizing information from tool responses.
Impact of Tool Augmentations: Ablation studies showed that increasing distractions (irrelevant tools) and reducing tool information (uninformative names, missing descriptions) significantly impacted the performance of all models. This emphasizes the importance of clear and concise tool representations for effective tool use.
Importance of User Simulator Prompting: Experiments with different user simulator prompting strategies demonstrated that incorporating Knowledge Boundary and Demonstration significantly improved the realism and robustness of the simulated user, leading to more accurate evaluations.

Overall, the experiments conducted using TOOLSANDBOX provide valuable insights into the capabilities and limitations of current LLMs in tool-assisted settings. The results highlight the challenges that remain, setting the stage for future research and development in this critical area.
3/n Related Work

BFCL (Berkeley Function Calling Leaderboard):
Contrast with TOOLSANDBOX:
Stateless tools: BFCL primarily focuses on evaluating LLMs with stateless web services (RESTful APIs), while TOOLSANDBOX incorporates stateful tools and a mutable world state.
Single-turn evaluation: BFCL relies on single-turn user queries, whereas TOOLSANDBOX supports multi-turn conversational interactions with an LLM-based user simulator.
Predefined trajectory: BFCL evaluates against a fixed set of predefined trajectories, limiting the ability to assess the agent's own policy. TOOLSANDBOX allows for dynamic, on-policy trajectory collection.

ToolEval:
Contrast with TOOLSANDBOX:
Stateless tools: Similar to BFCL, ToolEval primarily uses stateless tools, lacking the stateful aspect of TOOLSANDBOX.
Limited conversation evaluation: While ToolEval allows for multiple rounds of tool interaction, it doesn't have a dedicated user simulator for realistic conversational evaluation.
LLM-based evaluation: ToolEval relies on an LLM evaluator to judge the final outcome, raising concerns about reliability and interpretability. TOOLSANDBOX employs a more objective evaluation based on Milestones and Minefields.

API-Bank:
Contrast with TOOLSANDBOX:
Limited state dependency exploration: While API-Bank includes some state-modifying tools, it doesn't explicitly focus on evaluating the agent's ability to handle state dependencies.
Off-policy evaluation: API-Bank evaluates on predefined, off-policy dialog trajectories, unlike TOOLSANDBOX's on-policy approach.
Static evaluation: API-Bank relies on static, turn-wise evaluation metrics based on predefined trajectories. TOOLSANDBOX allows for more flexible and dynamic evaluation using Milestones and Minefields.
Aug 16
1/n Show, Don't Tell: Low Cost Personalized Large Language Models

Large language models (LLMs) have revolutionized our interaction with technology, showcasing remarkable abilities in understanding and generating human-like text. However, their training on massive, general-purpose datasets often leads to outputs that lack the personal touch, failing to capture the nuances of individual writing styles and task-specific requirements. While powerful, these LLMs can feel like generic one-size-fits-all tools, struggling to adapt to the diverse needs of individual users.

Addressing this critical gap between powerful LLMs and personalized language generation is the core focus of the paper "Show, Don't Tell: Aligning Language Models with Demonstrated Feedback." The authors introduce DITTO (Demonstration ITerated Task Optimization), a method that deviates from the data-heavy approaches of the past, instead empowering users to efficiently customize LLMs using a handful of demonstrations.

Traditional LLM alignment techniques, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), rely on vast datasets of labeled examples or preferences. While effective, these methods are impractical for individual users who cannot afford to generate such large amounts of data. Prompting, while data-efficient, often becomes a tedious guessing game, requiring careful crafting of input phrases to steer the LLM towards desired outputs. Other approaches, like Constitutional AI, rely on pre-defined principles that may not capture the nuances of individual preferences.

DITTO breaks free from these limitations by leveraging the LLM itself to generate comparison data from a small set of user demonstrations. Instead of telling the model what to do through complex instructions or thousands of examples, DITTO allows users to show the desired behavior directly. This direct alignment with demonstrations provides a more intuitive and efficient way of communicating preferences to the model.

The paper demonstrates the effectiveness of DITTO through a series of compelling experiments. In automatic evaluations on benchmark datasets of author-specific writing, DITTO consistently outperforms existing methods, including SFT, few-shot prompting, and even self-play methods like SPIN. Furthermore, a user study on email writing showcases DITTO's ability to adapt to real-world scenarios, outperforming not only standard baselines but also user-constructed prompts. This highlights the advantage of learning directly from demonstrations rather than relying on users to articulate their preferences through potentially ambiguous prompts.

Perhaps the most striking finding is DITTO's remarkable sample efficiency. Compared to traditional preference-based methods, DITTO achieves comparable performance with an order of magnitude fewer feedback samples. This makes it a practical solution for individual users who can now customize LLMs with just a handful of examples.

In conclusion, DITTO marks a significant step towards a new era of personalized language models. By shifting from "telling" to "showing," it empowers users to mold powerful LLMs to their specific needs and preferences. This opens up exciting possibilities for a future where LLMs are no longer generic tools but personalized assistants that can adapt to the unique voice and tasks of each individual.
2/n Comparison with other approaches

1. Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF):

Prior Work: These methods train LLMs on large datasets of human-labeled text or preferences.
DITTO Contrast: DITTO is significantly more data-efficient, requiring only a handful of demonstrations instead of thousands of examples. It achieves this by leveraging the LLM itself to generate comparison data.

2. Prompting:

Prior Work: Prompting involves crafting specific input phrases to guide the LLM's output.
DITTO Contrast: While prompting can be data-efficient, it often requires tedious trial-and-error to find effective prompts. DITTO provides a more direct and intuitive way of aligning the model by learning from demonstrations rather than relying on prompt engineering.

3. Constitutional AI:

Prior Work: This method automatically generates preference data using the LLM itself, guided by pre-defined principles.
DITTO Contrast: DITTO does not rely on pre-defined principles, making it more flexible and adaptable to individual preferences. It directly learns from user demonstrations, capturing more nuanced aspects of desired behavior.

4. Group Preference Optimization (GPO):

Prior Work: GPO aims for few-shot alignment by meta-learning preference groups from a large dataset.
DITTO Contrast: DITTO does not require a large pre-existing dataset for meta-learning. It focuses on individual user adaptation and can learn directly from a small number of demonstrations provided by that user.

5. Self-Play Methods (e.g., SPIN):

Prior Work: These methods improve LLMs through iterative self-play, often using a stronger language model as a critic.
DITTO Contrast: DITTO is designed for data-limited scenarios and does not require an external critic or a large number of demonstrations. It focuses on aligning with specific user preferences rather than achieving general self-improvement.

6. Online Imitation Learning:

Prior Work: Traditional online imitation learning methods typically focus on continuous control tasks and often require explicit reward function learning.
DITTO Contrast: DITTO adapts online imitation learning principles to the discrete text generation setting of LLMs. It implicitly learns a reward function from demonstrations and efficiently generates comparison data online.
3/n Here's a breakdown of the DITTO approach with examples (a toy code sketch follows at the end):

1. Start with Demonstrations:

The user provides a few (typically less than 10) demonstrations of the desired output for a specific task or writing style.
Example: Imagine you want to train an LLM to write emails in a concise and informal style. You provide the following demonstrations:
Demonstration 1 (Original): "Dear Professor Smith, I hope this email finds you well. I am writing to inquire about the possibility of scheduling a meeting to discuss my research proposal."
Demonstration 1 (Edited): "Hi Prof. Smith, Hope you're doing well! Wondering if you're free to chat about my research proposal sometime next week."
Demonstration 2 (Original): "I would be grateful if you could please provide me with feedback on the attached document at your earliest convenience."
Demonstration 2 (Edited): "Let me know what you think of the attached doc when you get a chance!"

2. Generate Comparisons:
DITTO treats user demonstrations as the "expert" and compares them to outputs generated by the LLM.
The key insight is that the LLM's own outputs, even if imperfect, can be used as valuable training data when contrasted with the demonstrations.

Example: The LLM might generate the following output for a new email: "Dear Dr. Jones, I am writing to request your availability for a meeting..." DITTO would identify that this output is less concise and less informal than the user's demonstrations.

3. Rank Outputs and Create a Preference Dataset:

DITTO doesn't just rely on pairwise comparisons (demonstration vs. LLM output). It leverages outputs from all training iterations, creating a richer preference dataset.
"Replay" Comparisons: Compare current LLM outputs to past demonstrations.
"Intermodel" Comparisons: Compare outputs from different training iterations to identify improvements.
Example: DITTO might compare the current output ("Dear Dr. Jones...") with a previous iteration that generated an even more formal email ("To the esteemed Dr. Jones...").
This comparison highlights progress and helps the model learn to move further towards the desired style.

4. Iterative Training with DPO:

DITTO uses the ranked preference dataset to iteratively train the LLM using an alignment algorithm like DPO (Direct Preference Optimization).
DPO updates the LLM's parameters to generate outputs that are more likely to be preferred based on the demonstrated rankings.
Example: Through iterative training, the LLM learns to favor concise language, informal greetings, and directness in its email writing style.
In essence, DITTO guides the LLM towards the desired behavior by:
Learning from mistakes: Contrasting its own outputs with user demonstrations to identify areas for improvement.
Building on progress: Leveraging outputs from previous iterations to reinforce positive changes and avoid repeating mistakes.
Iteratively refining its understanding: Continuously updating its internal representations based on the evolving preference dataset.

By effectively leveraging the LLM's own generations as a source of learning, DITTO offers a data-efficient and intuitive approach for aligning these powerful models with individual preferences and tasks.
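Putting the steps together, here is a minimal, assumed sketch of how DITTO-style preference pairs could be constructed from a handful of demonstrations. The generation step is stubbed out, and a real implementation would feed the resulting pairs into a DPO update each iteration; this is an illustration of the idea, not the paper's code.

```python
# Assumed sketch of DITTO-style preference-pair construction (not the paper's code).
# Each pair says "chosen" should be preferred over "rejected"; a DPO trainer would
# consume these pairs to update the model. LLM sampling is stubbed out here.

demonstrations = [
    "Hi Prof. Smith, hope you're doing well! Free to chat about my proposal next week?",
    "Let me know what you think of the attached doc when you get a chance!",
]

def generate(prompt, iteration):
    """Stub for LLM sampling; imagine later iterations drift toward the demos' style."""
    return f"[iteration {iteration} draft for: {prompt}]"

preference_pairs = []
history = []                                     # outputs from earlier iterations
for iteration in range(3):
    outputs = [generate(d, iteration) for d in demonstrations]
    for demo, out in zip(demonstrations, outputs):
        # Demonstrations are treated as expert behavior: preferred over any model output.
        preference_pairs.append({"chosen": demo, "rejected": out})
    for past_out in history:
        for out in outputs:
            # Intermodel comparisons: outputs from later iterations are preferred
            # over outputs from earlier iterations.
            preference_pairs.append({"chosen": out, "rejected": past_out})
    history.extend(outputs)
    # A real implementation would run a DPO update on preference_pairs here.

print(f"built {len(preference_pairs)} preference pairs over 3 iterations")
```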
Aug 13
1/n OpenDevin's Radical Approach to Agentic AI

The rapid advancement of large language models (LLMs) has ushered in a new era of AI agents capable of interacting with and impacting their environments in increasingly sophisticated ways. However, developing and evaluating these agents for complex, real-world tasks presents significant challenges. Existing frameworks often struggle to provide the necessary tools, environments, and interfaces for building truly versatile and robust AI agents. The OpenDevin platform, as presented in the paper "OpenDevin: An Open Platform for AI Software Developers as Generalist Agents," directly addresses these limitations, offering a novel approach that empowers AI agents to interact with the world more like human software developers – through code, command lines, and web browsing.

One of the key pain points OpenDevin tackles is the inherent complexity of developing and evaluating advanced AI agents. Traditional frameworks often rely on simplified environments and limited action spaces, hindering the development of agents capable of tackling real-world tasks. OpenDevin breaks free from these constraints by providing a realistic environment that includes a sandboxed Linux operating system and a fully functional web browser. This allows agents to interact with real-world tools and data sources, enabling them to tackle more meaningful and impactful challenges. Moreover, OpenDevin's standardized evaluation framework, encompassing a diverse set of established benchmarks, ensures consistent and comprehensive assessment of agent capabilities across various domains.

Another significant limitation addressed by OpenDevin is the lack of a standardized and powerful interface for agent-world interaction. While some frameworks rely on pre-defined tool sets or JSON-based function calls, OpenDevin embraces code execution and web browsing as its primary interaction mechanisms. This allows agents to leverage the flexibility and expressiveness of programming languages, breaking free from the limitations of rigid action spaces and enabling them to solve complex problems in a more human-like manner.

Recognizing the importance of reusable components in software development, OpenDevin introduces the AgentSkills library – a centralized and extensible collection of tools for common agent tasks. This modular design simplifies the development process and encourages community contributions, fostering a collaborative ecosystem for building and sharing specialized agent capabilities. Furthermore, OpenDevin tackles the challenge of multi-agent collaboration by incorporating a delegation mechanism. This allows developers to create teams of specialized agents, each excelling in specific domains, to work together and solve complex problems more effectively.
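As a rough illustration of these two ideas, a reusable skills library plus delegation to specialist agents, here is a toy sketch; the names and interfaces are assumptions for illustration and are not OpenDevin's actual AgentSkills API or delegation protocol.

```python
# Illustrative only: a toy skills registry and delegation pattern in the spirit of
# what the paper describes; names and APIs are assumptions, not OpenDevin's code.
SKILLS = {}

def skill(fn):
    """Register a reusable tool in a central library."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

@skill
def search_text(text: str, needle: str):
    """Return the line numbers where `needle` occurs."""
    return [i for i, line in enumerate(text.splitlines()) if needle in line]

class SpecialistAgent:
    """An agent restricted to a subset of the shared skills library."""
    def __init__(self, name, allowed_skills):
        self.name = name
        self.allowed = {k: SKILLS[k] for k in allowed_skills}
    def run(self, skill_name, **kwargs):
        return self.allowed[skill_name](**kwargs)

class Coordinator:
    """Delegates sub-tasks to specialists instead of doing everything itself."""
    def __init__(self, specialists):
        self.specialists = {a.name: a for a in specialists}
    def delegate(self, agent_name, skill_name, **kwargs):
        return self.specialists[agent_name].run(skill_name, **kwargs)

coder = SpecialistAgent("coder", ["read_file", "search_text"])
team = Coordinator([coder])
print(team.delegate("coder", "search_text", text="a\nTODO b", needle="TODO"))  # [1]
```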

The effectiveness of OpenDevin's approach is evident in its experimental results. Evaluated on 15 established benchmarks spanning software engineering, web browsing, and general assistance tasks, OpenDevin agents demonstrate strong and competitive performance across the board. The agents excel in tasks like code generation, web navigation, information extraction, and problem-solving, highlighting the platform's versatility and the power of its core design principles.

In conclusion, OpenDevin represents a significant leap forward in AI agent development. By providing a realistic environment, a powerful and flexible interface, an extensible skill library, and support for multi-agent collaboration, OpenDevin empowers researchers and developers to create more capable, versatile, and robust AI agents. The platform's promising experimental results and its community-driven approach pave the way for a future where AI agents seamlessly integrate into our world, assisting us in tackling complex challenges and pushing the boundaries of what's possible with artificial intelligence.
2/n Comparison with Other Systems

1. AutoGPT, LangChain, MetaGPT, AutoGen, Agents, Xagents, OpenAgents, GPTSwarm:

Category: These are general-purpose AI agent frameworks, often focused on chaining together various tools and APIs to accomplish tasks.
Contrast with OpenDevin: While these frameworks offer flexibility in tool integration, they often lack a standardized and powerful interface for interacting with the world. They may rely on pre-defined tool sets or JSON-based function calls, which can limit agent capabilities and generalization. OpenDevin, on the other hand, empowers agents to interact with the world more directly through code execution and web browsing, providing greater flexibility and expressiveness. Additionally, OpenDevin places a strong emphasis on a sandboxed environment, agent skill library, and systematic evaluation, which are not always central to these other frameworks.

2. AutoCodeRover, SWE-Agent:

Category: These frameworks are specifically designed for software engineering tasks, enabling agents to write, debug, and test code.
Contrast with OpenDevin: While these frameworks excel in software development domains, OpenDevin aims to be more general-purpose. It includes software development capabilities but also extends to web browsing and other tasks through its flexible interface and agent skill library. OpenDevin also emphasizes multi-agent collaboration, which is not a primary focus in these more specialized frameworks.

3. BabyAGI, AgentVerse:

Category: These frameworks focus on building autonomous agents that can manage and execute tasks over extended periods, often with minimal human intervention.
Contrast with OpenDevin: While OpenDevin supports autonomous agent behavior, it also emphasizes human-in-the-loop scenarios and provides tools for interactive agent development and debugging. OpenDevin's focus on a realistic environment and standardized evaluation also sets it apart from these frameworks, which may rely on more simplified task representations or simulations.

4. ReAct, Toolformer:

Category: These are research efforts focusing on specific techniques for enhancing agent capabilities, such as reasoning with actions (ReAct) or learning to use tools (Toolformer).
Contrast with OpenDevin: OpenDevin is a platform that can incorporate and benefit from these research advancements. It provides a framework where techniques like ReAct or Toolformer can be implemented and evaluated within a broader context of agent development and real-world interaction.

In summary:

OpenDevin distinguishes itself from prior work by combining the following features:

Powerful and flexible interface based on code execution and web browsing.
Realistic environment with a sandboxed operating system and web browser.
Extensible library of agent skills and tools.
Support for multi-agent collaboration through delegation.
Standardized evaluation framework with diverse benchmarks.

These features address the limitations of existing frameworks and pave the way for developing more capable, versatile, and reliable AI agents that can effectively interact with and solve real-world problems.
3/n Key design motivations

Complexity of Development and Evaluation: OpenDevin provides a structured framework with a clear separation of concerns (agent logic, environment, skills library) that simplifies both development and evaluation of complex AI agents. Its standardized evaluation framework and diverse benchmarks allow for consistent and comprehensive assessment of agent capabilities.

Limited Real-World Interaction: OpenDevin grants agents access to a sandboxed Linux environment and a fully functional web browser. This enables interaction with real-world tools and data sources, allowing agents to tackle more realistic and complex tasks beyond simulated environments.

Lack of a Standardized Interface: OpenDevin introduces a powerful and consistent interface based on code execution and web browsing actions. This allows agents to leverage the flexibility and expressiveness of programming languages to interact with the environment, breaking free from the limitations of pre-defined action spaces.

Difficulty in Creating and Maintaining Tools: OpenDevin's AgentSkills library provides a centralized and extensible collection of reusable tools for common agent tasks. This modular design, coupled with rigorous testing, makes it easier for the community to contribute, maintain, and share specialized tools across different agent implementations.

Limited Multi-Agent Collaboration: OpenDevin incorporates multi-agent delegation, allowing developers to create teams of specialized agents that collaborate to solve complex problems. This enables breaking down tasks into smaller, more manageable sub-tasks, leveraging the strengths of different agent architectures and skillsets.

By addressing these pain points, OpenDevin aims to accelerate research and development of more capable, versatile, and robust AI agents that can effectively interact with and solve real-world problems. Its community-driven approach fosters collaboration and innovation, paving the way for the next generation of general-purpose AI agents.