All include LangSmith traces to visualize what is going on under the hood (chain information flow + prompts) + most include CoLab notebooks for prototyping.
i co-wrote the Anthropic engineering blog on Claude Managed Agents, and wanted to share some thoughts on agent harnesses + infrastructure for long-horizon tasks ... 🧵 anthropic.com/engineering/ma…
1/ Claude can perform tasks over increasingly long time-horizons. newer models (Mythos) will push this further. with Claude Managed Agents, we needed to design infrastructure that can support long-horizon work.
2/ we also needed a way to update the agent harness: each harness encodes assumptions about what Claude can't do. those assumptions go stale as Claude gets more capable, and can even hobble the model.
Gave this short talk on RAG vs long context LLMs at a few meetups recently. Tries to pull together threads from a few recent projects + papers I really like.
Just put on YT, a few highlights w papers below ...
1/ Can long context LLMs retrieve & reason over multiple facts as a RAG system does? @GregKamradt and I dug into this w/ multi-needle-in-a-haystack on GPT4. Retrieval is not guaranteed: worse for more needles, worse at doc start, worse w/ reasoning.
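The multi-needle setup can be sketched as a small harness that plants facts at chosen depths in a long filler context before querying the model. This is a minimal illustration, not the original benchmark code; `insert_needles` and the needle/haystack strings are hypothetical.

```python
def insert_needles(haystack: str, needles: list[str], depths: list[float]) -> str:
    """Insert each needle at a fractional depth of the haystack
    (0.0 = start of context, 1.0 = end). Hypothetical helper."""
    text = haystack
    # Insert from deepest to shallowest so earlier offsets stay valid.
    for needle, depth in sorted(zip(needles, depths), key=lambda p: -p[1]):
        pos = int(len(text) * depth)
        text = text[:pos] + " " + needle + " " + text[pos:]
    return text

haystack = "Filler sentence about nothing in particular. " * 200
needles = [
    "The secret ingredient for pizza is figs.",
    "The secret ingredient for pasta is basil.",
    "The secret ingredient for salad is pecans.",
]
# Spread needles across the context; the thread's finding is that
# retrieval degrades with more needles and near the document start.
context = insert_needles(haystack, needles, depths=[0.1, 0.5, 0.9])
```

The resulting `context` would then be passed to the LLM with a question that requires retrieving all three facts.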
2/ Nice paper (@adamlerer & @alex_peys) suggests this may be due to recency bias from training: recent tokens are typically most informative for predicting the next one. Not good for context augmented generation. arxiv.org/pdf/2310.01427…
Considering LLM fine-tuning? Here are two new CoLab guides for fine-tuning GPT-3.5 & LLaMA2 on your data using LangSmith for dataset management and eval. We also share our lessons learned in a blog post here:
... 1/ When to fine-tune? Fine-tuning is not advised for teaching an LLM new knowledge (see references from @OpenAI and others in our blog post). It's best for tasks (e.g., extraction) focused on "form, not facts": anyscale.com/blog/fine-tuni…
... 2/ With this in mind, we fine-tuned LLaMA-7b-chat & GPT-3.5-turbo for knowledge graph triple extraction (see details in blog post and CoLab). Notebooks here:
LLaMA CoLab:
GPT-3.5-turbo CoLab:
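For the GPT-3.5-turbo case, each training example for an extraction task like this is typically a chat transcript in JSONL form. A minimal sketch of one record, assuming the triple format is (subject, relation, object); the exact system prompt and output schema used in the notebooks may differ.

```python
import json

# Hypothetical system prompt; the notebooks may use a different one.
system = "Extract knowledge graph triples as (subject, relation, object) tuples."

example = {
    "messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": "Ada Lovelace worked with Charles Babbage."},
        {
            "role": "assistant",
            # The target output the model is fine-tuned to produce.
            "content": json.dumps(
                [["Ada Lovelace", "worked with", "Charles Babbage"]]
            ),
        },
    ]
}

# Chat fine-tuning expects one JSON object per line (JSONL).
line = json.dumps(example)
record = json.loads(line)
```

A dataset is just many such lines, one per labeled extraction example.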
LLMs excel at code analysis / completion (e.g., Copilot, Code Interpreter, etc). Part 6 of our initiative to improve @LangChainAI docs covers code analysis, building on contributions from @cristobal_dev + others:
python.langchain.com/docs/use_cases…
1/ Copilot and related tools (e.g., @codeiumdev) have dramatically accelerated dev productivity and shown that LLMs excel at code understanding / completion
2/ But, RAG for QA/chat on codebases is challenging b/c text splitters may break up elements (e.g., fxns, classes) and fail to preserve context about which element each code chunk comes from.
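One way around this is syntax-aware splitting: chunk at top-level function/class boundaries so each chunk is a complete element, tagged with its name as retrieval metadata. A minimal stdlib sketch for Python source; production splitters (e.g., LangChain's language-aware text splitters) handle more languages and edge cases.

```python
import ast

def split_python_source(source: str) -> list[dict]:
    """Split Python source at top-level function/class boundaries,
    keeping each element whole and recording which element the
    chunk came from. Illustrative sketch, not the docs' implementation."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based, so slice accordingly.
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({"name": node.name, "text": text})
    return chunks

code = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
chunks = split_python_source(code)
```

Each chunk's `name` can be stored as metadata so retrieved chunks preserve context about their source element.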
LLMs unlock a natural language interface with structured data. Part 4 of our initiative to improve @LangChainAI docs shows how to use LLMs to write / execute SQL queries w/ chains and agents. Thanks @manuelsoria_ for work on the docs:
python.langchain.com/docs/use_cases…
1/ Text-to-SQL is an excellent LLM use-case: many ppl can describe what they want in natural language, but have difficulty mapping that to a specific SQL query. LLMs can bridge this gap, e.g., see:
arxiv.org/pdf/2204.00498…
2/ create_sql_query_chain() maps from natural language to a SQL query: pass the question and the database into the chain, and get SQL out. Run the query on the database easily:
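The flow can be sketched end-to-end with the stdlib: a stand-in function plays the role of the LLM call inside the chain (the real chain prompts the model with the schema and question), and sqlite3 executes the generated query. `fake_llm`, the table, and the canned query are all hypothetical.

```python
import sqlite3

def fake_llm(question: str, schema: str) -> str:
    """Stand-in for the model call inside a text-to-SQL chain.
    A real chain prompts an LLM with the schema + question; here we
    return a canned query purely for illustration."""
    return "SELECT COUNT(*) FROM employees WHERE department = 'Sales'"

# Toy database standing in for the user's real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", "Sales"), ("Bob", "Sales"), ("Cy", "Eng")],
)

schema = "employees(name TEXT, department TEXT)"
sql = fake_llm("How many people work in Sales?", schema)
result = conn.execute(sql).fetchone()[0]  # execute the generated SQL
```

The chain's job is exactly the `fake_llm` step: question + schema in, executable SQL out.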
Getting structured LLM output is hard! Part 3 of our initiative to improve @LangChainAI docs covers this w/ functions and parsers (see @GoogleColab ntbk). Thanks to @fpingham for improving the docs on this:
2/ Functions (e.g., using OpenAI models) have been a great way to tackle this problem, as shown by the work of @jxnlco and others. LLM calls a function and returns output that follows a specified schema. wandb.ai/jxnlco/functio…
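The pattern can be sketched with the stdlib: declare a JSON schema for the "function", have the model return its arguments as a JSON string, then parse and check them against the schema. The `record_person` schema and the simulated response are hypothetical; real function-calling APIs return the arguments string in the model response.

```python
import json

# Hypothetical function schema in the style of function calling:
# asking the model to "call" this function constrains its output
# to arguments matching the declared JSON schema.
schema = {
    "name": "record_person",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
    },
}

# Simulated model output: function-calling APIs return the
# arguments as a JSON string for the caller to parse.
raw_arguments = '{"name": "Ada", "age": 36}'
args = json.loads(raw_arguments)

# Minimal validation: check that all required keys are present.
required = schema["parameters"]["required"]
missing = [k for k in required if k not in args]
```

Output parsers play the same role when functions aren't available: coerce free-form model text into a typed, validated object.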