Lance Martin

@RLanceMartin

May 30, 2023 • 7 tweets • 7 min read

There are a lot of questions about smaller, open source LLMs vs larger, closed models for tasks like question answering. So, we added @MosaicML MPT-7B & @lmsysorg Vicuna-13b to the @LangChainAI auto-evaluator. You can test them on your own Q+A use-case ... autoevaluator.langchain.com/playground

... great pod w/ @jefrankle @swyx @abhi_venigalla on MPT-7B, so I used the auto-evaluator to benchmark it on a test set of 5 Q+A pairs from the GPT3 paper. Results are close to larger models and it's very fast (kudos to @MosaicML inference team!) ...

https://twitter.com/karpathy/status/1660824101412548609?s=20

... @sfgunslinger also deployed Vicuna-13b on @replicatehq and it achieves performance parity w/ the larger models on this test set. Prompt eng may further improve this (v helpful discussion w/ the folks at @replicatehq / @JoeEHoover); we are looking into improving latency ...

... @AnthropicAI Claude-100k (discussed in more depth below) is very impressive for a "retriever-free" approach (pass the full 75 pg GPT3 pdf into the model's context window!), but it comes at the cost of higher latency ...

https://twitter.com/RLanceMartin/status/1658499575626465283?s=20

... @swyx @transitive_bs and @simonw make the interesting point that smaller (open source) models as "reasoning engines" coupled to retrievers (to fetch relevant data) may be sufficient for many apps (Q+A, agents, etc). It's def worth testing them.

https://twitter.com/swyx/status/1654214894923960320?s=20

... all auto-evaluator code is open source here:
github.com/langchain-ai/a…

... and on app usage: in "Playground" simply upload any PDF doc. The app will use an LLM to auto-generate a Q+A eval set, run it on a user-selected chain (model, retriever, etc) built w/ @LangChainAI, use GPT4 to grade the answers, and store each expt for you.
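
The run-and-grade loop described above can be sketched in a few lines. This is only an illustration, not the auto-evaluator's actual code: `QAPair`, `GradedResult`, `run_experiment`, `toy_chain` and `toy_grader` are hypothetical names, the toy chain returns canned answers instead of calling a model, and the toy grader does a substring match where the app uses an LLM grader.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class GradedResult:
    question: str
    expected: str
    predicted: str
    correct: bool


def run_experiment(
    eval_set: List[QAPair],
    chain: Callable[[str], str],
    grader: Callable[[str, str, str], bool],
) -> List[GradedResult]:
    """Run every question through the chain and grade each prediction."""
    results = []
    for pair in eval_set:
        predicted = chain(pair.question)
        correct = grader(pair.question, pair.answer, predicted)
        results.append(GradedResult(pair.question, pair.answer, predicted, correct))
    return results


def accuracy(results: List[GradedResult]) -> float:
    """Fraction of predictions the grader marked correct (0.0 if empty)."""
    if not results:
        return 0.0
    return sum(r.correct for r in results) / len(results)


# Toy stand-ins (hypothetical): a real run would call an LLM in both places.
def toy_chain(question: str) -> str:
    canned = {"What color is the sky?": "The sky is blue."}
    return canned.get(question, "I don't know.")


def toy_grader(question: str, expected: str, predicted: str) -> bool:
    # Substring match stands in for an LLM-based grader.
    return expected.lower() in predicted.lower()


eval_set = [
    QAPair("What color is the sky?", "blue"),
    QAPair("What color is grass?", "green"),
]
experiments = []  # in-memory stand-in for "store each expt"
results = run_experiment(eval_set, toy_chain, toy_grader)
experiments.append({"chain": "toy_chain", "accuracy": accuracy(results)})
print(experiments[-1])
```

Because the chain and the grader are passed in as plain callables, swapping one model or retriever for another only changes the `chain` argument, which is what makes side-by-side comparisons on the same eval set cheap.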

More from @RLanceMartin

Lance Martin

@RLanceMartin

May 11

self-verification (Outcomes) + self-learning (Dreaming) are two of the most interesting new features we shared at Code With Claude last week.

a few notes + video links to the talks ...

1/ Outcomes in Claude Managed Agents is just a "Ralph loop" to verify output vs a user provided rubric. it uses a grader sub-agent for the verification. some interesting points on the benefits of an isolated verifier here:
anthropic.com/engineering/ha…

2/ @jess__yan + i showed a toy example of Outcomes: i had a Managed Agent make a generative UI w/ metrics (charts, graphs) rendered as svg. i used an Outcomes loop to improve the render timing - Claude figured out various tricks (prompting, etc) to speed it up.

Read 7 tweets

Lance Martin

@RLanceMartin

Apr 10

i co-wrote the Anthropic engineering blog on Claude Managed Agents, and wanted to share some thoughts on agent harnesses + infrastructure for long-horizon tasks ... 🧵
anthropic.com/engineering/ma…

1/ Claude can perform tasks over increasingly long time-horizons. newer models (Mythos) will push this further. with Claude Managed Agents, we needed to design infrastructure that can support long-horizon work.

2/ we also needed a way to update the agent harness: each harness encodes assumptions about what Claude can't do. those assumptions go stale as Claude gets more capable, and can even hobble the model.

Read 11 tweets

Lance Martin

@RLanceMartin

Mar 22, 2024

Gave this short talk on RAG vs long context LLMs at a few meetups recently. Tries to pull together threads from a few recent projects + papers I really like.

Just put on YT, a few highlights w papers below ...

1/ Can long context LLMs retrieve & reason over multiple facts as a RAG system does? @GregKamradt and I dug into this w/ multi-needle-in-a-haystack on GPT4. Retrieval is not guaranteed: worse for more needles, worse at doc start, worse w/ reasoning.
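
A minimal sketch of how a multi-needle test context could be assembled and scored. This is not the benchmark's actual code: `insert_needles` and `retrieved_fraction` are hypothetical helpers, the filler sentences and needles are made up, and the model response is a hard-coded string rather than a real LLM call.

```python
from typing import List


def insert_needles(
    haystack: List[str], needles: List[str], depths: List[float]
) -> List[str]:
    """Insert each needle at a fractional depth (0.0 = start, 1.0 = end)."""
    out = list(haystack)
    # Insert deepest first so earlier (shallower) positions do not shift.
    for depth, needle in sorted(zip(depths, needles), reverse=True):
        out.insert(int(depth * len(haystack)), needle)
    return out


def retrieved_fraction(response: str, keywords: List[str]) -> float:
    """Share of needle keywords that show up in the model's response."""
    if not keywords:
        return 0.0
    found = sum(1 for k in keywords if k.lower() in response.lower())
    return found / len(keywords)


haystack = [f"Filler sentence {i}." for i in range(10)]
needles = [
    "The secret ingredient is figs.",
    "The secret ingredient is goat cheese.",
]
context = insert_needles(haystack, needles, depths=[0.0, 0.5])

# Hard-coded stand-in for an LLM answer that recalled only one needle.
response = "The secret ingredients are figs and prosciutto."
print(retrieved_fraction(response, ["figs", "goat cheese"]))
```

Sweeping the `depths` and the number of needles while keeping the filler fixed is what lets retrieval be compared across positions in the context.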

2/ Nice paper (@adamlerer & @alex_peys) suggests this may be due to recency bias from training: recent tokens are typically most informative for predicting the next one. Not good for context augmented generation.
arxiv.org/pdf/2310.01427…

Read 11 tweets

Lance Martin

@RLanceMartin

Aug 25, 2023

Check out these new guides for 13 popular LLM use-cases. Part of a major community effort to improve the @LangChainAI docs + add CoLabs prototyping.

1/13: Open source LLMs
How to use many open source LLMs on your device
python.langchain.com/docs/guides/lo…

2/13: Agents
How to quickly test various types of agents
python.langchain.com/docs/use_cases…

3/13: RAG (retrieval augmented generation)
How to do RAG at multiple levels of abstraction
python.langchain.com/docs/use_cases…

Read 14 tweets

Lance Martin

@RLanceMartin

Aug 23, 2023

GPT-3.5 and LLaMA2 fine-tuning guides 🪄

Considering LLM fine-tuning? Here are two new CoLab guides for fine-tuning GPT-3.5 & LLaMA2 on your data using LangSmith for dataset management and eval. We also share our lessons learned in a blog post here:

blog.langchain.dev/using-langsmit…

... 1/ When to fine-tune? Fine-tuning is not advised for teaching an LLM new knowledge (see references from @OpenAI and others in our blog post). It's best for tasks (e.g., extraction) focused on "form, not facts":
anyscale.com/blog/fine-tuni…

https://twitter.com/RLanceMartin/status/1691880034058064365?s=20

... 2/ With this in mind, we fine-tuned LLaMA-7b-chat & GPT-3.5-turbo for knowledge graph triple extraction (see details in blog post and CoLab). Notebooks here:
LLaMA CoLab:
GPT-3.5-turbo CoLab:

colab.research.google.com/drive/1tpywvzw…
colab.research.google.com/drive/1YCyDHPS…

Read 9 tweets

Lance Martin

@RLanceMartin

Aug 12, 2023

Code understanding 🖥️🧠

LLMs excel at code analysis / completion (e.g., Co-Pilot, Code Interpreter, etc). Part 6 of our initiative to improve @LangChainAI docs covers code analysis, building on contributions of @cristobal_dev + others:
https://t.co/2DsxdjbYey
python.langchain.com/docs/use_cases…

https://twitter.com/karpathy/status/1608895189078380544?s=20

1/ Copilot and related tools (e.g., @codeiumdev) have dramatically accelerated dev productivity and shown that LLMs excel at code understanding / completion

https://twitter.com/cristobal_dev/status/1675745319659659270?s=20

2/ But, RAG for QA/chat on codebases is challenging b/c text splitters may break up elements (e.g., fxns, classes) and fail to preserve context about which element each code chunk comes from.
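
One way to avoid cutting through functions and classes is to split on syntax boundaries and carry the element name along as metadata. The sketch below does this for Python source with the standard-library `ast` module; `split_python_source` is a hypothetical helper written for illustration, not the splitter used in the LangChain docs, and it only looks at top-level definitions.

```python
import ast
from typing import Dict, List


def split_python_source(source: str) -> List[Dict[str, str]]:
    """Return one chunk per top-level function or class, with its name."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append(
                {
                    "name": node.name,
                    "kind": kind,
                    # Exact source text of the element (Python 3.8+).
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks


sample = '''
def add(a, b):
    return a + b


class Greeter:
    def hello(self):
        return "hi"
'''

chunks = split_python_source(sample)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Each chunk is a whole element and knows which function or class it came from, so a retriever can return that context along with the code.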

Read 6 tweets
