Eden Marco

@EdenEmarco177

Jun 17, 2023 • 15 tweets • 7 min read • Read on X

1/14🧵Real world CHUNKING best practices thread:
🔍 A common question I get is: "How should I chunk my data and what's the best chunk size?" Here's my opinion based on my experience with @LangChainAI 🦜🔗and building production grade GenAI applications.

2/14 Chunking is the process of splitting long pieces of text into smaller, hopefully semantically meaningful chunks. It's essential when dealing with large text inputs, as LLMs limit the number of tokens they can process at once (4k, 8k, 16k, 100k).

3/14 Eventually, we store all chunks in a vectorstore like @pinecone🌲, run a similarity search over them, and pass the results to the LLM as context.

4/14 This approach, known as in-context learning or RAG (Retrieval-Augmented Generation), helps the language model answer with contextual understanding. 🧩🔎(check my thread on RAG)
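
The retrieve-then-augment flow from 3/14 and 4/14 can be sketched in plain Python. This is a toy illustration, not how @pinecone or @LangChainAI implement it: the `embed` function below is a word-count stand-in for a real embedding model, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Augment the user's question with the retrieved chunks as context."""
    context = "\n---\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


chunks = [
    "Pinecone is a managed vector database.",
    "LangChain text splitters break documents into chunks.",
    "Bananas are rich in potassium.",
]
question = "How do text splitters create chunks?"
prompt = build_prompt(question, chunks, k=1)
```

Only the chunk that shares words with the question ends up in the prompt, which is why chunk quality matters so much for what the LLM gets to see.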

5/14 Ideally, we want to keep semantically related pieces of data together when chunking. In @LangChainAI🦜🔗 , we use TextSplitters for chunking.

6/14 We need to specify to the @LangChainAI TextSplitters how we want to split the text and create the chunks. We can define the chunk size as well as the option for chunk overlap, although personally, I don't often utilize the chunk overlap feature.
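
To make chunk size and chunk overlap concrete, here is a minimal character-based sketch. It is an assumption-laden illustration, not the @LangChainAI TextSplitter algorithm (which splits on separators first); `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


no_overlap = chunk_text("abcdefghij", chunk_size=4)
with_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

With no overlap every character lands in exactly one chunk; with overlap, each chunk repeats the tail of the previous one, which costs storage but can help when a sentence straddles a boundary.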

7/14 The most effective strategy I've found is chunking by the existing document formatting.
If we are chunking Python files and Wikipedia text files, we ought to chunk them differently according to their file type.

8/14 Example: In Python, a good separator for chunking can be '\ndef' to represent a function. It's considered best practice to keep functions short, typically no longer than 20 lines of code (unless, of course, you're a Data Scientist and have a knack for longer functions 😂).

9/14 So here a chunk size of 300 can be a good heuristic IMO.

Remember there is no silver bullet☑️ and you MUST benchmark everything you do to get optimal results.
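
A rough sketch of the '\ndef' idea from 8/14 and 9/14, in plain Python. Assumptions: the size of 300 is counted in characters, pieces are merged greedily, and `split_python_source` is a hypothetical helper rather than a @LangChainAI API.

```python
import re


def split_python_source(source: str, chunk_size: int = 300) -> list[str]:
    """Split before each top-level 'def', then merge small pieces up to chunk_size characters."""
    # Split on the newline that precedes 'def ' so each piece still starts with 'def '.
    pieces = re.split(r"\n(?=def )", source)

    chunks = []
    current = ""
    for piece in pieces:
        if not piece.strip():
            continue
        candidate = f"{current}\n{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A function longer than chunk_size is kept whole rather than cut mid-body.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
per_function = split_python_source(sample, chunk_size=30)
merged = split_python_source(sample, chunk_size=300)
```

With a small chunk size each function becomes its own chunk; with 300 this short file fits in a single chunk. Benchmark both on your own code before settling on a number.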

10/14 An advantage of @LangChainAI 🦜🔗 text splitters is our ability to create dynamically optimized splitters based on needs so we have full flexibility here

11/14 However, imagine having a ready-to-go text splitter specifically tailored to your file extension: .md, .html, or .py files.
(@hwchase17 and @LangChainAI 🦜🔗 team, please consider implementing this!) This could save us lazy devs tons of time with a "best practice" built in.
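
One way such an extension-aware splitter could look, as a plain-Python sketch. Everything here is hypothetical: the separator table, `separators_for`, and `split_on_first_separator` are illustrative choices, not an existing @LangChainAI feature.

```python
from pathlib import Path

# Hypothetical per-format separators, ordered from coarsest to finest.
SEPARATORS_BY_EXTENSION = {
    ".py": ["\nclass ", "\ndef ", "\n\n", "\n"],
    ".md": ["\n## ", "\n### ", "\n\n", "\n"],
    ".html": ["<h1", "<h2", "<p", "<br"],
}
DEFAULT_SEPARATORS = ["\n\n", "\n", " "]


def separators_for(filename: str) -> list[str]:
    """Pick a separator list from the file extension (case-insensitive)."""
    return SEPARATORS_BY_EXTENSION.get(Path(filename).suffix.lower(), DEFAULT_SEPARATORS)


def split_on_first_separator(text: str, filename: str) -> list[str]:
    """Split on the coarsest separator present, keeping it at the start of each piece."""
    for sep in separators_for(filename):
        if sep in text:
            head, *rest = text.split(sep)
            pieces = [head] + [sep.lstrip("\n") + part for part in rest]
            return [p for p in pieces if p.strip()]
    return [text] if text.strip() else []


markdown = "# Title\nintro\n## A\ntext a\n## B\ntext b\n"
sections = split_on_first_separator(markdown, "doc.md")
```

The table is the part you would tune and benchmark per format; the lookup itself stays trivial.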

12/14 Rule of thumb👍: When determining the chunk size --> balance. Size should be small enough to ensure effective processing by the LLM, while also being long enough to provide humans with a clear understanding of the semantic meaning within each chunk.

13/15 For text files I found that 500 works well.
When chunking is done correctly, it greatly improves information retrieval. Remember to consider the type of file you're working with when chunking. Each file format requires a different set of rules for optimal chunking.

14/15 I teach @LangChainAI 🦜🔗 elaborately in my @udemy course with almost 5k students and 630+ reviews
udemy.com/course/langcha…

Twitter-only limited discount:
TWITTER9DCC71C67A9AA

15/15 What are your best @LangChainAI🦜🔗 chunking strategies?

Would love to hear your thoughts😎😎😎
@pinecone 🌲 Would love to hear your take on this as well.

#ENDOFTHREAD🧵🧵🧵
