David McSweeney
Dec 10, 7 tweets, 2 min read
Lots of chatter recently about ChatGPT's fan-out queries getting much longer (5->15 words on average).

Here's what I think is going on, and why it makes sense.
If you've built any kind of semantic search, you've probably hit this annoying problem:

Search "lawyer fees" -> get the about us page
Search "lawyer credentials" -> same about us page
Search "lawyer experience" -> go away about us page

One chunk to rule them all, one chunk to find them, one chunk to bring them all, and in the darkness bind them...
Short queries split their "weight" evenly across words.

"lawyer credentials" -> embedding is roughly half lawyer, half credentials.

And the "lawyer" half dominates: it matches the generic about us page strongly, whichever angle word sits next to it.
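Here's a toy sketch of that collapse. Everything in it is an assumption made for illustration: the one-hot "concept axes", the mean pooling in `embed`, and the invented chunks written as bags of words. Real embedding models are contextual and far messier, so treat this as a cartoon of the effect, not a description of any production system.

```python
import math

# Hypothetical toy model: every known word is a one-hot vector on a concept axis.
AXES = ["topic", "fees", "credentials", "experience"]
WORD_AXIS = {"lawyer": 0, "fees": 1, "credentials": 2, "experience": 3}

def embed(text):
    """Mean-pool the one-hot vectors of the known words in `text`."""
    hits = [WORD_AXIS[w] for w in text.lower().split() if w in WORD_AXIS]
    vec = [0.0] * len(AXES)
    for axis in hits:
        vec[axis] += 1.0 / len(hits)
    return vec

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Invented chunks. The about page mentions everything but is dominated by the
# topic word; the other pages go deep on a single angle.
CHUNKS = {
    "about_us": "lawyer lawyer lawyer fees credentials experience",
    "fees_page": "fees fees fees",
    "credentials_page": "credentials credentials credentials",
    "experience_page": "experience experience experience",
}

def top_chunk(query):
    """Name of the chunk with the highest cosine similarity to the query."""
    q = embed(query)
    return max(CHUNKS, key=lambda name: cosine(q, embed(CHUNKS[name])))

for query in ("lawyer fees", "lawyer credentials", "lawyer experience"):
    print(query, "->", top_chunk(query))
```

In this toy setup all three queries come back with `about_us`: the topic half of each two-word query lines up with the topic-heavy about page, and that outweighs the single angle word.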

So what's the solution?
Add synonyms for the angle, but keep the topic mentioned once (or twice). It's still there, but it's diluted, and the embedding is pulled towards the intent.

For example, "lawyer qualifications credentials licensing certifications experience".

Now the embedding is weighted roughly 17% towards "lawyer" (one word in six) and 83% towards the credentials cluster.

The vector points more specifically at what you actually want.
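A back-of-the-envelope version of that split, assuming every word gets an equal share of the embedding (a simplification, since real models weight tokens by context). `topic_share` is a hypothetical helper written for this example.

```python
def topic_share(query, topic_words=frozenset({"lawyer"})):
    """Fraction of the query's words that are topic words (even-split assumption)."""
    words = query.lower().split()
    return sum(w in topic_words for w in words) / len(words)

short = "lawyer credentials"
expanded = "lawyer qualifications credentials licensing certifications experience"

for q in (short, expanded):
    share = topic_share(q)
    print(f"{share:.0%} topic, {1 - share:.0%} angle: {q}")
```

The two-word query comes out at 50% topic, and the six-word query at one sixth, which lands close to the rough split quoted above.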
This isn't new. Query expansion is a well-known technique.

But what's interesting is how they're doing it. Not by adding more topic keywords, but by using synonym clouds to steer the embedding direction while staying anchored to the subject.
The result: different intents/angles now surface different content. Content which better matches the actual intent.

Generic pages (or chunks within them) that mention everything score OK.

Specialized content that goes deep on one angle actually wins for that angle.

Better diversity, and more useful retrieval for RAG.
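To make the ranking shift concrete, here's a small sketch with hand-picked vectors. The three concept axes, the chunk vectors, and the query vectors are all invented for illustration. They're chosen to mimic a topic-heavy generic page and two specialised pages, and they say nothing about how ChatGPT's retrieval is actually built.

```python
import math

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Invented concept axes: (topic, fees, credentials). All numbers are illustrative.
CHUNKS = {
    "generic_about_page": (3.0, 1.0, 1.0),     # mentions everything, topic-heavy
    "credentials_deep_dive": (0.5, 0.0, 5.0),  # goes deep on one angle
    "fees_deep_dive": (0.5, 5.0, 0.0),
}

QUERIES = {
    "short": (1.0, 0.0, 1.0),     # "lawyer credentials"
    "expanded": (1.0, 0.0, 5.0),  # one topic word, five credentials-cluster words
}

def rank(query_vec):
    """Chunks as (score, name) pairs, best match first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in CHUNKS.items()]
    return sorted(scored, reverse=True)

for label, vec in QUERIES.items():
    print(label, [(name, round(score, 2)) for score, name in rank(vec)])
```

With these numbers the short query puts the generic page first, while the expanded query puts the credentials deep dive first and leaves the generic page in second place with a middling score.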
In summary: it's not random. And it's not keyword stuffing.

The fact the queries expanded quite a bit a few weeks back suggests to me they might be hitting an internal cache/index more frequently.

Will be explaining a bit more in a blog post soon.

Feel free to disagree with me of course :)
