Lots of chatter recently about ChatGPT's fan-out queries getting much longer (5->15 words on average).
Here's what I think is going on, and why it makes sense.
If you've built any kind of semantic search, you've probably hit this annoying problem:
Search "lawyer fees" -> get the about us page
Search "lawyer credentials" -> same about us page
Search "lawyer experience" -> go away about us page
One chunk to rule them all, one chunk to find them, one chunk to bring them all, and in the darkness bind them...
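Here's a tiny repro of the collision. To be clear, this is my own toy setup (sentence-transformers, an off-the-shelf model, made-up chunks), not ChatGPT's actual pipeline:

```python
# Toy repro of the "one chunk wins everything" problem.
# Assumptions (mine, not from the thread): sentence-transformers with an
# off-the-shelf model, and invented stand-ins for a law firm's chunks.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts):
    # L2-normalised vectors, so a dot product is cosine similarity
    return model.encode(texts, normalize_embeddings=True)

chunks = [
    "About us: our lawyers, their fees, credentials and experience.",  # generic
    "Fee guide: hourly rates, retainers, and how billing works.",
    "Credentials: bar admissions, licensing and certifications.",
]
chunk_vecs = embed(chunks)

for query in ["lawyer fees", "lawyer credentials", "lawyer experience"]:
    best = int(np.argmax(chunk_vecs @ embed([query])[0]))
    print(f"{query!r} -> {chunks[best][:20]}...")
# The generic about-us chunk tends to win all three.
```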
Short queries split their "weight" evenly across words.
"lawyer credentials" -> embedding is roughly half lawyer, half credentials.
The "lawyer" bit is dominating.
So what's the solution?
Add synonyms for the angle, but keep the topic mentioned once (or twice). It's still there, but it's diluted, and the embedding is pulled towards the intent.
For example, "lawyer qualifications credentials licensing certifications experience".
Now the embedding is weighted roughly 20% towards "lawyer" and 80% towards the credentials cluster.
The vector points more specifically at what you actually want.
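A quick way to sanity-check the steering (again my toy setup, not their implementation): score the short and expanded queries against the bare topic vs an intent cluster.

```python
# Does the synonym cloud steer the query vector towards the intent?
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts):
    return model.encode(texts, normalize_embeddings=True)

anchors = embed([
    "lawyer",                                               # bare topic
    "qualifications credentials licensing certifications",  # intent cluster
])

short = "lawyer credentials"
expanded = "lawyer qualifications credentials licensing certifications experience"

for q in [short, expanded]:
    topic_sim, intent_sim = anchors @ embed([q])[0]
    print(f"{q!r}\n  topic: {topic_sim:.3f}  intent: {intent_sim:.3f}")
# Expectation: the expanded query scores relatively higher on the intent anchor.
```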
This isn't new. Query expansion is a well-known technique.
But what's interesting is how they're doing it. Not by adding more topic keywords, but by using synonym clouds to steer the embedding direction while staying anchored to the subject.
The result: different intents/angles now surface different content, content that better matches the actual intent.
Generic pages (or chunks within them) that mention everything score OK.
Specialized content that goes deep on one angle actually wins for that angle.
Better diversity, and more useful retrieval for RAG.
In summary: it's not random. And it's not keyword stuffing.
The fact that the queries expanded quite a bit a few weeks back suggests to me they might be hitting an internal cache/index more frequently.
Will be explaining a bit more in a blog post soon.
Feel free to disagree with me of course :)