Each result is categorized by its respective domain.
4/ We use these ratings to filter and re-rank which results show in which bucket.
Categorizing domains is not a perfect science, so we make sure to show results at most one bucket away from the selection on the slider.
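Roughly, that filter can be sketched like this. This is a hypothetical illustration, not Neeva's actual code — the bucket names and the `bias_bucket` field are assumptions:

```python
# Illustrative "at most one bucket away" filter for the slider.
# Bucket names and result fields are assumptions, not Neeva's API.
BUCKETS = ["left", "lean-left", "center", "lean-right", "right"]

def filter_by_slider(results, slider_bucket, max_distance=1):
    """Keep results whose domain's bucket is within `max_distance`
    of the user's slider selection."""
    target = BUCKETS.index(slider_bucket)
    return [
        r for r in results
        if abs(BUCKETS.index(r["bias_bucket"]) - target) <= max_distance
    ]

results = [
    {"url": "a.example", "bias_bucket": "left"},
    {"url": "b.example", "bias_bucket": "center"},
    {"url": "c.example", "bias_bucket": "right"},
]
# a.example and b.example are within one bucket of lean-left
print([r["url"] for r in filter_by_slider(results, "lean-left")])
```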
5/ With this bucketing implementation, we need to ensure that we have domains to serve from all of these perspectives.
We collected a variety of domains from each of the 5 buckets, and pulled the respective sitemaps.
These sitemaps are fed into our crawl pipeline.
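Pulling URLs out of a standard XML sitemap is simple; here's a minimal stdlib-only sketch (the fetch step is stubbed with an inline example document):

```python
# Minimal sketch of extracting URLs from a sitemaps.org-style sitemap,
# the kind of list fed into a crawl pipeline.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str) -> list:
    """Return the <loc> entries from a <urlset> sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://news.example.com/story-1</loc></url>
  <url><loc>https://news.example.com/story-2</loc></url>
</urlset>"""

print(extract_urls(example))
# → ['https://news.example.com/story-1', 'https://news.example.com/story-2']
```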
6/ Previously, crawling and indexing one URL into Neeva’s own index took more than 2 weeks after the URL’s discovery.
Clearly, that's far too stale for serving news.
7/ To serve news pages while they're still fresh, we built a dedicated fresh crawl-indexing pipeline.
Every hour, we crawl and index URLs from a couple of sources, including:
📌 Sitemaps
📌 Twitter feeds
📌 API crawl, etc.
From there we fast-track these pages into our Koala indexing. 🐨
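The hourly loop above can be sketched like this. The source functions and the `fast_track_index` hook are stand-ins, not Neeva internals:

```python
# Hypothetical sketch of the hourly fresh-crawl loop: gather candidate
# URLs from several sources, de-duplicate against what's already been
# seen, and hand the new ones to the fast-track indexer.
def hourly_fresh_crawl(sources, seen, fast_track_index):
    new_urls = []
    for fetch_source in sources:      # e.g. sitemaps, Twitter feeds, API crawl
        for url in fetch_source():
            if url not in seen:
                seen.add(url)
                new_urls.append(url)
    fast_track_index(new_urls)        # push into the (Koala-style) index
    return new_urls

seen = {"https://news.example.com/old"}
sources = [
    lambda: ["https://news.example.com/old", "https://news.example.com/fresh-1"],
    lambda: ["https://news.example.com/fresh-2"],
]
indexed = []
hourly_fresh_crawl(sources, seen, indexed.extend)
print(indexed)  # only the two unseen URLs get fast-tracked
```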
8/ To utilize Bias Buster to its full potential, we implemented triggering logic.
This shows the slider only when there's a variety of results to view across the spectrum.
We found these queries are typically ones with high news intent as well as political intent.
9/ So, we:
1️⃣ Probe the result sets pulled from the buckets to gain intuition on variety
2️⃣ Check political intent & topicality
3️⃣ Check if the query has any identified intents we shouldn't trigger on
10/ Here's an example...
If the query includes a site restrict, we wouldn't want to display Bias Buster, since the user's intent is to see results from that site.
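Putting the three checks together, the decision might look like this. The intent fields and thresholds here are illustrative assumptions, not the production model:

```python
# Hypothetical sketch of the Bias Buster triggering checks.
def should_show_bias_buster(query, buckets_with_results,
                            min_buckets=3, intent_threshold=0.5):
    # 1) Probe the per-bucket result sets: need real variety on the spectrum.
    if len(buckets_with_results) < min_buckets:
        return False
    # 2) Require high news intent and political intent.
    if (query["news_intent"] < intent_threshold
            or query["political_intent"] < intent_threshold):
        return False
    # 3) Suppress on intents we shouldn't trigger on, e.g. a site restrict.
    if query.get("site_restrict"):
        return False
    return True

q = {"news_intent": 0.9, "political_intent": 0.8}
print(should_show_bias_buster(q, {"left", "center", "right"}))   # True
print(should_show_bias_buster({**q, "site_restrict": "nytimes.com"},
                              {"left", "center", "right"}))      # False
```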
11/ Overall, Bias Buster gives an opportunity for our US users to explore different perspectives on the political spectrum when available.
Head over to neeva.com to try it out and let us know what you think!
• • •
We're excited to share we added #NeevaAI:
✅ Answer support for verified health sites, official programming sites, blogs, etc.
✅ Availability in the News tab
🧵
First, at @Neeva we're passionate about generative search engines combining the best of search & AI.
But it's clear generative AI systems have no notion of sources or authority.
Their content is based on their reading of source material, which is often a copy of the entire Web.
On the other hand, search engines care about authority very intimately.
#PageRank (the algorithm that got @Google going) was built around a better authority signal: scoring pages based on the citations they get from other high-scoring pages.
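To make the "citations from other high-scoring pages" idea concrete, here's a toy power-iteration PageRank over a tiny link graph — the textbook algorithm, not Google's production system:

```python
# Toy power-iteration PageRank.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# "c" gathers citations from both "a" and "b", so it scores highest
```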
Have you seen ChatGPT combine info on multiple entities into an answer that’s completely WRONG? 😬
Generative AI and LLMs can mix up names or concepts & confidently regurgitate frankenanswers.
Neeva is solving this problem on our AI-powered search engine.
Here’s how 🧵
FYI: this is a two-part thread series.
Today, with the help of @rahilbathwal, we’ll explain why the problems happen technically.
Tomorrow, we’ll talk through how we’re implementing our solution with our AI/ML team.
Make sure you're following... 👀
In frankenanswers, a generative AI model combines information about multiple possible entities into an answer that’s wrong.
Ex) On this query for `imran ahmed` from our early test builds, you see a mix-up of many intents corresponding to different entities with the same name.👇
2/ First off, we found that there are far fewer resources available for optimizing encoder-decoder models (when compared to encoder models like BERT and decoder models like GPT).
We hope this thread will fill in the void and serve as a good resource. 📂
3/ We started with a flan-T5-large model and fine-tuned it on our dataset. We picked the large variant because we found it generates better summaries with fewer hallucinations and fluency issues.
The problem? The latency is too high for a search product.