Rohan Paul · May 28, 2022
Kullback-Leibler (KL) Divergence - A Thread

It measures how one probability distribution diverges from a second, expected probability distribution.

#DataScience #Statistics #DeepLearning #ComputerVision #100DaysOfMLCode #Python #programming #ArtificialIntelligence #Data
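As a quick worked example, the discrete KL divergence is KL(P || Q) = Σᵢ P(i) log(P(i)/Q(i)). A minimal NumPy sketch with illustrative distributions:

```python
import numpy as np

# Observed distribution P and reference ("expected") distribution Q
# over the same 3-outcome support. Values are illustrative.
p = np.array([0.36, 0.48, 0.16])
q = np.array([1/3, 1/3, 1/3])

# KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(f"KL(P||Q) = {kl_pq:.4f} nats")  # ~0.0853
print(f"KL(Q||P) = {kl_qp:.4f} nats")  # ~0.0975
```

Note the asymmetry: KL(P || Q) ≠ KL(Q || P), which is why KL divergence is not a true distance metric.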
KL Divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in data. The most important metric in information theory is called Entropy.

#DataScience #Statistics #DeepLearning #ComputerVision #100DaysOfMLCode
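For a discrete distribution, entropy is H(P) = -Σᵢ P(i) log₂ P(i), the average number of bits needed to encode an outcome. A minimal NumPy sketch with illustrative values:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])      # illustrative distribution
entropy = -np.sum(p * np.log2(p))    # H(P) in bits
print(f"H(P) = {entropy:.2f} bits")  # 1.50 bits
```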


More from @rohanpaul_ai

Jan 26
DeepSeek R1 running locally - Full setup guide

The model is DeepSeek R1 Distill Qwen 7B
Jan 24
One prompt. Structured data. From any website.

And this is with @firecrawl_dev Extract, the new feature they just launched. I'm finding it incredibly helpful in my daily work.

🧵1/n

It reimagines web scraping. Using natural language, you can now extract data from single pages, entire domains (with wildcards), and even JavaScript-heavy sites – all without scripting.

Open beta is live, and it's one of the greatest simplifications of the web-scraping job.

No more fighting with selectors and XPath queries. Firecrawl Extract uses the power of LLMs to understand your data needs and intelligently pull information from the web, turning messy HTML into clean, structured data ready for your applications.

Imagine telling a tool, "Extract the product name, price, and customer reviews from this page," and having it deliver exactly that – in a neat, structured format like JSON.

What Makes Extract so Powerful?

It's a smart data extraction engine.

- Adaptable to Website Changes: Websites are constantly evolving. Traditional scripts break when layouts change. Extract is designed to be more resilient, adapting to minor website tweaks without needing constant script rewrites.

- Scalable Data Collection: Extract isn't limited to single pages. You can target multiple URLs, entire domains using wildcards, and even leverage web search to enrich your data.

- Seamless Integration: It offers:
→ Zapier Integration: Connect Extract to thousands of apps for automated workflows, data enrichment, and pushing data into your favorite CRMs or spreadsheets – all without writing a single line of code.
→ Python and Node.js SDKs: For developers who want more control, SDKs provide easy integration into existing projects.

- Handles Dynamic Content: Websites are increasingly dynamic, relying heavily on JavaScript. Extract leverages Firecrawl's robust `/scrape` endpoint to render JavaScript-heavy pages, ensuring you capture data even from complex modern websites.

- Extract can be used to efficiently gather datasets from the web for LLM training, handling multilingual sites and dynamic content like prices and inventory.
🧵 2/n

This example uses DeepSeek R1 as a web crawler with @firecrawl_dev's /extract.
Watch R1 select URLs and filter results while /extract scans the websites for structured data.
🧵 3/n

Check out more details here:
firecrawl.dev/extract

Basic Data Extraction from a Single URL

Imagine you want to extract key information from the Firecrawl homepage. You could ask Extract to find the company mission, whether they support SSO, are open source, and if they are part of Y Combinator.

You can define your request using a simple schema or just a natural language prompt. Let's look at an example response structure:

```json
{
  "company_mission": "...",
  "supports_sso": false,
  "is_open_source": true,
  "is_in_yc": true
}
```

Using the Firecrawl SDK (Python example):

This simple code snippet sends a request to Firecrawl Extract with the target URL and your desired data points described in the `schema`. The response will contain the structured data as JSON, just like the example shown above.
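A minimal sketch of such a request, assuming the Python SDK's `extract` method accepts a list of URLs plus a prompt and JSON schema (the exact method signature may differ across SDK versions):

```python
from firecrawl import FirecrawlApp  # pip install firecrawl-py

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Describe the data points you want; a JSON schema pins down the output shape.
result = app.extract(
    ["https://firecrawl.dev/"],
    {
        "prompt": "Extract the company mission and whether the product "
                  "supports SSO, is open source, and is part of Y Combinator.",
        "schema": {
            "type": "object",
            "properties": {
                "company_mission": {"type": "string"},
                "supports_sso": {"type": "boolean"},
                "is_open_source": {"type": "boolean"},
                "is_in_yc": {"type": "boolean"},
            },
            "required": ["company_mission"],
        },
    },
)
print(result)  # structured JSON matching the schema above
```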
Jan 17
Your brain's next 5 seconds, predicted by AI

Transformer predicts brain activity patterns 5 seconds into the future using just 21 seconds of fMRI data

Achieves 0.997 correlation using a modified time-series Transformer architecture

-----

🧠 Original Problem:

Predicting future brain states from fMRI data remains challenging, especially for patients who can't undergo long scanning sessions. Current methods require extensive scan times and lack accuracy in short-term predictions.

-----

🔬 Solution in this Paper:

→ The paper introduces a modified time-series Transformer with 4 encoder and 4 decoder layers, each containing 8 attention heads

→ The model takes a 30-timepoint window covering 379 brain regions as input and predicts the next brain state

→ Training uses Human Connectome Project data from 1003 healthy adults, with preprocessing including spatial smoothing and bandpass filtering

→ Unlike traditional approaches, this model omits look-ahead masking, simplifying prediction for single future timepoints (see the sketch below)
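A minimal PyTorch sketch of an architecture matching those numbers (4 encoder / 4 decoder layers, 8 heads, 379 regions, 30-timepoint windows). The model width, input/output projections, and the choice of decoder query are my assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class BrainStatePredictor(nn.Module):
    def __init__(self, n_regions=379, d_model=512, nhead=8, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_regions, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, n_regions)

    def forward(self, window):
        # window: (batch, 30, 379) -- 30 fMRI timepoints over 379 regions
        x = self.in_proj(window)
        tgt = x[:, -1:, :]  # query with the latest timepoint (assumption)
        # No look-ahead (tgt) mask: only a single future timepoint is predicted
        h = self.transformer(src=x, tgt=tgt)
        return self.out_proj(h)  # (batch, 1, 379): predicted next brain state

model = BrainStatePredictor()
next_state = model(torch.randn(2, 30, 379))  # -> torch.Size([2, 1, 379])
```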

-----

🎯 Key Insights:

→ Temporal dependencies in brain states can be effectively captured using self-attention mechanisms

→ Short input sequences (21.6s) suffice for accurate predictions

→ Error accumulation follows a Markov chain pattern in longer predictions (see the rollout sketch below)

→ The model preserves functional connectivity patterns matching known brain organization
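That Markov-like error growth falls out of the autoregressive rollout: each prediction is appended to the input window and fed back in, so errors compound step by step. A sketch reusing the model above (the exact feedback scheme is my assumption):

```python
import torch

def rollout(model, window, n_steps=7):
    # window: (batch, 30, n_regions). Slide it forward one predicted step at a time.
    preds = []
    with torch.no_grad():
        for _ in range(n_steps):
            nxt = model(window)                              # (batch, 1, n_regions)
            preds.append(nxt)
            window = torch.cat([window[:, 1:], nxt], dim=1)  # drop oldest, append prediction
    return torch.cat(preds, dim=1)                           # (batch, n_steps, n_regions)
```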

-----

📊 Results:

→ Single timepoint prediction achieves MSE of 0.0013

→ Accurate predictions up to 5.04 seconds with correlation >0.85

→ First 7 predicted timepoints maintain high accuracy

→ Outperforms BrainLM with 20-timepoint MSE of 0.26 vs 0.568
Paper Title: "Predicting Human Brain States with Transformer"

I generated a podcast on this paper with Google's Illuminate (below).
Dec 23, 2024
Most valuable data exists in PDFs, images, and other formats LLMs can't directly process, creating a critical barrier to AI adoption across industries.

And converting documents into LLM-compatible formats requires complex technical pipelines, while existing vision models often deliver subpar reasoning capabilities.

To solve this problem, @FireworksAI_HQ just released Document Inlining.

🎖️Result - OSS models with Document Inlining achieve a 68% win rate against GPT-4o at document processing.

A Thread🧵(1/n)

Document Inlining turns ANY LLM into a vision model to excel at processing documents, providing

- Higher quality - Better reasoning by feeding text into text models.

- Input flexibility - Automatically handles rich document structure like tables/charts and takes PDFs and multiple images as inputs

- Ultra-simple usage - Works through a 1-line edit to their OpenAI-compatible API

- Model flexibility - Use any LLM, including fine-tuned and specialized models
🧵(2/n)

Read more or get started in their UI playground now!

The API is fully OpenAI-compatible. Enable this capability by editing 1 line to specify "#transform=inline" alongside your file. fireworks.ai/blog/document-…
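A sketch of what that 1-line edit looks like through the OpenAI-compatible API; the model name and document URL are placeholders, and the key part is the `#transform=inline` fragment appended to the file URL:

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",  # placeholder model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the key findings in this report."},
            # The 1-line edit: append #transform=inline to the document URL
            {"type": "image_url",
             "image_url": {"url": "https://example.com/report.pdf#transform=inline"}},
        ],
    }],
)
print(response.choices[0].message.content)
```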
🧵(3/n)
Most of the world’s data isn't LLM-friendly. Data like medical records exist in images, PDFs and other formats that aren’t easily ingested by LLMs.

Organizations need to process documents with AI but face a choice between building complex pipelines or using limited vision models.

Document Inlining automatically connects any LLM to their proprietary parsing service to effortlessly provide improved reasoning and handle complex file types.
Dec 12, 2024
Synthetic data and iterative self-improvement is all you need.

No humans needed in the evaluation loop.

This paper introduces a self-improving evaluator that learns to assess LLM outputs without human feedback, using synthetic data and iterative self-training to match top human-supervised models.

-----

Original Problem 🤔:

Building strong LLM evaluators typically requires extensive human preference data, which is costly and becomes outdated as models improve. Current approaches rely heavily on human annotations, limiting scalability and adaptability.

-----

Solution in this Paper 🔧:

→ The method starts with unlabeled instructions and uses a seed LLM to generate contrasting response pairs, where one is intentionally inferior.

→ It then uses an LLM-as-Judge approach to generate reasoning traces and final judgments for these synthetic pairs.

→ The system filters correct judgments and uses them to train an improved evaluator model.

→ This process repeats iteratively, with each iteration using the improved model to generate better synthetic training data (see the sketch below)
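Putting those steps together, here is a structural sketch of one data-construction round. The function names (`generate`, `judge`) are illustrative stand-ins for LLM calls, not the paper's actual code:

```python
from typing import Callable, List, Tuple

def build_synthetic_judgments(
    instructions: List[str],
    generate: Callable[[str], str],         # seed/current LLM completion call
    judge: Callable[[str, str, str], str],  # current evaluator M_i: returns "A" or "B"
    n_samples: int = 5,
) -> List[Tuple[str, str, str]]:
    """One iteration of synthetic training-data construction."""
    data = []
    for x in instructions:
        x_mod = generate(f"Write an instruction similar to, but subtly different from: {x}")
        good = generate(x)      # response to the original instruction
        bad = generate(x_mod)   # answering the modified instruction -> inferior for x
        for _ in range(n_samples):
            verdict = judge(x, good, bad)
            if verdict == "A":  # keep only judgments that correctly prefer `good`
                data.append((x, good, bad))
    return data

# Each round: fine-tune M_{i+1} on `data`, then use it as `judge` for the next round.
```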

-----

Key Insights from this Paper 💡:

→ Human preference data isn't necessary for training strong LLM evaluators

→ Synthetic data generation with iterative self-improvement can match human-supervised approaches

→ Different data sources (safety, math, coding) improve performance in their respective domains

-----

Results 📊:

→ Improved RewardBench accuracy from 75.4 to 88.3 (88.7 with majority voting)

→ Outperformed GPT-4 (84.3) and matched top reward models trained with human data

→ Achieved 79.5% agreement with human judgments on MT-Bench using majority voting
The diagram shows how an AI system learns to evaluate responses without human help, using an iterative training process:

1. Input Stage 🎯
- It starts with a prompt (x)
- Creates a similar but slightly different version of that prompt (x')

2. Response Generation 🔄
- The system uses an LLM to create two responses:
- A "good" response to the original prompt
- A "bad" response by answering the modified prompt

3. Judgment Phase 📊
- An AI judge (Mi) evaluates these responses
- It samples multiple judgments about which response is better
- The system selects only the correct verdicts

4. Training Loop ⚙️
- These judgments are collected as training data
- The system uses this data to train an improved version of itself (Mi+1)
- This new, better model becomes the judge for the next round

Think of it like a student who:
1. Creates their own practice problems
2. Solves them in both good and not-so-good ways
3. Learns to tell the difference between good and bad solutions
4. Uses this knowledge to get even better at judging solutions

The key innovation is that this entire process runs automatically, without needing humans to say which answers are good or bad. The system teaches itself to become a better evaluator through practice and iteration.
Paper Title: "Self-Taught Evaluators"

I generated a podcast on this paper with Google's Illuminate (below).
