Rohan Paul
May 28, 2022 · 6 tweets
Kullback-Leibler (KL) Divergence - A Thread

It measures how one probability distribution diverges from a second, reference probability distribution.

KL divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in data. The most important quantity in information theory is entropy.

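To make both quantities concrete, here is a minimal NumPy/SciPy sketch; the distributions p and q are made-up illustrative values. Entropy H(p) = -Σ p log p measures the average information in a distribution, and KL divergence D_KL(p‖q) = Σ p log(p/q) measures the extra information incurred when q is used to approximate p.

```python
import numpy as np
from scipy.stats import entropy

# Two illustrative discrete distributions over the same 3 outcomes
# (made-up numbers, just for demonstration).
p = np.array([0.10, 0.40, 0.50])
q = np.array([0.80, 0.15, 0.05])

# Entropy H(p) = -sum p * log p   (in nats here; pass base=2 for bits)
H_p = entropy(p)                      # same as -(p * np.log(p)).sum()

# KL divergence D_KL(p || q) = sum p * log(p / q)
kl_pq = entropy(p, q)                 # scipy computes KL when given two arguments
kl_qp = entropy(q, p)                 # note: KL is not symmetric

print(f"H(p)       = {H_p:.4f} nats")
print(f"KL(p || q) = {kl_pq:.4f} nats")
print(f"KL(q || p) = {kl_qp:.4f} nats")
```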


More from @rohanpaul_ai

Feb 20
DeepSeek R1 was just the start—this new Chinese research from @Kimi_Moonshot lets RAG AI agents devour entire codebases and documentation with no context limits.

Mixture of Experts and Sparse attention make near-infinite context possible.

🧵1/n

📌 Challenge of Long-Context Attention

Transformers still face heavy computational loads when sequences become extremely large. The default attention pattern compares every token with every other token, creating costs that scale quadratically. This overhead becomes problematic when reading entire codebases, multi-chapter documents, or large legal texts.
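As a back-of-the-envelope illustration (the block size and top-k below are assumed values, not the paper's exact settings), compare the number of query-key comparisons at a 1M-token context:

```python
# Rough comparison of query-key pair counts.
# block_size and top_k are illustrative assumptions, not the paper's settings.
seq_len = 1_000_000
block_size = 4_096
top_k = 3

full_pairs = seq_len * seq_len                   # every query attends to every key
sparse_pairs = seq_len * (top_k * block_size)    # each query attends to a few blocks only

print(f"full attention: {full_pairs:.2e} pairs")
print(f"block-sparse:   {sparse_pairs:.2e} pairs")
print(f"reduction:      ~{full_pairs / sparse_pairs:.0f}x fewer comparisons")
```

The wall-clock speedups reported later in the thread (about six-fold at 1M tokens) are smaller than this raw pair-count reduction, because gating overhead, memory movement, and kernel efficiency also matter.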

📌Mixture of Block Attention (MoBA)

MoBA applies Mixture of Experts ideas to attention. The model divides input sequences into blocks, then a trainable gating function computes an affinity score between each query token and each block. Only the highest-scoring blocks get used in the attention, which removes the need to attend to every token in the full sequence.

Blocks are defined by segmenting the sequence into equal spans. Each query looks at a pooled representation of the keys in each block (for example, by mean-pooling), ranks their importance, and picks a few blocks for detailed attention. The block that contains the query is always included. A causal mask ensures tokens never see future information, preserving left-to-right generation.

📌Seamless Switch between Sparse and Full Attention

MoBA replaces normal attention without changing parameter counts. It remains compatible with standard Transformer interfaces, so it can switch between sparse and full attention in different layers or during different training phases. Some layers might keep full attention for specialized tasks (like supervised fine-tuning) while most layers use MoBA to cut costs.

📌 This fits into a larger Transformer stack by replacing standard attention calls. The gating ensures each query focuses on a manageable subset of blocks. Causality is handled by filtering out blocks in the future and by applying local masks within the query’s current block.

📌 The figure in the original tweet shows queries being routed to only a few "expert" blocks of keys/values instead of the entire sequence. The gating mechanism assigns each query to the most relevant blocks, which cuts attention computation from quadratic to sub-quadratic.

📌 The gating mechanism computes a relevance score between each query and a condensed representation of each block. It then picks the top‑k blocks for every query, regardless of how far away those blocks are in the sequence.

Because each query only processes a few blocks, the computation remains sub‑quadratic, yet the model can still jump to distant tokens if the gating scores indicate high relevance.
🧵2/n

A PyTorch implementation is described below.

This pseudocode splits the keys and values into blocks, computes a mean-pooled representation of each block, and calculates gating scores (S) by multiplying Q with that pooled representation.

📌 It then applies a causal mask so queries cannot attend to future blocks, uses a top‑k operator to pick the most relevant blocks for each query, and organizes the data for efficient attention computation.

📌FlashAttention is applied separately to the self-attention block (current positions) and the MoBA-selected blocks, and the outputs are finally merged using an online softmax.

📌The result is a sparse attention mechanism that preserves causal structure and captures long-range dependencies without incurring the full quadratic cost of standard attention.

This code combines mixture-of-experts logic with sparse attention so each query only attends to a few blocks.

The gating mechanism scores each block against the query and selects the top‑k “experts,” reducing the number of key/value comparisons.

This keeps attention overhead sub‑quadratic, making it feasible to handle extremely long inputs without blowing up in compute or memory.

At the same time, the gating ensures queries can still attend to distant tokens when necessary, preserving the Transformer’s capacity for global context.

This block‑and‑gating strategy is how MoBA achieves near‑infinite context in LLMs.
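The code in the original tweet is shared as an image, so here is a from-scratch PyTorch sketch of the block-gating idea described above. The block size, top-k, and smoke-test shapes are illustrative assumptions, and the FlashAttention/online-softmax merge is replaced with a plain softmax over the gathered blocks for readability; this is not the authors' implementation.

```python
import torch

def moba_attention(q, k, v, block_size=64, top_k=2):
    """Simplified Mixture of Block Attention (MoBA) sketch.

    q, k, v: [batch, seq_len, dim]. Illustrative and un-optimized: the real
    implementation fuses block selection with FlashAttention kernels and
    merges partial results with an online softmax.
    """
    B, T, D = q.shape
    assert T % block_size == 0, "sketch assumes seq_len divisible by block_size"
    n_blocks = T // block_size

    # Mean-pooled key representation per block: [B, n_blocks, D]
    k_blocks = k.view(B, n_blocks, block_size, D).mean(dim=2)

    # Gating scores: affinity of every query to every block -> [B, T, n_blocks]
    gate = torch.einsum("btd,bnd->btn", q, k_blocks)

    # Block-level causal mask: a query may never select a future block.
    q_block_id = torch.arange(T, device=q.device) // block_size      # [T]
    block_id = torch.arange(n_blocks, device=q.device)               # [n_blocks]
    future = block_id[None, :] > q_block_id[:, None]                 # [T, n_blocks]
    gate = gate.masked_fill(future[None], float("-inf"))

    # The query's own block is always included; then pick the top-k blocks.
    own = block_id[None, :] == q_block_id[:, None]
    gate = gate.masked_fill(own[None], float("inf"))
    sel = gate.topk(top_k, dim=-1).indices                           # [B, T, top_k]

    # Gather the selected blocks' keys/values for every query.
    k_split = k.view(B, n_blocks, block_size, D)
    v_split = v.view(B, n_blocks, block_size, D)
    idx = sel[..., None, None].expand(B, T, top_k, block_size, D)
    k_sel = k_split[:, None].expand(B, T, n_blocks, block_size, D).gather(2, idx)
    v_sel = v_split[:, None].expand(B, T, n_blocks, block_size, D).gather(2, idx)
    k_sel = k_sel.reshape(B, T, top_k * block_size, D)
    v_sel = v_sel.reshape(B, T, top_k * block_size, D)

    # Token-level causal mask inside the selected blocks (needed for the
    # query's own block, where future positions must still be hidden).
    key_pos = (sel[..., None] * block_size
               + torch.arange(block_size, device=q.device)).reshape(B, T, -1)
    q_pos = torch.arange(T, device=q.device)[None, :, None]
    attn_mask = key_pos > q_pos                                      # True = masked

    scores = torch.einsum("btd,btkd->btk", q, k_sel) / D ** 0.5
    scores = scores.masked_fill(attn_mask, float("-inf"))
    weights = scores.softmax(dim=-1)
    return torch.einsum("btk,btkd->btd", weights, v_sel)

# Tiny smoke test with random tensors.
q = torch.randn(1, 256, 32)
k = torch.randn(1, 256, 32)
v = torch.randn(1, 256, 32)
out = moba_attention(q, k, v, block_size=64, top_k=2)
print(out.shape)  # torch.Size([1, 256, 32])
```

Forcing the query's own block into the selection (the `inf` fill) mirrors the rule in the thread that the block containing the query is always attended to, while the token-level mask preserves left-to-right generation within it.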
🧵3/n

Experimental Observations

Models using MoBA show language modeling losses and downstream task performance nearly matching full attention. Results stay consistent even at context lengths of hundreds of thousands or millions of tokens. Experiments with “trailing token” evaluations confirm that important faraway dependencies can still be captured when queries identify relevant blocks.

Scalability tests indicate a sub-quadratic cost curve. Researchers report up to six-fold speedups at one million tokens and even larger gains beyond that range.

MoBA remains memory-friendly by avoiding a full attention matrix and by using standard GPU kernels for block-based computations.
Feb 20
NVIDIA + Arc Institute's new model Evo 2 just demonstrated that deep learning can directly model biological function

It stands as a breakthrough in computational biology.

🧵 1/n

Evo 2 just redefined genomic modeling by processing over 9 trillion nucleotides to seamlessly connect molecular detail with genome-scale structure.

What's more, the entire model, training code, inference code, and curated OpenGenome2 dataset are released under open terms to accelerate progress in AI-driven genomics.

--------

Genome engineering efforts need a general-purpose model that can capture molecular, cellular, and organism-level features from DNA alone. This project addresses that gap by creating Evo 2, a foundation model trained on over 9 trillion DNA bases, covering bacteria, archaea, eukaryotes, and phage.

Its capacity for a 1-million token context window ensures that both local motifs and long-range dependencies are captured in a single pass. This design allows Evo 2 to model everything from single-nucleotide mutations to whole-genome architecture without task-specific tuning.

It learns diverse genetic patterns without labels or alignments, working at scales from small coding regions to entire genomes.

--------

What's the key benefit for us?

It means that Evo 2 automatically detects key genetic signals and accurately predicts how various mutations impact molecular and organismal function.

The model's breakthroughs can lead to better disease diagnosis, more effective treatments, and improved agricultural or environmental solutions.
🧵 2/n

📌 Model Architecture and Training Pipeline

StripedHyena 2 forms the core of Evo 2. It is a multi-hybrid convolutional architecture, mixing short, medium, and long input-dependent convolution layers with attention blocks.

This design handles sequences of up to 1 million tokens.

Training proceeded in two stages: a pretraining phase (8,192-token context) followed by midtraining that progressively extended context length (up to 1M tokens).

Data weighting placed extra emphasis on functionally dense regions (genic windows) before switching to full-genome segments.
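As a purely illustrative sketch of that staged recipe (the intermediate context lengths below are placeholders, not the published configuration), the schedule can be thought of as a list of stages:

```python
# Illustrative sketch of the two-stage schedule described above.
# Intermediate context lengths and data-weighting notes are placeholders,
# not the published training configuration.
training_stages = [
    {
        "name": "pretraining",
        "context_length": 8_192,
        "data_weighting": "emphasize functionally dense (genic) windows",
    },
    {
        "name": "midtraining",
        "context_length": [16_384, 131_072, 1_048_576],  # progressively extended
        "data_weighting": "full-genome segments",
    },
]

for stage in training_stages:
    print(stage["name"], "->", stage["context_length"])
```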
🧵 3/n

📌 Zero-Shot Mutation Effect Predictions

Evo 2 captures fundamental coding and regulatory features. It recognizes start/stop codons and triplet periodicity and discerns functional disruptions caused by point mutations, frameshifts, or stop-codon insertions.

Prokaryotic and eukaryotic fitness screens confirmed that Evo 2’s likelihood scores correlate well with experimentally measured mutational effects in proteins, RNA molecules, and entire organisms. It also surpasses older genome-scale models on multiple regulatory variant benchmarks.
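A minimal sketch of that zero-shot scoring recipe: a variant is scored by the change in sequence log-likelihood between the mutated and the reference sequence. The toy likelihood function below is a stand-in so the example runs on its own; with Evo 2 you would plug in the model's own sequence log-likelihood.

```python
import math
import random

def variant_effect_score(log_likelihood, ref_seq, pos, alt_base):
    """Score a point mutation as the change in sequence log-likelihood.
    `log_likelihood` is any callable mapping a DNA string to a log-probability;
    a more negative score suggests a more disruptive variant."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)

# Stand-in scorer (NOT Evo 2): a toy independent-nucleotide model with a
# mild G/C preference, just so the sketch is runnable end to end.
toy_logp = {"A": math.log(0.2), "C": math.log(0.3), "G": math.log(0.3), "T": math.log(0.2)}
def toy_log_likelihood(seq):
    return sum(toy_logp[b] for b in seq)

random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(60))
print(variant_effect_score(toy_log_likelihood, ref, pos=10, alt_base="G"))
```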
Jan 26
DeepSeek R1 running locally - Full setup guide
The model is DeepSeek R1 Distill Qwen 7B
Jan 24
One prompt. Structured data. From any website.

And this is with @firecrawl_dev Extract, the new feature they just launched. I'm finding it incredibly helpful in my daily work.

🧵1/n

It reimagines web scraping. Using natural language, you can now extract data from single pages, entire domains (with wildcards), and even JavaScript-heavy sites – all without scripting.

Open beta is live, and it's one of the biggest simplifications of the web-scraping workflow yet.

No more fighting with selectors and XPath queries. Firecrawl Extract uses the power of LLMs to understand your data needs and intelligently pull information from the web, turning messy HTML into clean, structured data ready for your applications.

Imagine telling a tool, "Extract the product name, price, and customer reviews from this page," and having it deliver exactly that – in a neat, structured format like JSON.

What Makes Extract so Powerful?

It's a smart data extraction engine.

- Adaptable to Website Changes: Websites are constantly evolving. Traditional scripts break when layouts change. Extract is designed to be more resilient and adapt to minor website tweaks without needing constant script rewrites.

- Scalable Data Collection: Extract isn't limited to single pages. You can target multiple URLs, entire domains using wildcards, and even leverage web search to enrich your data.

- Seamless Integration: It offers:
→ Zapier Integration: Connect Extract to thousands of apps for automated workflows, data enrichment, and pushing data into your favorite CRMs or spreadsheets – all without writing a single line of code.
→ Python and Node.js SDKs: For developers who want more control, SDKs provide easy integration into existing projects.

- Handles Dynamic Content: Websites are increasingly dynamic, relying heavily on JavaScript. Extract leverages Firecrawl's robust `/scrape` endpoint to render JavaScript-heavy pages, ensuring you capture data even from complex modern websites.

- Extract can be used to efficiently gather datasets from the web for LLM training, handling multilingual sites and dynamic content like prices and inventory.
🧵 2/n

This example uses DeepSeek R1 as a web crawler with @firecrawl_dev's /extract.
Watch R1 select URLs and filter results while /extract scans the websites for structured data.
🧵 3/n

Check out more details here:
firecrawl.dev/extract

Basic Data Extraction from a Single URL

Imagine you want to extract key information from the Firecrawl homepage. You could ask Extract to find the company mission, whether they support SSO, are open source, and if they are part of Y Combinator.

You can define your request using a simple schema or just a natural language prompt. Let's look at an example response structure:

```json
{
"company_mission": "...",
"supports_sso": false,
"is_open_source": true,
"is_in_yc": true
}
```

Using the Firecrawl SDK (Python example):

This simple code snippet sends a request to Firecrawl Extract with the target URL and your desired data points described in the `schema`. The response contains the structured data as JSON, just like the example shown above.
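Since the snippet itself was shared as an image, here is a hedged reconstruction of what such a call looks like with the firecrawl-py SDK. The method name and parameter layout here are from memory and may differ from the current SDK, so treat this as illustrative and check firecrawl.dev/extract for the authoritative example.

```python
# Hedged sketch of the kind of call the tweet describes -- not guaranteed to
# match the current firecrawl-py API exactly.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")  # placeholder key

schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
        "is_open_source": {"type": "boolean"},
        "is_in_yc": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"],
}

result = app.extract(
    ["https://firecrawl.dev/"],
    {
        "prompt": "Extract the company mission and whether it supports SSO, "
                  "is open source, and is part of Y Combinator.",
        "schema": schema,
    },
)
print(result)  # structured JSON matching the schema above
```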
Jan 17
Your brain's next 5 seconds, predicted by AI

Transformer predicts brain activity patterns 5 seconds into future using just 21 seconds of fMRI data

Achieves 0.997 correlation using modified time-series Transformer architecture

-----

🧠 Original Problem:

Predicting future brain states from fMRI data remains challenging, especially for patients who can't undergo long scanning sessions. Current methods require extensive scan times and lack accuracy in short-term predictions.

-----

🔬 Solution in this Paper:

→ The paper introduces a modified time series Transformer with 4 encoder and 4 decoder layers, each containing 8 attention heads

→ The model takes a 30-timepoint window covering 379 brain regions as input and predicts the next brain state

→ Training uses Human Connectome Project data from 1003 healthy adults, with preprocessing including spatial smoothing and bandpass filtering

→ Unlike traditional approaches, this model omits look-ahead masking, simplifying prediction for single future timepoints (see the sketch after this list)
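
A minimal PyTorch sketch of that setup, using the hyperparameters listed above (4 encoder and 4 decoder layers, 8 heads, 30-timepoint windows over 379 regions). The embedding width, the linear projections, and the choice to feed the latest timepoint as the decoder query are my assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

N_REGIONS, WINDOW, D_MODEL = 379, 30, 512  # D_MODEL is an assumed width

class BrainStatePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(N_REGIONS, D_MODEL)     # embed each timepoint
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.out_proj = nn.Linear(D_MODEL, N_REGIONS)    # back to region space

    def forward(self, window):                           # window: [B, 30, 379]
        x = self.in_proj(window)
        # Feed the most recent timepoint as the single decoder query;
        # no look-ahead mask is needed when predicting one future step.
        tgt = x[:, -1:, :]
        h = self.transformer(src=x, tgt=tgt)
        return self.out_proj(h)                          # [B, 1, 379] next state

model = BrainStatePredictor()
pred = model(torch.randn(2, WINDOW, N_REGIONS))
print(pred.shape)  # torch.Size([2, 1, 379])
```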

-----

🎯 Key Insights:

→ Temporal dependencies in brain states can be effectively captured using self-attention mechanisms

→ Short input sequences (21.6s) suffice for accurate predictions

→ Error accumulation follows a Markov chain pattern in longer predictions

→ The model preserves functional connectivity patterns matching known brain organization

-----

📊 Results:

→ Single timepoint prediction achieves MSE of 0.0013

→ Accurate predictions up to 5.04 seconds with correlation >0.85

→ First 7 predicted timepoints maintain high accuracy

→ Outperforms BrainLM with 20-timepoint MSE of 0.26 vs 0.568
Paper Title: "Predicting Human Brain States with Transformer"

Generated a podcast on this paper with Google's Illuminate.
