KL Divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in data. The most important metric in information theory is called Entropy.
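Entropy measures the average amount of information carried by a random variable: H(X) = -Σ_x p(x) log p(x).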
DeepSeek R1 was just the start—this new Chinese research from @Kimi_Moonshot lets RAG AI agents devour entire codebases and documentation with no context limits.
Mixture of Experts and Sparse attention make near-infinite context possible.
🧵1/n
📌 Challenge of Long-Context Attention
Transformers still face heavy computational loads when sequences become extremely long. The default attention pattern compares every token with every other token, creating costs that scale quadratically with sequence length. This overhead becomes problematic when reading entire codebases, multi-chapter documents, or large legal texts.
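At a context length of 1 million tokens, that works out to roughly 10^12 query-key comparisons per layer.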
📌 Mixture of Block Attention (MoBA)
MoBA applies Mixture of Experts ideas to attention. The model divides the input sequence into blocks, then a trainable gating function computes an affinity score between each query token and each block. Only the highest-scoring blocks are used in attention, which removes the need to attend to every token in the full sequence.
Blocks are defined by segmenting the sequence into equal spans. Each query looks at a pooled representation of the keys in each block (for example, by mean-pooling), ranks their importance, and picks a few blocks for detailed attention. The block that contains the query is always included. A causal mask ensures tokens never see future information, preserving left-to-right generation.
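A tiny sketch of just that gating step (single head, illustrative shapes):

```python
import torch

seq_len, d, block_size, top_k = 8192, 64, 512, 3
n_blocks = seq_len // block_size

Q = torch.randn(seq_len, d)
K = torch.randn(seq_len, d)

# Condensed representation of each block: mean-pooled keys.
block_repr = K.view(n_blocks, block_size, d).mean(dim=1)   # [n_blocks, d]

# Relevance of every query to every block, then keep the top-k blocks
# per query, no matter how far away they sit in the sequence.
scores = Q @ block_repr.T                                   # [seq_len, n_blocks]
chosen_blocks = scores.topk(top_k, dim=-1).indices          # [seq_len, top_k]
```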
📌 Seamless Switch between Sparse and Full Attention
MoBA replaces normal attention without changing parameter counts. It remains compatible with standard Transformer interfaces, so it can switch between sparse and full attention in different layers or during different training phases. Some layers might keep full attention for specialized tasks (like supervised fine-tuning) while most layers use MoBA to cut costs.
📌 This fits into a larger Transformer stack by replacing standard attention calls. The gating ensures each query focuses on a manageable subset of blocks. Causality is handled by filtering out blocks in the future and by applying local masks within the query’s current block.
📌 The figure below shows queries being routed to only a few “expert” blocks of keys/values instead of the entire sequence. The gating mechanism assigns each query to the most relevant blocks, which cuts the attention cost from quadratic to sub-quadratic.
📌 The gating mechanism computes a relevance score between each query and a condensed representation of each block. It then picks the top‑k blocks for every query, regardless of how far away those blocks are in the sequence.
Because each query only processes a few blocks, the computation remains sub‑quadratic, yet the model can still jump to distant tokens if the gating scores indicate high relevance.
🧵2/n
A PyTorch implementation, described below.
The implementation splits the keys and values into blocks, computes a mean-pooled representation of each block, and calculates gating scores (S) by multiplying Q with that pooled representation.
📌 It then applies a causal mask so queries cannot attend to future blocks, uses a top‑k operator to pick the most relevant blocks for each query, and organizes the data for efficient attention computation.
📌 FlashAttention is applied separately to the self-attention block (current positions) and the MoBA-selected blocks, and the outputs are finally merged using an online softmax.
📌 The result is a sparse attention mechanism that preserves causal structure and captures long-range dependencies without incurring the full quadratic cost of standard attention.
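A minimal, self-contained PyTorch sketch of those steps, with plain masked attention standing in for the FlashAttention + online-softmax merge, a single head, and illustrative names throughout:

```python
import torch
import torch.nn.functional as F

def moba_attention(Q, K, V, block_size=512, top_k=3):
    """Simplified MoBA: each query attends causally to its own block plus
    the top-k highest-scoring past blocks. Q, K, V: [seq_len, d];
    assumes seq_len is divisible by block_size."""
    seq_len, d = Q.shape
    n_blocks = seq_len // block_size
    device = Q.device
    pos = torch.arange(seq_len, device=device)
    q_block = pos // block_size                               # block index of each token

    # 1) Condensed (mean-pooled) key representation per block.
    block_repr = K.view(n_blocks, block_size, d).mean(dim=1)  # [n_blocks, d]

    # 2) Gating scores between every query and every block.
    S = Q @ block_repr.T                                      # [seq_len, n_blocks]

    # 3) Causal block mask: queries may only route to strictly past blocks;
    #    the query's own block is added back in step 4.
    block_id = torch.arange(n_blocks, device=device)
    S = S.masked_fill(block_id[None, :] >= q_block[:, None], float("-inf"))

    # 4) Top-k block selection per query, plus the current block.
    top_idx = S.topk(min(top_k, n_blocks), dim=-1).indices
    selected = torch.zeros(seq_len, n_blocks, dtype=torch.bool, device=device)
    selected.scatter_(1, top_idx, torch.ones_like(top_idx, dtype=torch.bool))
    selected &= torch.isfinite(S)           # drop -inf picks for early queries
    selected[pos, q_block] = True           # always include the current block

    # 5) Token-level visibility: a key is visible if its block was selected
    #    and it is not in the query's future (causal mask).
    visible = selected[:, q_block] & (pos[None, :] <= pos[:, None])  # [seq_len, seq_len]

    # 6) Plain masked attention. The real implementation runs FlashAttention
    #    on the current block and the selected blocks separately and merges
    #    the partial outputs with an online softmax.
    attn = (Q @ K.T) / d ** 0.5
    attn = attn.masked_fill(~visible, float("-inf"))
    return F.softmax(attn, dim=-1) @ V

out = moba_attention(torch.randn(2048, 64), torch.randn(2048, 64), torch.randn(2048, 64))
```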
This code combines mixture-of-experts logic with sparse attention so each query only attends to a few blocks.
The gating mechanism scores each block against the query and selects the top‑k “experts,” reducing the number of key/value comparisons.
This keeps attention overhead sub‑quadratic, making it feasible to handle extremely long inputs without blowing up in compute or memory.
At the same time, the gating ensures queries can still attend to distant tokens when necessary, preserving the Transformer’s capacity for global context.
This block‑and‑gating strategy is how MoBA achieves near‑infinite context in LLMs.
🧵3/n
Experimental Observations
Models using MoBA show language modeling losses and downstream task performance nearly matching full attention. Results stay consistent even at context lengths of hundreds of thousands or millions of tokens. Experiments with “trailing token” evaluations confirm that important faraway dependencies can still be captured when queries identify relevant blocks.
Scalability tests indicate a sub-quadratic cost curve. Researchers report up to six-fold speedups at one million tokens and even larger gains beyond that range.
MoBA remains memory-friendly by avoiding a full attention matrix and by using standard GPU kernels for block-based computations.
NVIDIA + Arc Institute's new model Evo 2 just demonstrated that deep learning can directly model biological function.
It stands as a breakthrough in computational biology.
🧵 1/n
Evo 2 just redefined genomic modeling by processing over 9 trillion nucleotides to seamlessly connect molecular detail with genome-scale structure.
What's more, the entire model, training code, inference code, and curated OpenGenome2 dataset are released under open terms to accelerate progress in AI-driven genomics.
--------
Genome engineering efforts need a general-purpose model that can capture molecular, cellular, and organism-level features from DNA alone. This project addresses that gap by creating Evo 2, a foundation model trained on over 9 trillion DNA bases, covering bacteria, archaea, eukaryotes, and phage.
Its 1-million-token context window ensures that both local motifs and long-range dependencies are captured in a single pass. This design allows Evo 2 to model everything from single-nucleotide mutations to whole-genome architecture without task-specific tuning.
It learns diverse genetic patterns without labels or alignments, working at scales from small coding regions to entire genomes.
--------
What's the key benefit for us?
It means that Evo 2 automatically detects key genetic signals and accurately predicts how various mutations impact molecular and organismal function.
The model's breakthroughs can lead to better disease diagnosis, more effective treatments, and improved agricultural or environmental solutions.
🧵 2/n
📌 Model Architecture and Training Pipeline
StripedHyena 2 forms the core of Evo 2. It is a multi-hybrid convolutional architecture, mixing short, medium, and long input-dependent convolution layers with attention blocks.
This design handles sequences of up to 1 million tokens.
Training proceeded in two stages: a pretraining phase (8,192-token context) followed by midtraining that progressively extended context length (up to 1M tokens).
Data weighting placed extra emphasis on functionally dense regions (genic windows) before switching to full-genome segments.
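Purely as a reading aid, here is that two-stage recipe written out as a hypothetical config; the field names are illustrative, not the released training code:

```python
# Hypothetical sketch of the two-stage schedule described above.
evo2_training_plan = [
    {
        "stage": "pretraining",
        "context_length": 8_192,
        "data_emphasis": "functionally dense genic windows",
    },
    {
        "stage": "midtraining",
        # context length progressively extended up to 1M tokens
        "context_length": "8_192 -> ... -> 1_000_000",
        "data_emphasis": "full-genome segments",
    },
]
```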
🧵 3/n
📌 Zero-Shot Mutation Effect Predictions
Evo 2 captures fundamental coding and regulatory features. It recognizes start/stop codons and triplet periodicity and discerns functional disruptions caused by point mutations, frameshifts, or stop-codon insertions.
Prokaryotic and eukaryotic fitness screens confirmed that Evo 2’s likelihood scores correlate well with experimentally measured mutational effects in proteins, RNA molecules, and entire organisms. It also surpasses older genome-scale models on multiple regulatory variant benchmarks.
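The zero-shot recipe behind such scores is, in principle, just a likelihood comparison: score the reference and the mutated sequence and take the difference. A minimal sketch, assuming some `score_sequence` callable that returns a sequence's log-likelihood under the model (a hypothetical helper, not the actual Evo 2 API):

```python
def variant_effect(score_sequence, ref_seq: str, pos: int, alt_base: str) -> float:
    """Zero-shot variant effect score: log P(mutant) - log P(reference).
    `score_sequence` is a hypothetical log-likelihood helper; more negative
    outputs suggest a more disruptive mutation."""
    mut_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return score_sequence(mut_seq) - score_sequence(ref_seq)

# Toy usage with a stand-in scorer (a real run would call Evo 2 instead).
toy_scorer = lambda seq: -0.1 * len(seq) - 2.0 * seq.count("N")
print(variant_effect(toy_scorer, "ATGGCCATTGTAATG", pos=3, alt_base="N"))
```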
And this is with @firecrawl_dev Extract, the new feature they just launched. I'm finding it incredibly helpful in my daily work.
🧵1/n
It reimagines web scraping. Using natural language, you can now extract data from single pages, entire domains (with wildcards), and even JavaScript-heavy sites – all without scripting.
Open beta is live, and it's one of the biggest simplifications of web scraping yet.
No more fighting with selectors and XPath queries. Firecrawl Extract uses the power of LLMs to understand your data needs and intelligently pull information from the web, turning messy HTML into clean, structured data ready for your applications.
Imagine telling a tool, "Extract the product name, price, and customer reviews from this page," and having it deliver exactly that – in a neat, structured format like JSON.
What Makes Extract so Powerful?
It's a smart data extraction engine.
- Adaptable to Website Changes: Websites are constantly evolving. Traditional scripts break when layouts change. Extract is designed to be more resilient and adapt to minor website tweaks without needing constant script rewrites.
- Scalable Data Collection: Extract isn't limited to single pages. You can target multiple URLs, entire domains using wildcards, and even leverage web search to enrich your data.
- Seamless Integration: It offers:
→ Zapier Integration: Connect Extract to thousands of apps for automated workflows, data enrichment, and pushing data into your favorite CRMs or spreadsheets – all without writing a single line of code.
→ Python and Node.js SDKs: For developers who want more control, SDKs provide easy integration into existing projects.
- Handles Dynamic Content: Websites are increasingly dynamic, relying heavily on JavaScript. Extract leverages Firecrawl's robust `/scrape` endpoint to render JavaScript-heavy pages, ensuring you capture data even from complex modern websites.
- Extract can be used to efficiently gather datasets from the web for LLM training, handling multilingual sites and dynamic content like prices and inventory.
🧵 2/n
This example uses DeepSeek R1 as a web crawler with @firecrawl_dev's /extract.
Watch R1 select URLs and filter results while /extract scans the websites for structured data.
Imagine you want to extract key information from the Firecrawl homepage. You could ask Extract to find the company mission, whether they support SSO, are open source, and if they are part of Y Combinator.
You can define your request using a simple schema or just a natural language prompt. Let's look at an example response structure:
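A rough Python sketch of such a request; the endpoint path and field names here are my assumptions based on the feature description, so check the Firecrawl docs for the exact contract:

```python
import requests

# Assumed endpoint and payload shape; verify against the official Firecrawl docs.
resp = requests.post(
    "https://api.firecrawl.dev/v1/extract",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "urls": ["https://firecrawl.dev"],
        "prompt": "Extract the company mission, whether they support SSO, "
                  "whether they are open source, and whether they are in Y Combinator.",
        "schema": {
            "type": "object",
            "properties": {
                "company_mission": {"type": "string"},
                "supports_sso": {"type": "boolean"},
                "is_open_source": {"type": "boolean"},
                "is_in_yc": {"type": "boolean"},
            },
        },
    },
)
print(resp.json())  # structured JSON matching the schema
```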
This simple code snippet sends a request to Firecrawl Extract with the target URL and your desired data points described in the `schema`. The response contains the structured data as JSON, ready to drop into your application.
Transformer predicts brain activity patterns 5 seconds into the future using just 21 seconds of fMRI data
Achieves 0.997 correlation using modified time-series Transformer architecture
-----
🧠 Original Problem:
Predicting future brain states from fMRI data remains challenging, especially for patients who can't undergo long scanning sessions. Current methods require extensive scan times and lack accuracy in short-term predictions.
-----
🔬 Solution in this Paper:
→ The paper introduces a modified time series Transformer with 4 encoder and 4 decoder layers, each containing 8 attention heads
→ The model takes a 30-timepoint window covering 379 brain regions as input and predicts the next brain state
→ Training uses Human Connectome Project data from 1003 healthy adults, with preprocessing including spatial smoothing and bandpass filtering
→ Unlike traditional approaches, this model omits look-ahead masking, simplifying prediction for single future timepoints (see the sketch below)
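A minimal PyTorch sketch with the hyperparameters listed above; the 379-to-384 input projection (379 isn't divisible by 8 heads) and the choice to feed the last observed state as the decoder query are my assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class BrainStateTransformer(nn.Module):
    """Sketch: 4 encoder + 4 decoder layers, 8 heads, input windows of
    30 timepoints x 379 brain regions, predicting one future timepoint."""
    def __init__(self, n_regions=379, d_model=384, nhead=8, nlayers=4):
        super().__init__()
        self.inp = nn.Linear(n_regions, d_model)   # 379 -> 384 projection (assumption)
        self.core = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=nlayers, num_decoder_layers=nlayers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, n_regions)

    def forward(self, window):                     # window: [batch, 30, 379]
        x = self.inp(window)
        # No look-ahead mask: a single future timepoint is predicted, with
        # the last observed state used as the decoder query (assumption).
        y = self.core(src=x, tgt=x[:, -1:, :])
        return self.out(y)                         # [batch, 1, 379]

pred = BrainStateTransformer()(torch.randn(2, 30, 379))   # next-brain-state estimate
```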
-----
🎯 Key Insights:
→ Temporal dependencies in brain states can be effectively captured using self-attention mechanisms
→ Short input sequences (21.6s) suffice for accurate predictions
→ Error accumulation follows a Markov chain pattern in longer predictions
→ The model preserves functional connectivity patterns matching known brain organization
-----
📊 Results:
→ Single timepoint prediction achieves MSE of 0.0013
→ Accurate predictions up to 5.04 seconds with correlation >0.85
→ First 7 predicted timepoints maintain high accuracy
→ Outperforms BrainLM with 20-timepoint MSE of 0.26 vs 0.568
Paper Title: "Predicting Human Brain States with Transformer"
Generated the podcast below on this paper with Google's Illuminate.