KL Divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in data. The most important metric in information theory is called Entropy, which measures the average amount of information (or "surprise") carried by a distribution.
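A quick numpy sketch of the two quantities, entropy H(p) = -Σ p(x) log p(x) and KL divergence D_KL(p‖q) = Σ p(x) log(p(x)/q(x)); the distributions here are just toy examples:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) * log p(x), in nats (terms with p(x) = 0 contribute 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # approximating distribution
print(entropy(p))               # ~1.03 nats
print(kl_divergence(p, q))      # ~0.025 nats; exactly 0 only when p == q
```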
A huge 340-page report on AI trends, released by @bondcap.
Some wild findings from this report.
🧵1/n
🧵2/n
Meta’s Llama Downloads Exploded 3.4× in Eight Months.
That's an unprecedented developer adoption curve for any open-source LLM.
bondcap.com/reports/tai
🧵3/n
AI Chatbots Now Mistaken for Humans 73 Percent of the Time
In Q1 2025, testers mistook AI responses for human replies 73 percent of the time in Turing-style experiments. That's up from roughly 50 percent only six months earlier, showing how quickly models have learned to mimic human conversational nuance.
🚨 BREAKING: The first-ever agentic browser is here — and it's shockingly good.
Just tried @FellouAI, an AI browser that doesn't just assist you with browsing; it does the browsing for you.
It's like Chrome but with a brain—AI agents handle deep research and workflows solo.
Handles several projects in parallel.
A top-tier AI intern that takes care of all the dirty and tedious work so you don't have to. And it's 100% free.
1️⃣ Fellou's not just another browser; it's an Agentic assistant that acts for you.
2️⃣ It handles real tasks autonomously: research, cross-platform flows, and full automation.
3️⃣ Beyond browsing, into real action.
Fellou can automatically plan tasks, invoke tools, and execute actions to coordinate operations across multiple web interfaces, enabling various in-browser tasks. These include shopping, scheduling meetings, sending emails, and posting tweets based on webpage content.
It’s the first Agentic Browser — with deep research, tab-level collaboration, and seamless automation.
Deep Search acts like a smart intern: spins up five shadow browsers, digs across web and private platforms, and compiles richer insights fast. Highlights gaps and surfaces info you missed. Runs in parallel, won’t slow anything down.
Automated workflows: Replaces manual clicking with invisible ops across pages. Reduces drag, frees up hours.
Automation-aware browsing: Ask the page questions, reuse content in your drafts.
Wow. Now you can transcribe 60 minutes of audio in just 1 second with a completely open-source model 🤯
@nvidia just open-sourced Parakeet TDT 0.6B V2, a 600M parameter automatic speech recognition (ASR) model that tops the @huggingface Open-ASR leaderboard with RTFx 3380
It's open-sourced under CC-BY-4.0, ready for commercial use.
⚙️ The Details
→ Built on FastConformer encoder + TDT decoder, the model handles up to 24-minute audio chunks with full attention and outputs with punctuation, capitalization, and accurate word/char/segment timestamps.
→ It achieves RTFx 3380 at batch size 128 on the Open ASR leaderboard, but performance varies with audio duration and batch size.
→ Trained for 150K steps on 128 A100 GPUs, then fine-tuned on 500 hours of high-quality, human-transcribed English data.
→ Total training data spans 120K hours, combining human-labeled and pseudo-labeled sources, including LibriSpeech, Fisher, YTC, YODAS, and more.
→ Available via NVIDIA NeMo, optimized for GPU inference, and installable via pip install -U nemo_toolkit['asr'].
→ Compatible with Linux; runs on Ampere, Blackwell, Hopper, and Volta GPU architectures, with a minimum of 2 GB RAM required.
→ The Granary dataset used for training will be made public after Interspeech 2025.
How to Use this Model:
To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. It's recommended that you install it after installing the latest version of PyTorch.
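A minimal usage sketch, assuming the Hugging Face model ID nvidia/parakeet-tdt-0.6b-v2 and the standard NeMo ASR loading pattern (check the model card for the exact, current API):

```python
# Install first:  pip install -U "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Download the pretrained checkpoint and load it (uses a GPU if one is available)
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# Transcribe one or more 16 kHz mono audio files
output = asr_model.transcribe(["sample_audio.wav"])

# Recent NeMo versions return hypothesis objects with .text; older ones return plain strings.
# The model card also documents word/segment timestamps via transcribe(..., timestamps=True).
print(output[0].text if hasattr(output[0], "text") else output[0])
```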
Finally got access to @ManusAI_HQ, and calling it a "DeepSeek moment" is incorrect.
It's far more powerful. This is the world's top AI-driven computer.
Think Deep Research + Claude + OpenAI Operator… all on steroids.
Within the next 1 year
12 wild examples 🧵1/n
🧵2/n
Tesla FSD gets you there, Manus AI makes sure you have something to say.
ManusAI demonstrates self-correction by identifying issues and adapting solutions. Without this, AI agents risk compounding errors and becoming ineffective.
DeepSeek R1 was just the start—this new Chinese research from @Kimi_Moonshot lets RAG AI agents devour entire codebases and documentation with no context limits.
Mixture of Experts and Sparse attention make near-infinite context possible.
🧵1/n
📌 Challenge of Long-Context Attention
Transformers still face heavy computational loads when sequences become extremely large. The default attention pattern compares every token with every other token, creating costs that scale quadratically. This overhead becomes problematic when reading entire codebases, multi-chapter documents, or large legal texts.
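For reference, the dense pattern MoBA replaces looks like this in a single-head PyTorch sketch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    """Vanilla attention: every query scores every key, so time and memory grow as O(n^2)."""
    logits = q @ k.T / q.shape[-1] ** 0.5   # (n, n) score matrix -- the quadratic part
    return F.softmax(logits, dim=-1) @ v

# At 1M tokens, the (n, n) score matrix alone has 10^12 entries -- far too big to materialize.
```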
📌 Mixture of Block Attention (MoBA)
MoBA applies Mixture of Experts ideas to attention. The model divides input sequences into blocks, then a trainable gating function computes an affinity score between each query token and each block. Only the highest-scoring blocks get used in the attention, which removes the need to attend to every token in the full sequence.
Blocks are defined by segmenting the sequence into equal spans. Each query looks at a pooled representation of the keys in each block (for example, by mean-pooling), ranks their importance, and picks a few blocks for detailed attention. The block that contains the query is always included. A causal mask ensures tokens never see future information, preserving left-to-right generation.
📌 Seamless Switch between Sparse and Full Attention
MoBA replaces normal attention without changing parameter counts. It remains compatible with standard Transformer interfaces, so it can switch between sparse and full attention in different layers or during different training phases. Some layers might keep full attention for specialized tasks (like supervised fine-tuning) while most layers use MoBA to cut costs.
📌 This fits into a larger Transformer stack by replacing standard attention calls. The gating ensures each query focuses on a manageable subset of blocks. Causality is handled by filtering out blocks in the future and by applying local masks within the query’s current block.
📌 The figure from the paper shows queries being routed to only a few “expert” blocks of keys/values instead of the entire sequence. The gating mechanism assigns each query to the most relevant blocks, which cuts attention computations from quadratic to sub-quadratic.
📌 The gating mechanism computes a relevance score between each query and a condensed representation of each block. It then picks the top‑k blocks for every query, regardless of how far away those blocks are in the sequence.
Because each query only processes a few blocks, the computation remains sub‑quadratic, yet the model can still jump to distant tokens if the gating scores indicate high relevance.
🧵2/n
A PyTorch implementation sketch below.
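Here's a simplified, single-head version of the idea. It builds an explicit token-level mask and runs one ordinary masked softmax, whereas the paper's reference implementation (described just after the code) gathers the selected key/value blocks and runs FlashAttention with an online-softmax merge, so treat this as illustrative rather than the official code:

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=64, top_k=3):
    """Simplified single-head MoBA-style sparse attention.

    q, k, v: (seq_len, d), with seq_len divisible by block_size.
    """
    n, d = q.shape
    num_blocks = n // block_size
    pos = torch.arange(n, device=q.device)
    blk = pos // block_size                                          # block id of each position

    # 1) Gating: mean-pool the keys in each block, score every block against every query
    block_repr = k.view(num_blocks, block_size, d).mean(dim=1)       # (B, d)
    scores = q @ block_repr.T                                        # (n, B)

    # 2) Block-level causality: never route to a future block; always keep the query's own block
    future = torch.arange(num_blocks, device=q.device)[None, :] > blk[:, None]
    scores = scores.masked_fill(future, float("-inf"))
    scores = scores.masked_fill(F.one_hot(blk, num_blocks).bool(), float("inf"))

    # 3) Top-k block selection per query (dropping any future blocks that slipped in)
    top = scores.topk(min(top_k, num_blocks), dim=-1).indices        # (n, k)
    block_sel = F.one_hot(top, num_blocks).bool().any(dim=1) & ~future   # (n, B)

    # 4) Expand to a token-level mask and add the causal mask inside the current block
    token_sel = block_sel[:, blk]                                    # (n, n): may query i see key j's block?
    attn_mask = token_sel & (pos[None, :] <= pos[:, None])           # no attending to future tokens

    # 5) Ordinary masked attention over the allowed positions only
    logits = (q @ k.T) / d ** 0.5
    logits = logits.masked_fill(~attn_mask, float("-inf"))
    return F.softmax(logits, dim=-1) @ v

# Example: 512 tokens, 64-token blocks, each query attends to at most 3 blocks of keys/values
q, k, v = (torch.randn(512, 64) for _ in range(3))
print(moba_attention(q, k, v).shape)   # torch.Size([512, 64])
```

Because MoBA keeps the same parameters and interface as standard attention, swapping a call like this in or out per layer is how the method mixes sparse and full attention across the stack.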
The paper's reference pseudocode splits the keys and values into blocks, computes a mean-pooled representation of each block, and calculates gating scores (S) by multiplying Q with that pooled representation.
📌 It then applies a causal mask so queries cannot attend to future blocks, uses a top‑k operator to pick the most relevant blocks for each query, and organizes the data for efficient attention computation.
📌 FlashAttention is applied separately to the self-attention block (current positions) and the MoBA-selected blocks, and the outputs are finally merged using an online softmax.
📌 The result is a sparse attention mechanism that preserves causal structure and captures long-range dependencies without incurring the full quadratic cost of standard attention.
This code combines mixture-of-experts logic with sparse attention so each query only attends to a few blocks.
The gating mechanism scores each block against the query and selects the top‑k “experts,” reducing the number of key/value comparisons.
This keeps attention overhead sub‑quadratic, making it feasible to handle extremely long inputs without blowing up in compute or memory.
At the same time, the gating ensures queries can still attend to distant tokens when necessary, preserving the Transformer’s capacity for global context.
This block‑and‑gating strategy is how MoBA achieves near‑infinite context in LLMs.
🧵3/n
Experimental Observations
Models using MoBA show language modeling losses and downstream task performance nearly matching full attention. Results stay consistent even at context lengths of hundreds of thousands or millions of tokens. Experiments with “trailing token” evaluations confirm that important faraway dependencies can still be captured when queries identify relevant blocks.
Scalability tests indicate a sub-quadratic cost curve. Researchers report up to six-fold speedups at one million tokens and even larger gains beyond that range.
MoBA remains memory-friendly by avoiding a full attention matrix and by using standard GPU kernels for block-based computations.
NVIDIA + Arc Institute's new model Evo 2 just demonstrated that deep learning can directly model biological function.
It stands as a breakthrough in computational biology.
🧵 1/n
Evo 2 just redefined genomic modeling by processing over 9 trillion nucleotides to seamlessly connect molecular detail with genome-scale structure.
What's more, the entire model, training code, inference code, and curated OpenGenome2 dataset are released under open terms to accelerate progress in AI-driven genomics.
--------
Genome engineering efforts need a general-purpose model that can capture molecular, cellular, and organism-level features from DNA alone. This project addresses that gap by creating Evo 2, a foundation model trained on over 9 trillion DNA bases, covering bacteria, archaea, eukaryotes, and phage.
Its capacity for a 1-million token context window ensures that both local motifs and long-range dependencies are captured in a single pass. This design allows Evo 2 to model everything from single-nucleotide mutations to whole-genome architecture without task-specific tuning.
It learns diverse genetic patterns without labels or alignments, working at scales from small coding regions to entire genomes.
--------
What's the key benefit for us?
It means that Evo 2 automatically detects key genetic signals and accurately predicts how various mutations impact molecular and organismal function.
The model's breakthroughs can lead to better disease diagnosis, more effective treatments, and improved agricultural or environmental solutions.
🧵 2/n
📌 Model Architecture and Training Pipeline
StripedHyena 2 forms the core of Evo 2. It is a multi-hybrid convolutional architecture, mixing short, medium, and long input-dependent convolution layers with attention blocks.
This design handles sequences of up to 1 million tokens.
Training proceeded in two stages: a pretraining phase (8,192-token context) followed by midtraining that progressively extended context length (up to 1M tokens).
Data weighting placed extra emphasis on functionally dense regions (genic windows) before switching to full-genome segments.
🧵 3/n
📌 Zero-Shot Mutation Effect Predictions
Evo 2 captures fundamental coding and regulatory features. It recognizes start/stop codons and triplet periodicity and discerns functional disruptions caused by point mutations, frameshifts, or stop-codon insertions.
Prokaryotic and eukaryotic fitness screens confirmed that Evo 2’s likelihood scores correlate well with experimentally measured mutational effects in proteins, RNA molecules, and entire organisms. It also surpasses older genome-scale models on multiple regulatory variant benchmarks.
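One way to picture the zero-shot scoring: compare the model's log-likelihood of the reference sequence with that of the mutated sequence. The sketch below assumes a generic log_likelihood(sequence) scoring function as a hypothetical stand-in for whatever interface the released Evo 2 code exposes:

```python
from typing import Callable

def mutation_effect_score(
    ref_seq: str,
    position: int,                           # 0-based index of the mutated base
    alt_base: str,                           # e.g. "A", "C", "G", or "T"
    log_likelihood: Callable[[str], float],  # hypothetical: model log-likelihood of a DNA sequence
) -> float:
    """Delta log-likelihood of the variant vs. the reference.

    Values near 0 mean the variant looks as 'natural' as the reference to the model;
    strongly negative values suggest a disruptive mutation.
    """
    alt_seq = ref_seq[:position] + alt_base + ref_seq[position + 1:]
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)
```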