Sep 12 • 6 tweets • 2 min read
Congrats to @CreaoAI for hitting #1 on Product Hunt (Sept 11) 🚀
Just used it myself, and it was quite a smooth experience.
CREAO is an AI Agent that builds full-stack mini-SaaS from one sentence.
One sentence in → frontend + backend + data layer out.
They are building a platform to provide the critical interface for people to build apps where humans and AI agents can collaborate seamlessly.
So its entire infrastructure is engineered with an "AI-native first" philosophy.
🧵1/n.
🧵2/n. ⚡ All-in-one build.
CREAO gave me a deployable product — frontend, backend, database together.
#1 on Product Hunt (Sept 11).
Sep 11 • 5 tweets • 3 min read
🇨🇳 China unveils the world's first brain-like AI model, SpikingBrain1.0.
Up to 100X faster while being trained on less than 2% of the data typically required.
Designed to mimic human brain functionality, uses much less energy. A new paradigm in efficiency and hardware independence.
It marks a significant shift from current AI architectures.
Unlike models such as GPT and LLaMA, which use attention mechanisms to process all input in parallel, SpikingBrain1.0 employs localized attention, focusing only on the most relevant recent context.
Potential Applications:
- Real-time, low-power environments
- Autonomous drones and edge computing
- Wearable devices requiring efficient processing
- Scenarios where energy consumption is critical
This project is part of a larger scientific pursuit of neuromorphic computing, which aims to replicate the remarkable efficiency of the human brain, which operates on only about 20 watts of power.
---
arxiv.org/abs/2509.05276
🧠 The core idea: human-brain-inspired linear or hybrid-linear LLMs underpin the SpikingBrain architecture.
- SpikingBrain replaces most quadratic attention with linear and local attention, mixes in selective full attention where it matters, and adds an adaptive spiking activation so the model computes only on meaningful events.
- It proves the whole recipe works at scale by training and serving on MetaX C550 GPUs, which are non‑NVIDIA devices, without giving up quality on common benchmarks.
- The headline efficiencies come from 3 levers working together, linear attention for compressed memory, MoE for token-wise sparsity, and spiking for micro-level sparsity.
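To make the "compressed memory" lever concrete, here is a minimal sketch of a causal linear-attention recurrence in PyTorch. The elu+1 feature map and the normalization are illustrative choices, not SpikingBrain's published formulation; the point is that the state stays a fixed size no matter how long the sequence gets.

```python
import torch

def linear_attention(q, k, v):
    """Causal linear attention as a running-state recurrence: O(n) time, fixed-size memory.
    Shapes: q, k, v are (seq_len, dim). Feature map elu+1 is one common choice (assumed here)."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    q, k = phi(q), phi(k)
    d = q.shape[-1]
    state = torch.zeros(d, v.shape[-1])   # compressed "memory": size is independent of seq_len
    norm = torch.zeros(d)
    outputs = []
    for t in range(q.shape[0]):
        state += torch.outer(k[t], v[t])            # accumulate a key-value summary
        norm += k[t]
        out = q[t] @ state / (q[t] @ norm + 1e-6)   # read the summary with the current query
        outputs.append(out)
    return torch.stack(outputs)
```

Unlike standard attention, nothing here grows with context length, which is why the KV cache pressure disappears.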
Sep 11 • 4 tweets • 3 min read
Fantastic paper from ByteDance 👏
Shows how to train LLM agents to finish long, multi step tasks by letting them act in real environments with reinforcement learning.
Across 27 tasks, the trained agents rival or beat top proprietary models.
Most agents are trained on single turn data, so they fail when a job needs many decisions with noisy feedback.
AgentGym-RL splits the system into separate parts: the environments, the agent loop, and training, so each can improve on its own.
It supports mainstream algorithms and realistic tasks, and the agent learns by acting, seeing results, and adjusting across different settings.
The key method, ScalingInter-RL, starts with short interactions to master basics, then slowly allows longer runs so the agent can explore and plan.
This staged horizon schedule stabilizes learning, prevents pointless loops, and encourages planning, reflection, and recovery after mistakes.
A 7B model trained with this setup matches or beats much larger open models and competes well with strong commercial ones.
They also find that putting more compute into training and test time interaction, like more steps or samples, often helps more than adding parameters.
How the AgentGym-RL framework works.
At the center is the LLM agent. It takes an instruction, interacts with an environment for several turns, and then produces actions. Each action changes the environment, and the environment sends feedback back to the agent. This cycle repeats many times.
The environment itself is handled by a server that can simulate different types of tasks. These include web browsing, searching, coding, playing games, doing science tasks, or controlling embodied agents. The environment client manages the interaction and communicates through standard protocols.
Every full cycle of actions and observations is called a trajectory. These trajectories are collected and then used to update the agent’s policy with reinforcement learning algorithms like PPO, GRPO, RLOO, or REINFORCE++.
The framework is modular. The environment, the agent, and the training part are separated. This makes it flexible, easy to extend, and suitable for many types of realistic tasks.
The diagram highlights how the agent learns not by memorizing answers, but by trying actions, getting feedback, and improving its decision making across different domains.
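A minimal sketch of what a staged-horizon training loop in that spirit could look like. The agent and environment interfaces (`act`, `step`, `update`) are placeholders, and the schedule values are illustrative rather than AgentGym-RL's actual ScalingInter-RL settings.

```python
def horizon_schedule(stage: int, base: int = 4, step: int = 4, cap: int = 32) -> int:
    """Short interaction horizons first, longer ones as training progresses (illustrative values)."""
    return min(base + stage * step, cap)

def train(agent, env, stages: int = 5, episodes_per_stage: int = 100):
    for stage in range(stages):
        max_turns = horizon_schedule(stage)
        for _ in range(episodes_per_stage):
            obs, trajectory = env.reset(), []
            for _ in range(max_turns):                  # interaction capped by the current horizon
                action = agent.act(obs)
                obs, reward, done = env.step(action)    # hypothetical environment API
                trajectory.append((obs, action, reward))
                if done:
                    break
            agent.update(trajectory)                    # PPO/GRPO/RLOO-style update on the trajectory
```

The schedule is the whole trick: early stages cannot spiral into long pointless loops, and later stages have room to plan and recover.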
Sep 9 • 7 tweets • 4 min read
📢 Another brilliant piece of research just dropped from @GoogleResearch: a major advance toward systematically generating expert-level scientific software automatically.
An LLM plus tree search turns scientific coding into a score driven search engine.
This work builds an LLM + Tree Search loop that writes and improves scientific code by chasing a single measurable score for each task.
The key idea is to treat coding for scientific tasks as a scorable search problem.
That means every candidate program can be judged by a simple numeric score, like how well it predicts, forecasts, or integrates data. Once you have a clear score, you can let an LLM rewrite code again and again, run the code in a sandbox, and use tree search to keep the best branches while discarding weaker ones.
With compact research ideas injected into the prompt, the system reaches expert level and beats strong baselines across biology, epidemiology, geospatial, neuroscience, time series, and numerical methods.
Training speed: less than 2 hours on 1 T4 vs 36 hours on 16 A100s.
In bioinformatics, it came up with 40 new approaches for single-cell data analysis that beat the best human-designed methods on a public benchmark.
In epidemiology, it built 14 models that set state-of-the-art results for predicting COVID-19 hospitalizations.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
Empirical software is code built to maximize a quality score on observed data, and any task that fits this framing becomes a scorable task.
This view turns software creation into a measurable search problem, because every candidate program is judged by the same numeric target.
This framing also explains why the method can travel across domains, since only the scoring function changes.
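As a rough sketch of that loop (not the paper's exact algorithm): `llm_rewrite` and `run_and_score` are assumed hooks for an LLM call and a sandboxed evaluation, and the beam-style pruning stands in for whatever tree-search policy the system actually uses.

```python
import heapq

def tree_search_codegen(llm_rewrite, run_and_score, seed_program, budget=200, beam=8):
    """Score-driven search over candidate programs (illustrative sketch).
    llm_rewrite(program) -> new candidate program (string)
    run_and_score(program) -> float, higher is better; caller runs it in a sandbox."""
    frontier = [(-run_and_score(seed_program), seed_program)]   # max-heap via negated scores
    best_score, best_prog = -frontier[0][0], seed_program
    for _ in range(budget):
        neg_score, program = heapq.heappop(frontier)            # expand the most promising node
        child = llm_rewrite(program)
        child_score = run_and_score(child)
        if child_score > best_score:
            best_score, best_prog = child_score, child
        heapq.heappush(frontier, (neg_score, program))          # parent stays expandable
        heapq.heappush(frontier, (-child_score, child))
        frontier = heapq.nsmallest(beam, frontier)              # keep only the best branches
        heapq.heapify(frontier)
    return best_prog, best_score
```

Swapping the scoring function is all it takes to point the same loop at biology, epidemiology, or time series.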
Sep 9 • 9 tweets • 3 min read
Fei-Fei Li (@drfeifei) on limitations of LLMs.
"There's no language out there in nature. You don't go out in nature and there's words written in the sky for you.. There is a 3D world that follows laws of physics."
Language is purely generated signal.
AI models trained on linguistic signals fail when the task requires embodied physical common sense in a world with real constraints.
Sep 8 • 10 tweets • 5 min read
BRILLIANT paper.
LLMs get stuck when they think too long in a single line, early tokens steer them into a narrow path and they rarely recover, which the authors call Tunnel Vision.
ParaThinker trains native parallel thinking, it spins up multiple distinct reasoning paths at once and then fuses them into 1 answer, which lifts accuracy a lot with tiny latency cost.
Sensational fact, if you only keep 1 thing: 12.3% average gain for 1.5B, 7.5% for 7B, with only 7.1% extra latency.
ParaThinker shows that training LLMs to think in parallel paths instead of just longer single chains avoids tunnel vision, giving up to 12.3% accuracy gains with only 7.1% extra latency, letting smaller models beat much larger ones.
🧵 Read on 👇
🧵2/n. 🧩 Why longer thinking stalls
When the model makes a mistake early on, it keeps building on that mistake.
The longer it goes down that wrong path, the less chance it has to recover.
This stuck behavior is what the authors call Tunnel Vision, and it explains why just letting the model think longer doesn’t always improve accuracy.
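As a mental model only, the sketch below shows the shape of parallel-path inference with hypothetical `reason` and `fuse` calls; ParaThinker itself trains this behavior natively with special tokens rather than prompting a stock model this way.

```python
from concurrent.futures import ThreadPoolExecutor

def parathinker_style_answer(model, question: str, n_paths: int = 4) -> str:
    """Sketch of parallel-path reasoning: sample several independent chains, then fuse them.
    `model.reason` and `model.fuse` are placeholder interfaces, not a real API."""
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        futures = [pool.submit(model.reason, question, seed=i) for i in range(n_paths)]
        paths = [f.result() for f in futures]    # distinct reasoning paths, no shared prefix to tunnel on
    return model.fuse(question, paths)           # one final pass merges them into a single answer
```

Because the paths run concurrently, wall-clock latency stays close to a single path, which is where the small 7.1% overhead comes from.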
Sep 8 • 4 tweets • 3 min read
Another great @GoogleDeepMind paper.
Shows how to speed up LLM agents while cutting cost and keeping answers unchanged.
30% lower total cost and 60% less wasted cost at comparable acceleration.
Agents plan step by step, so each call waits for the previous one, which drags latency.
Speculative planning fixes that by having a cheap draft agent guess next steps while a stronger agent checks them in parallel.
Fixed guess lengths backfire, small guesses barely help, big guesses waste tokens when a check disagrees.
Dynamic Speculative Planning learns how far to guess, then stops early to avoid wasted calls.
A tiny online predictor learns how many steps will be right using reinforcement learning.
1 knob lets teams bias for speed or cost, either by skewing training or adding a small offset.
If a guess is wrong, extra threads stop and execution resumes from the verified step.
Across OpenAGI and TravelPlanner, the dynamic policy matches the fastest fixed policy while spending fewer tokens.
The result is clear, faster responses, lower bills, and 0 loss in task quality.
How Dynamic Speculative Planning manages when and how far to guess ahead during an agent's planning.
The top line called Predictor decides how many future steps to guess, marked by k. For example, k=2 means guess 2 steps ahead, while k=3 means guess 3 steps ahead. These guesses are carried out by a lighter agent called Approximation, and then checked in parallel by a stronger agent called Target.
If the guesses match the stronger agent, they are confirmed and execution continues. If they don’t match, shown with an X, all ongoing speculative threads are canceled, and the system resumes from the last correct step. This prevents wasted work from wrong guesses.
At the same time, an online Trainer collects data about each state and the chosen k. This data is then used to update the Predictor so it learns better over time without slowing down the agent. In other words, the system keeps improving its ability to guess how far it can safely look ahead.
So overall, the figure captures this cycle: make a guess, verify, cancel if wrong, and then use that experience to improve the predictor for the next run.
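A compressed sketch of that cycle, with placeholder predictor and agent interfaces; the real system runs the target agent's checks in parallel threads rather than this serial loop.

```python
def dynamic_speculative_plan(predictor, draft_agent, target_agent, state, max_turns=30):
    """Speculate k steps cheaply, verify with the strong agent, cancel on mismatch, train the predictor."""
    plan = []
    while len(plan) < max_turns and not target_agent.is_done(state):
        k = predictor.guess_k(state)                     # learned: how far it is safe to look ahead
        drafts = draft_agent.propose(state, n_steps=k)   # cheap approximation agent guesses k steps
        n_correct = 0
        for guess in drafts:
            checked = target_agent.next_step(state)      # strong target agent produces the real step
            state = target_agent.apply(state, checked)   # always advance with the verified step
            plan.append(checked)
            if checked != guess:                         # mismatch: cancel the remaining speculation
                break
            n_correct += 1
        predictor.record(state, k, n_correct)            # online data that trains the predictor
    return plan
```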
Sep 6 • 13 tweets • 8 min read
OpenAI released a new paper.
"Why language models hallucinate"
Simple answer: LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty.
The paper puts this on a statistical footing with simple, test-like incentives that reward confident wrong answers over honest “I don’t know” responses.
The fix is to grade differently, give credit for appropriate uncertainty and penalize confident errors more than abstentions, so models stop being optimized for blind guessing.
OpenAI is showing that 52% abstention gives substantially fewer wrong answers than 1% abstention, proving that letting a model admit uncertainty reduces hallucinations even if accuracy looks lower.
Abstention means the model refuses to answer when it is unsure and simply says something like “I don’t know” instead of making up a guess.
Hallucinations drop because most wrong answers come from bad guesses. If the model abstains instead of guessing, it produces fewer false answers.
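The grading change fits in a few lines. A sketch under the paper's framing, with the threshold and exact penalty as assumptions: score correct answers +1, abstentions 0, and wrong answers -t/(1-t), so guessing only pays off when the model's chance of being right exceeds t.

```python
def grade(answer: str, truth: str, confidence_threshold: float = 0.75) -> float:
    """Abstention-aware grading sketch: credit correct answers, give 0 for 'I don't know',
    and penalize confident wrong answers enough that blind guessing has negative expected value."""
    t = confidence_threshold
    if answer.strip().lower() in {"i don't know", "idk"}:
        return 0.0                   # abstention: no credit, no penalty
    if answer == truth:
        return 1.0                   # correct answer
    return -t / (1.0 - t)            # with t=0.75 a wrong guess costs 3 points, so guess only if P(correct) > 0.75
```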
🧵 Read on 👇
🧵2/n. This figure is showing the idea of Is-It-Valid.
On the left side, you see examples. Some are valid outputs (in black), and others are errors (in red). Valid examples are simple and correct statements like “There are 2 D’s in LADDER” or “I don’t know Zdan’s birthday.” Error examples are things that look fluent but are wrong, like “There are 3 L’s in SPELL” or giving random birthdays.
The diagrams on the right show why errors happen differently depending on the task. For spelling, the model can learn clear rules, so valid and invalid answers separate cleanly. For counting, the model is weaker, so valid and invalid mix more. For birthdays, there is no real pattern in the data at all, so the model cannot separate correct from incorrect—this is why hallucinations occur on such facts.
So the figure proves: when there is a clear pattern (like spelling), the model learns it well. When the task has weak or no pattern (like birthdays), the model produces confident but wrong answers, which are hallucinations.
Sep 4 • 11 tweets • 4 min read
AWS is betting heavily on its custom Trainium chips, with Anthropic as the anchor customer, to regain momentum in the AI cloud race.
~ A solid SemiAnalysis report.
AWS is building multi-gigawatt data centers packed with Trainium2 hardware, designed to give a better cost per unit of memory bandwidth compared to Nvidia GPUs.
And this memory-vs-compute tradeoff has become super important, because much advanced AI work, especially reinforcement learning and reasoning-heavy training, is less about raw compute and more about how quickly and cheaply memory can be moved.
🧩 Anthropic has become AWS’s anchor customer for AI capacity.
Anthropic, which has grown revenue to $5B annualized in 2025, is deeply tied into this effort, even co-designing features of Trainium to match its roadmap. That makes Trainium increasingly look like semi-custom silicon tuned for Anthropic’s workloads.
Azure’s surge shows why an anchor matters, since OpenAI’s ~$10B cloud spend lives there today.
"Trainium2 is converging toward an Anthropic custom-silicon program. This will enable Anthropic to be, alongside Google DeepMind, the only AI labs benefiting from tight hardware–software co-design in the near horizon."
🧵 Read on 👇
🧵2/n. 🏗️ AWS is finishing 3 campuses with over 1.3GW of IT capacity focused on Anthropic’s training runs.
SemiAnalysis expects these clusters to lift AWS growth above 20% YoY as they enter service.
Sep 2 • 8 tweets • 3 min read
🇨🇳 China's Tencent open-sources translation model beats Google, OpenAI in top global AI competition
Hunyuan-MT-7B came first in 30 out of the 31 tests in a general machine-translation competition held as part of the upcoming WMT25 conference.
Supports 33 languages, available on @huggingface
commercial use allowed.
Hunyuan-MT-7B’s strength is that it uses a small number of parameters to deliver results that measure up to or even surpass much larger models.
Tencent said its Hunyuan translation model had been employed across a range of in-house products, such as the Zoom-like Tencent Meeting, a web browser and the enterprise version of the WeChat messaging app.
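If you want to try it, a minimal Hugging Face `transformers` sketch follows; the repo id `tencent/Hunyuan-MT-7B` and the plain-text prompt format are assumptions, so check the model card for the exact id and chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-MT-7B"   # assumed repo id; verify against the actual Hugging Face listing
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

prompt = "Translate the following text from English to Chinese:\n\nSpeed always wins."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```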
🧵 Read on 👇
🧵2/n. English-language pairs tested in the competition included Arabic, Estonian and Maasai, which is spoken by 1.5 million people living in southern Kenya and northern Tanzania.
Other language pairs included Czech-Ukrainian and Japanese-Simplified Chinese. The only English-language pair Hunyuan did not ace was Bhojpuri, a language spoken by around 50.5 million people in parts of northern India and Nepal.
Sep 1 • 15 tweets • 8 min read
Someone let ChatGPT run a stock portfolio.
Over 2 months, ChatGPT's portfolio is up +29.22% vs. the S&P 500's +4.11% over the same window.
(Prompts, Code, Github listed)
The process works as follows.
ChatGPT is given real market data each trading day, including prices, volumes and benchmarks, stored on GitHub.
On weekends it uses that data to research deeply, reevaluate the portfolio, and look for new stock ideas.
The portfolio is simulated daily based on any changes, and then the person manually executes those trades in a real brokerage account.
ChatGPT has full authority to make buy or sell decisions, but only within U.S. micro-cap stocks under $300M market cap.
github.com/LuckyOne7777/C…
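A rough sketch of what the daily step could look like in code; the `client.chat` call, prompt wording, and JSON schema are all illustrative, not the repo's actual implementation (see the GitHub link above for that).

```python
import json

def daily_update(client, portfolio: dict, market_data: dict) -> dict:
    """Hand the model current holdings plus today's prices, ask for trades in a strict format."""
    prompt = (
        "You manage a micro-cap portfolio (only US stocks under $300M market cap).\n"
        f"Current holdings: {json.dumps(portfolio)}\n"
        f"Today's market data: {json.dumps(market_data)}\n"
        'Reply with JSON only: {"trades": [{"ticker": str, "action": "buy|sell", "shares": int}]}'
    )
    reply = client.chat(prompt)      # hypothetical chat-completions style client
    return json.loads(reply)         # trades are then executed manually in a real brokerage account
```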
Aug 31 • 11 tweets • 6 min read
BRILLIANT @GoogleDeepMind research.
Even the best embeddings cannot represent all possible query-document combinations, which means some answers are mathematically impossible to recover.
Reveals a sharp truth: embedding models can only capture so many pairings, and beyond that, recall collapses no matter the data or tuning.
🧠 Key takeaway
Embeddings have a hard ceiling, set by dimension, on how many top‑k document combinations they can represent exactly.
They prove this with sign‑rank bounds, then show it empirically and with a simple natural‑language dataset where even strong models stay under 20% recall@100.
When queries force many combinations, single‑vector retrievers hit that ceiling, so other architectures are needed.
4096‑dim embeddings already break near 250M docs for top‑2 combinations, even in the best case.
🛠️ Practical Implications
For applications like search, recommendation, or retrieval-augmented generation, this means scaling up models or datasets alone will not fix recall gaps.
At large index sizes, even very high-dimensional embeddings fail to capture all combinations of relevant results.
So embeddings cannot work as the sole retrieval backbone. We will need hybrid setups, combining dense vectors with sparse methods, multi-vector models, or rerankers to patch the blind spots.
This shifts how we should design retrieval pipelines, treating embeddings as one useful tool but not a universal solution.
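One simple way to build such a hybrid is reciprocal rank fusion over a dense (embedding) ranking and a sparse (for example BM25) ranking. A minimal sketch; the constant k=60 is the commonly used default, not anything from this paper.

```python
def reciprocal_rank_fusion(dense_ranking, sparse_ranking, k: int = 60):
    """Fuse two rankings of doc ids (best-first). RRF rewards docs ranked well by either retriever."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = reciprocal_rank_fusion(dense_top100, bm25_top100)[:10], optionally reranked afterwards.
```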
🧵 Read on 👇
This figure explains LIMIT, a tiny natural-language dataset they built to test whether single-vector embeddings can represent all combinations of relevant documents for each query.
The left grid is the target relevance pattern, and the task is to rank exactly the k=2 correct documents for every query.
The right side shows the mapping into simple text, queries like “Who likes Quokkas?” paired with short bios such as “Jon Durben likes Quokkas and Apples,” so language complexity is not the challenge.
The key point, even with this simple setup, strong MTEB embedders stay under 20% recall@100, revealing a capacity limit of single-vector retrieval.
Aug 27 • 15 tweets • 5 min read
Google shares for the first time the TPUv7 details, at Hot Chips 2025 .
Super valuable insight that could not otherwise be easily gleaned.
Ironwood is said to offer 2x the perf-per-watt of Google’s previous generation TPU, Trillium.
With up to 9,216 chips in a node, Ironwood can scale up to a MASSIVE 42.5 Exaflops in performance.
Though with 10MW of power consumption, that performance doesn’t come cheap.
But, like all of Google’s TPUs, this is solely for Google’s use as part of their Google Cloud services, so Ironwood is not available to look at outside of Google.
🧵 Read on 👇
🧵2/n. Ironwood TPU comes with several innovations.
The big one is how big the SuperPods can go. Now up to 9,216 chips, thanks to the use of optical circuit switches (OCS) to share memory throughout the pod. There’s 1.77 PB of directly addressable HBM altogether.
This generation also brings a focus on RAS features in order to have reliable systems.
Power efficiency also gets a boost, of course. Google is claiming a 2x perf-per-watt improvement – though it’s unclear if this is at iso-datatype.
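A quick back-of-the-envelope check on the figures quoted in this thread; the per-chip numbers below are derived, not Google-stated, and the exaflops figure's datatype is unspecified.

```python
# Sanity-check arithmetic using only the numbers quoted above.
chips     = 9_216
pod_flops = 42.5e18          # 42.5 Exaflops per SuperPod
pod_power = 10e6             # 10 MW
pod_hbm   = 1.77e15          # 1.77 PB of directly addressable HBM

print(f"per-chip compute : {pod_flops / chips / 1e15:.2f} PFLOPs")              # ~4.6 PFLOPs per chip
print(f"per-chip HBM     : {pod_hbm / chips / 1e9:.0f} GB")                     # ~192 GB per chip
print(f"pod efficiency   : {pod_flops / pod_power / 1e12:.2f} TFLOPs per watt") # ~4.25
```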
Aug 27 • 10 tweets • 5 min read
"The Impact of Artificial Intelligence on Human Thought"
A big 132 page report.
AI is shifting real thinking work onto external systems, which boosts convenience but can weaken the effort that builds understanding and judgment,
a pattern the paper frames through cognitive offloading and cognitive load theory, then tracks into social effects like standardized language, biased information flows, and manipulation tactics that target human psychology.
It says use AI to cut noise and routine steps, keep humans doing the heavy mental lifting, and add controls because personalization, deepfakes, and opaque models can steer choices at scale.
🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts
Cognitive load theory says working memory is limited, so AI helps when it reduces extraneous load and hurts when it replaces the germane load needed to build skill.
In plain terms, let tools clean up the interface and fetch data, but keep people doing the analysis, explanation, and sense‑making.
Aug 26 • 10 tweets • 5 min read
💼 Finally a solid 57-page report on AI's effect on job-market from Stanford University.
THE SHIFT HAS STARTED.
Entry‑level workers in the most AI‑exposed jobs are seeing clear employment drops, while older peers and less‑exposed roles keep growing.
Though overall employment continues to grow, employment growth for young workers in particular has been stagnant.
The drop shows up mainly as fewer hires and headcount, not lower pay, and it is sharpest where AI usage looks like automation rather than collaboration.
22‑25 year olds in the most exposed jobs show a 13% relative employment decline after controls.
⚙️ The paper tracked millions of workers and boils recent AI labor effects into 6 concrete facts
The headline being entry‑level contraction in AI‑exposed occupations and muted wage movement.
AI is replacing the codified knowledge that juniors supply, rather than the tacit knowledge that seniors accumulate.
🧵 Read on 👇
🧵2/n. 📊 The Data
The study uses administrative payroll records from ADP, which handles pay for over 25M workers, letting the authors observe monthly headcount and base salary with high granularity.
They build a balanced panel of firms present from 2021‑01 to 2025‑07, restrict to ages 18‑70 with recorded titles mapped to Standard Occupational Classification codes, and end up with 3.5M–5M workers per month in the main sample.
Aug 24 • 11 tweets • 6 min read
MASSIVE claim in this paper 🫡
The top-most Universities from US, UK, EU, China, Canada, Singapore, Australia collaborated.
It could completely change research-paper writing.
They demonstrate that AI can already draft proposals, run experiments, and write papers.
The authors built aiXiv, a new open-access platform where AI and humans can submit, review, and revise research in a closed-loop system.
The system uses multiple AI reviewers, retrieval-augmented feedback, and defenses against prompt injection to ensure that papers actually improve after review.
And the process worked: AI-generated proposals and papers get much better after iterative review, with acceptance rates jumping from near 0% to 45% for proposals and from 10% to 70% for papers.
🧵 Read on 👇
🧵2/n. Across real experiments it hits 77% proposal ranking accuracy, 81% paper ranking accuracy, blocks prompt‑injection with up to 87.9% accuracy, and pushes post‑revision acceptance for papers from 10% to 70%.
81% paper accuracy, 87.9% injection detection, papers 10%→70% after revision.
Aug 23 • 24 tweets • 9 min read
This is the original MIT report that said 95% of AI pilots fail and spooked investors across the US stock market.
The report says most companies are stuck because 95% of GenAI pilots produce zero ROI, while a small 5% win by using systems that learn, plug into real workflows, and improve with use.
Teams keep buying or building static tools that demo well but cannot remember context, adapt, or fit daily operations, and this report maps exactly how the few winners do it differently.
🧪 How they ran the study
They combined a review of 300+ public implementations with 52 structured interviews and 153 senior‑leader surveys across January to June 2025, which gives the patterns below real footing.
🧵 Read on 👇
The big split they call the GenAI Divide is simple, 95% of organizations get nothing from GenAI pilots while a tiny 5% extract millions, and the driver is not the model itself but whether the system can learn, remember, and fit the workflow.
Aug 23 • 8 tweets • 4 min read
Another paper claiming really BIG result.
The First method to achieve 99.9% on AIME 2025 with open-source models! 🤯
DeepConf uses a model's own token confidence to keep only its strongest reasoning; with GPT-OSS-120B it cuts tokens by up to 84.7% compared to standard parallel thinking.
Most systems still lean on self-consistency with majority voting, which lifts accuracy but hits diminishing returns and burns a lot of tokens.
🧠 The key idea
DeepConf is a test-time method that scores the model’s reasoning locally for confidence, filters weak traces, and often improves accuracy with fewer tokens without any extra training or tuning.
🧱 Why majority voting hits a wall
Parallel thinking samples many chains and votes, accuracy grows slowly as samples rise so compute scales linearly and the benefit flattens, which is exactly the pain DeepConf targets.
🔎 The confidence signals
Token confidence is the negative mean log probability of the top k candidates at each step, which gives a direct signal of how sure the model is at that moment.
Group confidence averages token confidence over a sliding window so local dips are visible without noise from the whole trace.
Tail confidence averages the last chunk of tokens because the ending steps decide the final answer and are where good traces often slip.
Bottom 10% group confidence looks at the worst parts of a trace, which is a strong indicator that the overall reasoning is shaky.
Lowest group confidence picks the single weakest window along a trace, which turns out to be a clean gate for dropping that trace early.
✅ Bottom line
DeepConf is a plug-in test-time compression recipe that filters or halts weak reasoning in place, so teams get higher accuracy and a big token cut without retraining or new hyperparameters.
🧮 Offline mode, smarter voting
DeepConf ranks traces by a confidence score and does confidence-weighted majority voting after optionally keeping only the top 10% or the top 90% by confidence.
With 512 traces, GPT-OSS-120B reaches 99.9% on AIME 2025 using tail or lowest-group confidence with filtering, compared to 97.0% for plain voting and 91.8% for pass@1.
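A compact sketch of those confidence signals and the weighted vote; the window size and filtering thresholds here are illustrative rather than the paper's defaults.

```python
import numpy as np
from collections import defaultdict

def token_confidence(topk_logprobs: np.ndarray) -> np.ndarray:
    """Per-step confidence = negative mean log-prob of the top-k candidates. Shape: (steps, k)."""
    return -topk_logprobs.mean(axis=1)

def lowest_group_confidence(conf: np.ndarray, window: int = 128) -> float:
    """Weakest sliding-window average along a trace; a low value flags shaky reasoning."""
    if len(conf) <= window:
        return float(conf.mean())
    windows = np.convolve(conf, np.ones(window) / window, mode="valid")
    return float(windows.min())

def confidence_weighted_vote(traces):
    """traces: list of (final_answer, trace_confidence). Filter weak traces, then vote by confidence."""
    kept = sorted(traces, key=lambda t: t[1], reverse=True)[: max(1, len(traces) // 10)]  # e.g. top 10%
    votes = defaultdict(float)
    for answer, c in kept:
        votes[answer] += c
    return max(votes, key=votes.get)
```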
Aug 21 • 14 tweets • 4 min read
Really solid context engineering guide.
Directly From @AnthropicAI
In short, package stable context up front, give exact instructions and examples, restate the current ask, let the model reason, and demand a strict output format.
🧵 Read on 👇
🧵2/n Start with task context. Tell the model who it is, what domain it is in, and what outcome matters. In the demo, the first try misread the images as a skiing incident. Adding “you are assisting a Swedish car-insurance claims adjuster” fixed that because it anchored the model in the right world and goal.
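A minimal prompt template that follows that ordering; the section names and the JSON schema are placeholders, not Anthropic's exact wording.

```python
def build_prompt(stable_context: str, instructions: str, examples: str, current_ask: str) -> str:
    """Stable context up front, exact instructions and examples, restated ask, reasoning, strict output."""
    return "\n\n".join([
        f"# Role and background (stable, reusable across requests)\n{stable_context}",
        f"# Instructions\n{instructions}",
        f"# Examples\n{examples}",
        f"# Current request\n{current_ask}",
        "# Reasoning\nBefore answering, think step by step inside <thinking> tags.",
        '# Output format\nRespond only with JSON: {"decision": str, "rationale": str}',
    ])
```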
Aug 20 • 8 tweets • 5 min read
BRILLIANT Paper. 💡
A small Qwen2.5 model is fine-tuned to think over retrieved documents, so a single lean setup can answer domain questions on resource-constrained local hardware.
Using summarised NHS pages, retrieval hits the right condition among top‑5 in 76% of queries, and the fine‑tuned model predicts the exact condition correctly 56% of the time, close to larger frontier models.
The whole pipeline is built for private deployments, so teams can run it without sending data to external APIs.
🔒 The problem they tackle
Many teams cannot ship prompts or data outside their network, especially in health and government, so cloud LLM endpoints are off the table.
They aim for a single lean model that can read retrieved evidence and reason over it, all running locally, so answers stay grounded and private.
The target setting is messy queries over a closed corpus, where retrieval constrains facts and the reasoning step interprets symptoms and next actions.
🧩 The pipeline in this paper.
The system indexes a corpus, retrieves the most relevant pieces for each query, then generates an answer that reasons over those pieces.
They use a classic retriever plus generator design, with retrieval first then reasoning, which fits decision tasks better than free‑form answering.
The chat flow lets a conversational agent decide when to call retrieval, then passes the retrieved context to the reasoning model to produce the answer.
🧵 Read on 👇
🧲 The retriever at work
Documents are split into overlapping chunks and embedded with a sentence transformer, then stored in a vector database for fast similarity search.
They use sentence-transformers all‑mpnet‑base‑v2, which maps text into a 768‑dimensional space with a max sequence of 384 tokens, and a Chroma store with L2 similarity.
If any chunk from a document makes the top‑k, the pipeline feeds the full original document to the LLM, so the model sees full context around the hit.
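A minimal sketch of that retriever with `sentence-transformers` and Chroma; the character-based chunk sizes are guesses, and the paper's exact chunking and top-k plumbing may differ.

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")   # 768-dim embeddings
client = chromadb.Client()
col = client.create_collection("nhs_pages", metadata={"hnsw:space": "l2"})  # L2 similarity

def index(docs: dict[str, str], chunk_size: int = 1200, overlap: int = 200):
    """Split each document into overlapping chunks, embed them, and store the parent doc id."""
    for doc_id, text in docs.items():
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]
        col.add(ids=[f"{doc_id}-{j}" for j in range(len(chunks))],
                documents=chunks,
                embeddings=embedder.encode(chunks).tolist(),
                metadatas=[{"parent": doc_id}] * len(chunks))

def retrieve(query: str, k: int = 5) -> set[str]:
    """Return parent doc ids whose chunks hit the top-k, so the LLM gets the full documents."""
    hits = col.query(query_embeddings=[embedder.encode(query).tolist()], n_results=k)
    return {m["parent"] for m in hits["metadatas"][0]}
```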
Aug 15 • 6 tweets • 4 min read
Speed Always Wins.
Absolutely beautiful and exhaustive 82-page survey paper on Efficient Architectures for Large Language Models.
Maps the ways to make LLMs cheaper, longer context, and near real time.
Transformers compare every token with every other token, so if text is 2x longer, the work is about 4x. That burns memory because past keys and values are stored for every attention head, and it drags latency during long chats or reasoning loops.
The survey groups fixes into 4 buckets. Linear sequence models redo the math so cost grows with length, not length squared.
They include linear attention, recurrent networks that carry a small state, and state space models like Mamba, which track history with a running summary, so no big cache.
Sparse attention keeps the Transformer idea but only connects important pairs. Most tokens look locally, a few tokens act as global anchors, and some methods route tokens to the right places. You get large savings without throwing away core behavior.
Efficient full attention keeps exact attention but makes it hardware friendly. Input output aware kernels such as FlashAttention cut reads and writes, and multi-query or grouped-query attention lets many heads share 1 key-value set, cutting cache and bandwidth.
Sparse Mixture of Experts adds conditional compute. Only a few experts run per token, so capacity grows without paying full cost each step, and memory tricks compress, quantize, or prune the cache to stretch context.
The theme is simple, move less data. Methods that cut memory traffic tend to win on modern GPUs, which enables longer context, faster training, and lower serving cost.
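To make the grouped-query idea concrete, here is a small PyTorch sketch of causal grouped-query attention; shapes and grouping are illustrative, and real implementations fuse this far more efficiently (FlashAttention-style kernels).

```python
import torch

def grouped_query_attention(q, k, v, n_groups: int):
    """GQA sketch: many query heads share one key/value head per group, shrinking the KV cache.
    q: (n_q_heads, seq, d); k, v: (n_groups, seq, d)."""
    n_q_heads = q.shape[0]
    heads_per_group = n_q_heads // n_groups
    outputs = []
    for h in range(n_q_heads):
        g = h // heads_per_group                       # which shared KV head this query head reads
        scores = (q[h] @ k[g].transpose(0, 1)) / q.shape[-1] ** 0.5
        causal = torch.triu(torch.ones_like(scores), 1).bool()
        scores = scores.masked_fill(causal, float("-inf"))
        outputs.append(torch.softmax(scores, dim=-1) @ v[g])
    return torch.stack(outputs)                        # cache holds n_groups KV sets, not n_q_heads
```

The payoff is exactly the "move less data" theme: the KV cache shrinks by n_q_heads / n_groups, so bandwidth and memory drop without changing exact attention.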
This figure is a roadmap of how to make LLMs faster and cheaper from input tokens to output tokens.
The center shows Efficient Sequence Modeling. One path makes sequence cost scale linearly using things like linear attention, linear recurrent networks, and state space models, plus test-time-training variants and unified linear sequence models.
Another path saves work by using sparse attention so the model only looks at the most useful token pairs.
A third path keeps full attention but makes it cheaper with input-output aware scheduling, grouped attention, mixtures of different attention types, and quantization.
Below that sits Sparse Mixture-of-Experts. The model grows capacity by keeping many experts but routes each token to only a few, so compute per token stays low. Different routing rules, expert designs, and conversion tricks live here.
To the right are Hybrid Architectures. These mix building blocks across layers or inside a layer to hit better speed and accuracy tradeoffs.
Next is Diffusion LLM. This family targets non-autoregressive generation so many tokens can be produced in parallel, with methods to connect back to standard autoregressive decoding and to extend into multimodal settings.
The final column highlights reach beyond text, showing where these efficiency ideas apply to vision, audio, and multimodal tasks.