Rohan Paul
Nov 5 · 3 tweets · 3 min read
🏗️ Hardware memory bandwidth is becoming the choke point slowing down GenAI.

During 2018–2022, transformer model size grew ~410× every 2 years, while memory per accelerator grew only about 2× every 2 years.

And that mismatch shoves us into a "Memory Wall".

The memory wall is creating challenges both in the datacenter and for edge AI applications.

In the datacenter, current solutions mostly amount to applying more hardware to the problem. That's why HBM capacity and bandwidth scaling, KV offload, and prefill-decode disaggregation are central to accelerator roadmaps.

Still, at the edge, quite frankly, there are no good solutions.

🚫 Bandwidth is now the bottleneck (not just capacity).

Even when you can somehow fit the weights, the chips can’t feed data fast enough from memory to the compute units.

Over the last ~20 years, peak compute rose ~60,000×, but DRAM bandwidth only ~100× and interconnect bandwidth ~30×. Result: the processor sits idle waiting for data—the classic “memory wall.”
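A roofline-style back-of-envelope makes that gap concrete. A minimal sketch in Python, using illustrative numbers roughly in the range of a current flagship accelerator, not any specific chip's spec:

```python
# Roofline back-of-envelope: below the "ridge point" (peak FLOPs divided
# by memory bandwidth) a kernel is memory-bound no matter how fast the
# ALUs are. Numbers are illustrative, not a spec for any particular chip.

peak_flops = 1000e12   # ~1000 TFLOP/s peak (e.g. dense 16-bit matmul)
hbm_bw     = 3.3e12    # ~3.3 TB/s HBM bandwidth

ridge = peak_flops / hbm_bw  # FLOPs per byte needed to saturate compute
print(f"ridge point: {ridge:.0f} FLOPs/byte")
# -> ~300 FLOPs/byte. Any kernel doing less work per byte moved leaves
#    the compute units idle, waiting on memory.
```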

This hits decoder-style LLM inference the hardest.

Because decoder-style LLMs generate 1 token at a time, each step reuses the same weights but must stream a growing KV cache from memory. That makes arithmetic intensity low: you move a lot of bytes per token relative to the FLOPs you perform.
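To see how low, here is a sketch for a hypothetical 70B-parameter model at batch size 1 in FP16, ignoring KV reads and activations:

```python
# Batch-1 decode step: every generated token must read all weights once,
# while performing roughly 2 FLOPs per weight (one multiply-accumulate).

params       = 70e9                # hypothetical 70B-parameter model
weight_bytes = params * 2          # FP16 weights
flops        = 2 * params          # ~2 FLOPs per parameter per token

arithmetic_intensity = flops / weight_bytes
print(f"~{arithmetic_intensity:.0f} FLOP/byte")   # -> ~1 FLOP/byte
# Compare with the ~300 FLOPs/byte ridge point above: batch-1 decode is
# two orders of magnitude below it, i.e. firmly memory-bandwidth-bound.
```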

As the context grows, the KV cache grows linearly with sequence length and layer count, so every new token has to read more KV tensors; the KV cache quickly dominates the bytes moved.
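That linear growth is easy to put numbers on. A minimal sketch of the standard KV-cache size formula, using hypothetical, loosely Llama-70B-like dimensions with grouped-query attention:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * seq_len * bytes_per_element * batch
# Dimensions below are hypothetical, loosely Llama-70B-like with GQA.

def kv_cache_bytes(seq_len, layers=80, kv_heads=8, head_dim=128,
                   dtype_bytes=2, batch=1):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes * batch

for ctx in (4_096, 32_768, 128_000):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 1e9:.1f} GB")
# -> ~1.3 GB at 4k context, ~10.7 GB at 32k, ~42 GB at 128k,
#    and every new token has to re-read it.
```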

And that is why so much recent research focuses on reducing or reorganizing KV movement rather than adding FLOPs.

Training often needs 3–4× more memory than the parameters alone, because you must hold parameters, gradients, optimizer states, and activations.
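The usual mixed-precision Adam accounting (e.g. the breakdown used in the ZeRO paper) shows where that multiplier comes from; activations come on top of this. Same hypothetical 70B model as above:

```python
# Mixed-precision Adam keeps, per parameter:
#   FP16 weights (2 B) + FP16 gradients (2 B)
#   + FP32 master weights (4 B) + FP32 Adam m, v states (4 B + 4 B)
# = 16 bytes/param, i.e. ~4x the 4 B/param of FP32 weights alone,
#   before counting activations.

params = 70e9  # hypothetical 70B-parameter model
weights_only_gb   = params * 4 / 1e9
training_state_gb = params * 16 / 1e9
print(f"weights alone: {weights_only_gb:.0f} GB, "
      f"training state: {training_state_gb:.0f} GB")
# -> 280 GB vs 1120 GB, matching the 3-4x rule of thumb above.
```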

Hence the huge bandwidth gap: moving weights, activations, and the KV cache across chips and GPUs is slower than the raw compute can consume them.
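One direct consequence: at batch size 1, tokens/sec is capped by how fast the weights plus KV cache can stream through HBM, no matter how many FLOPs the chip has. A rough sketch reusing the illustrative numbers from above:

```python
# Bandwidth-bound lower bound on batch-1 decode latency:
# every token must stream weights + KV cache through HBM at least once.

weight_bytes = 70e9 * 2      # 70B params in FP16
kv_bytes     = 10e9          # ~10 GB KV cache (roughly 32k context, above)
hbm_bw       = 3.3e12        # ~3.3 TB/s

t_token = (weight_bytes + kv_bytes) / hbm_bw
print(f"{t_token*1e3:.0f} ms/token -> ~{1/t_token:.0f} tokens/s upper bound")
# -> ~45 ms/token, ~22 tokens/s, even with infinitely fast compute.
```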

Together, these memory movements dominate runtime and cost for modern LLMs.

🧵 Read on 👇
🧵2/n. AI and Memory Wall

The availability of vast unsupervised training data, along with neural scaling laws, has driven an unprecedented surge in model size and in the compute required to train and serve LLMs. The main performance bottleneck, however, is increasingly shifting to memory bandwidth.
🧵3/n. We can see the huge growth in HBM (High Bandwidth Memory) bit demand that has come alongside AI accelerator demand.

This is a direct manifestation of that "Memory Wall".

As models grow, we need more bits of HBM just to store weights, activations, and KV caches. That pushes us from "memory capacity limited" into "memory bandwidth limited": even when everything fits, we can't feed the compute units fast enough.

Every increase in model size, context length, or batch size means more bytes to move, which increases pressure on DRAM/HBM interfaces and interconnects. And the more HBM capacity you add, the more bandwidth you must deliver to utilize it fully.

You can’t freely scale HBM bandwidth linearly with capacity. Physical limits (pin count, thermal, power, cost) constrain how much bandwidth can be added. So even though bit demand explodes, the ability to feed those bits hasn’t kept pace, reinforcing the memory wall.

When a single GPU demands 1 TB of HBM, it's not just memory chips you need. The entire stack (HBM interface, interconnect, packaging, cooling) must scale to deliver that throughput.
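To put that capacity-versus-bandwidth tension in numbers, one last sketch with hypothetical capacity and bandwidth figures:

```python
# Capacity without matching bandwidth just sits there: the time to sweep
# all of HBM once bounds any workload that touches all of it.

capacity_tb = 1.0                 # hypothetical 1 TB of HBM on one GPU
for bw_tbs in (3.3, 8.0, 20.0):   # today-ish, near-term, aspirational
    sweep_ms = capacity_tb / bw_tbs * 1e3
    print(f"{bw_tbs:>5.1f} TB/s -> {sweep_ms:.0f} ms per full sweep")
# At 3.3 TB/s, reading 1 TB once takes ~300 ms; a batch-1 decoder that
# touched most of it per token would crawl at ~3 tokens/s.
```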
