Rohan Paul
Jun 17, 2025
It’s a hefty 206-page research paper, and the findings are concerning.

"LLM users consistently underperformed at neural, linguistic, and behavioral levels"

This study finds LLM dependence weakens the writer’s own neural and linguistic fingerprints. 🤔🤔

Using EEG, text mining, and a cross-over session, the authors show that keeping some AI-free practice time protects memory circuits and encourages richer language even when a tool is later reintroduced.
⚙️ The Experimental Setup

Fifty-four Boston-area students wrote SAT-style essays under three conditions: ChatGPT only, Google only, or brain only.

Each person completed three timed sessions with the same condition, then an optional fourth session in the opposite condition.

A 32-channel Enobio headset recorded brain signals throughout, and every keystroke, prompt, and interview answer was archived for analysis.
🧠 Brain Connectivity Results

Alpha and beta networks were strongest when no external tool was allowed, moderate with Google, and weakest with ChatGPT.

Lower coupling during LLM use signals reduced internal attention and memory rehearsal, while high parieto-frontal flow in the brain-only group matches deep semantic processing.
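
To make the connectivity measure concrete, here is a minimal sketch (not the paper's actual pipeline) of estimating alpha- and beta-band coupling between two EEG channels with magnitude-squared coherence. The channel names, sampling rate, and random data are placeholders.

```python
# Minimal sketch: alpha/beta-band coupling between two channels via coherence.
# Channel names, sampling rate, and the random signals are illustrative only.
import numpy as np
from scipy.signal import coherence

fs = 500                                   # assumed sampling rate in Hz
rng = np.random.default_rng(0)
frontal = rng.standard_normal(fs * 60)     # stand-in for a frontal channel, 60 s
parietal = rng.standard_normal(fs * 60)    # stand-in for a parietal channel, 60 s

freqs, coh = coherence(frontal, parietal, fs=fs, nperseg=fs * 2)

def band_mean(lo, hi):
    """Average coherence inside a frequency band."""
    mask = (freqs >= lo) & (freqs < hi)
    return coh[mask].mean()

print("alpha (8-12 Hz) coupling:", band_mean(8, 12))
print("beta (13-30 Hz) coupling:", band_mean(13, 30))
```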
📚 Linguistic Patterns

Essays produced with ChatGPT clustered tightly in embedding space and reused the same named entities, showing high textual homogeneity.

Google essays sat in the middle, influenced by search rankings, whereas brain-only essays scattered widely, reflecting individual experience and vocabulary.
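
A minimal sketch of that homogeneity comparison: essays that cluster tightly in embedding space have high average pairwise cosine similarity. The embedding matrices below are random placeholders, not the study's actual vectors.

```python
# Quantifying textual homogeneity as mean pairwise cosine similarity.
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    iu = np.triu_indices(len(embeddings), k=1)   # upper triangle, no diagonal
    return float(sims[iu].mean())

rng = np.random.default_rng(0)
# Tight cluster: small spread around a shared direction (mimics LLM essays).
llm_essays = rng.normal(0, 0.1, (18, 384)) + rng.normal(0, 1, (1, 384))
# Wide scatter: independent directions (mimics brain-only essays).
brain_essays = rng.normal(0, 1.0, (18, 384))

print("LLM-condition homogeneity:", mean_pairwise_cosine(llm_essays))
print("brain-only homogeneity:   ", mean_pairwise_cosine(brain_essays))
```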
📝 Memory and Ownership

After writing, only 17 % of ChatGPT users could quote their own sentences, versus 89 % in the brain-only group.

ChatGPT writers also reported the weakest sense of authorship, matching EEG evidence of reduced self-monitoring hubs.
🔄 Crossover Effects

When habitual ChatGPT users had to write unaided, their connectivity and quoting remained low, suggesting lingering cognitive debt.

In contrast, brain-only writers who switched to ChatGPT lit up wide networks and produced richer revisions, showing that tool use after deep practice boosts, rather than blunts, engagement.
⚖️ Cognitive Load Implications

LLMs cut extraneous load by 32% and extend productive time, yet they also trim germane load, so schema building suffers unless learners deliberately integrate ideas themselves.
🔍 Echo-Chamber Risk

Because a probabilistic model favors agreeable continuations, ChatGPT can tighten information loops more than a search page, shrinking exposure to contrasting facts and dulling critical thought.

Figure captions from the attached images:

- Hooking sentence options
- Percentage of participants within each group who struggled to quote anything from their essays
- Percentage of participants within each group who provided a correct quote from their essays in Session
- Relative reported percentage of perceived ownership of essay by the participants, in comparison to the Brain-only group

More from @rohanpaul_ai

Mar 5
A new paper from Yann LeCun (@ylecun) and other top researchers proposes a brilliant idea. 🎯

Says that chasing general AI is a mistake and we must build superhuman adaptable specialists instead.

The whole AI industry is obsessed with building machines that can do absolutely everything humans can do.

But this goal is fundamentally flawed because humans are actually highly specialized creatures optimized only for physical survival.

Instead of trying to force one giant model to master every possible task from folding laundry to predicting protein structures, they suggest building expert systems that learn generic knowledge through self-supervised methods.

By using internal world models to understand how things work, these specialized systems can quickly adapt to solve complex problems that human brains simply cannot handle.

This shift means we can stop wasting computing power on human traits and focus on building diverse tools that actually solve hard real-world problems.

So overall the researchers here propose a new target called Superhuman Adaptable Intelligence which focuses strictly on how fast a system learns new skills.

The paper explicitly argues that evolution shaped human intelligence strictly as a specialized tool for physical survival.

The researchers state that nature optimized our brains specifically for tasks necessary to stay alive in the physical world.

They explain that abilities like walking or seeing seem incredibly general to us only because they are absolutely critical for our existence.

The authors point out that humans are actually terrible at cognitive tasks outside this evolutionary comfort zone, like calculating massive mathematical probabilities.

The study highlights how a chess grandmaster only looks intelligent compared to other humans, while modern computers easily crush those human limits.

This proves their central point that humanity suffers from an illusion of generality simply because we cannot perceive our own biological blind spots.

They conclude that building machines to mimic this narrow human survival toolkit is a deeply flawed way to create advanced technology.
This visual maps different AI goals to show how adaptable intelligence completely beats older performance ideas.

Traditional targets only focus on copying human jobs.

The new framework prioritizes fast learning across important tasks.

It targets high adaptability over static performance.

Specialized experts easily beat systems mimicking rigid human behavior.

----

Paper arxiv.org/abs/2602.23643…

"AI Must Embrace Specialization via Superhuman Adaptable Intelligence"Image
This table shows why the most famous definitions for AGI are totally flawed.

It shows they fail because they wrongly assume human minds are perfectly general, demand impossible computing resources, or simply cannot be tested.

i.e., the industry needs a smarter goal.
Feb 28
The US government’s designation of Anthropic as a "supply chain risk" could have massive, existential ripple effects on the company and across the entire tech industry.

Defense Secretary Pete Hegseth's explicit directive states:
"Effective immediately, no contractor, supplier, or partner that does business with the United States military may conduct any commercial activity with Anthropic."

NOW THE PROBLEM IS - Every single major cloud provider in the United States is officially a defense contractor.

Because Anthropic does not own data centers, they rely entirely on providers like AWS and Google Cloud to train and run their models.

This new government decree forces those cloud giants into a brutal ultimatum. If forced to choose between multi-billion-dollar defense contracts and hosting a single AI company, these hyperscalers (cloud providers) will undeniably choose to protect their Pentagon ties. It is highly unlikely they will jeopardize their standing in the JWCC just to keep Anthropic online.

Per the Department of War's official December 2022 press release, the Joint Warfighting Cloud Capability (JWCC) is a massive, multi-billion-dollar initiative awarding cloud computing contracts to Amazon Web Services (AWS), Google Cloud, Microsoft Azure, and Oracle.

---

Now some possibilities.

1. The Literal Threat is Real, but Unprecedented

If the DoD enforces this decree exactly as written, it acts as a total "secondary boycott." Historically, the U.S. government has used "supply chain risk" designations for foreign adversaries (like Chinese telecom giant Huawei or Russian software firm Kaspersky).

Applying this to a domestic U.S. company valued at $380B is entirely unprecedented.

2. Historically, when a company is deemed a supply chain risk, the law dictates that government contractors cannot use the blacklisted technology in their own internal networks, nor can they resell it to the government.

For example, Microsoft and Amazon would be barred from offering Anthropic's Claude to federal agencies or using Claude to write code for defense projects. However, a traditional blacklist does not usually prevent a contractor from simply selling generic cloud hosting services to the blacklisted entity in a completely separate commercial capacity.

3. A total decoupling of Anthropic from the world's major cloud providers would face massive legal and logistical hurdles.
Banning hyperscalers from simply selling server space to Anthropic would represent a dramatic expansion of federal procurement power.

However, the risk still remains. Unless the Pentagon legally exempts basic server hosting from their definition of "commercial activity," Anthropic may face an imminent and total infrastructure blackout.
The video is from the 'Interesting Times with Ross Douthat + New York Times Podcasts + New York Times Opinion' YT channel.

Feb 21
NanoClaw, the lightweight alternative to Clawdbot / OpenClaw, has already reached 10.5K GitHub stars ⭐️

Compared with OpenClaw, NanoClaw’s specialty is simplicity plus OS-level isolation.

- Much smaller, more manageable codebase of only ~4K lines.
- Runs in containers for security.
- Connects to WhatsApp, has memory and scheduled jobs, and runs directly on Anthropic's Agents SDK.
- Stores state in SQLite and keeps each chat group isolated with its own memory file and its own Linux container, so the agent only sees directories you explicitly mount.
- Its safety model leans on application controls like allowlists and pairing codes inside a shared Node process (see the sketch below).
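
A generic sketch (not NanoClaw's actual code) of that application-control pattern: an allowlist of chat groups plus a one-time pairing code that must be echoed back before the agent will act for a new group. All names and the code format are illustrative assumptions.

```python
# Toy allowlist + pairing-code gate for an agent that serves chat groups.
import secrets

ALLOWED_GROUPS: set[str] = {"family", "ops-team"}   # groups already paired
pending_codes: dict[str, str] = {}                  # group_id -> issued code

def request_pairing(group_id: str) -> str:
    """Issue a pairing code the group owner must echo back to be allowed."""
    code = secrets.token_hex(3)
    pending_codes[group_id] = code
    return code

def confirm_pairing(group_id: str, code: str) -> bool:
    """Move the group onto the allowlist if the echoed code matches."""
    if pending_codes.get(group_id) == code:
        ALLOWED_GROUPS.add(group_id)
        del pending_codes[group_id]
        return True
    return False

def may_act(group_id: str) -> bool:
    """The agent only handles messages from allowlisted groups."""
    return group_id in ALLOWED_GROUPS

code = request_pairing("book-club")
print(may_act("book-club"))                          # False until paired
print(confirm_pairing("book-club", code), may_act("book-club"))
```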

OpenClaw is built for broad multi-channel coverage, while NanoClaw intentionally stays minimal, so you customize by changing a small codebase instead of operating a big framework.
NanoClaw's philosophy
Feb 12
An open-source 9B model with a 1M-token context and an Apache 2.0 license has just been released on Hugging Face. It’s designed to run on a single consumer-class GPU, such as the NVIDIA RTX 5090.

This model breaks the "Compute Wall" and the "Memory Wall," achieving 3.5× faster inference and significantly lower KV-cache overhead compared to dense baselines.

This is no longer an "either-or" choice between performance and efficiency.

How?
The full-attention mechanism's computational complexity grows quadratically with sequence length, making edge-side long-text inference "slow and memory-intensive."

Solution: MiniCPM-SALA adopts a golden ratio of 75% Linear Attention + 25% Sparse Attention.

MiniCPM-SALA (9B) is OpenBMB’s long-context model aimed at running 1M to 2M tokens on a single GPU without the memory spikes and OOM failures common with dense full attention. The main idea is a sparse plus linear hybrid that keeps long-range recall accurate while keeping cost manageable as context grows.

- Architecturally, about 25% of layers use InfLLM-V2 style sparse attention for high-fidelity long-range retrieval, while about 75% use Lightning linear attention, so compute scales close to linearly with sequence length. Instead of a uniform interleave, the sparse layers are placed via a 1:3 layer-selection pattern (a toy sketch of this layout follows after these notes).

- For positional handling and stability, SALA uses hybrid positional encoding (HyPE): RoPE stays in the linear layers but is removed in sparse layers to avoid long-range decay, and it adds QK-normalization plus output gating to improve stability and reduce attention-sink behavior.

- Training is done by converting a pretrained Transformer, not training from scratch. It starts from a MiniCPM-4.0 intermediate checkpoint trained on 7T tokens, then applies HALO conversion, keeping the 1st and last layers unconverted and initially training only the converted linear layers.

Conversion plus post-training totals about 2T tokens, framed as about a 75% cost reduction versus an 8T scratch run, with context ramping from 512 to 4K, then to 32K, 160K, and 520K, followed by SFT at 64K and 140K.
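
To make the 1:3 placement concrete, here is a toy sketch of the layer layout described above, assuming 32 layers and keeping the first and last layers as full attention per the HALO conversion note. It is an illustrative approximation, not the official MiniCPM-SALA config.

```python
# Toy layer-type assignment: full attention at the ends, ~1 sparse layer for
# every 3 linear layers elsewhere, per the description above.
def assign_layers(num_layers: int = 32) -> list[str]:
    layers = []
    for i in range(num_layers):
        if i in (0, num_layers - 1):
            layers.append("full")      # unconverted softmax attention (HALO keeps these)
        elif i % 4 == 0:
            layers.append("sparse")    # InfLLM-V2 style sparse attention, RoPE removed (HyPE)
        else:
            layers.append("linear")    # Lightning linear attention, RoPE kept
    return layers

layout = assign_layers()
converted = [layer for layer in layout if layer != "full"]
print(layout)
# Roughly a quarter of the converted layers are sparse, matching the 1:3 pattern.
print("sparse share of converted layers:", layout.count("sparse") / len(converted))
```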

Reported results keep standard performance strong (76.53 average, HumanEval 95.12, AIME24 83.75, AIME25 78.33) while improving long-context behavior (RULER 92.65 at 64K, 89.37 at 128K).

It also reports single-GPU 1M-token inference where Qwen3-8B OOMs, 256K-context TTFT improving from 180.8s to 51.6s, and RULER holding at 86.3 at 1M and 81.6 at 2M without YaRN.

Go to Hugging Face/GitHub to test the model capabilities yourself.
🧵 2. The diagram compares a standard Transformer attention block on the right with the “hybrid” replacement block on the left.

On the right, softmax attention needs to keep a big key value cache for every past token, so as the context gets huge, the GPU runs out of memory and also slows down.

On the left, most layers swap that attention for an RNN-style “mixer” that keeps a running state S_t, so the model carries a compressed summary forward instead of storing per-token history, which makes very long context much cheaper in memory and compute.

The numbered marks show small but important fixes they apply during their HALO conversion, mainly hybrid positional encoding (HyPE) plus a few stability tweaks so the hybrid layers behave like the original Transformer at short context but do not fall apart at long context.

MiniCPM-SALA applies the same core idea at scale, keeping only 25% of the heavier attention-style layers and making 75% of layers use cheaper attention variants, and the project claims this makes 1M-token inference practical on a single RTX 5090 because KV-cache pressure drops sharply.
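
A minimal sketch of the RNN-style "mixer" idea: instead of caching every past key/value pair, the layer keeps a running state S_t that is updated once per token. This is a plain, unnormalized linear-attention recurrence, simplified from what the model actually uses; dimensions and data are placeholders.

```python
# Running-state linear attention: memory stays O(d*d) regardless of context length.
import numpy as np

d = 8
rng = np.random.default_rng(0)
S = np.zeros((d, d))                  # running state, fixed size

def step(q, k, v, S):
    S = S + np.outer(k, v)            # fold this token's key/value into the state
    out = q @ S                       # read out with the query; no per-token KV cache
    return out, S

for _ in range(10_000):               # loop over many tokens; state never grows
    q, k, v = rng.standard_normal((3, d))
    out, S = step(q, k, v, S)

print("state shape stays", S.shape)   # compare: softmax attention caches every past k, v
```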
🧵 3. “Hybridizing attention” can keep quality while cutting long context memory and latency.

MiniCPM-SALA is the productized version of that same idea.

In the paper, the researchers take a dense Transformer family (Qwen3) and convert it into a hybrid model they call HypeNet using a distillation recipe called HALO (Hybrid Attention via Layer Optimization), then they show HypeNet keeps performance while using less memory and avoiding the long context slowdown and out-of-memory failure you see in dense attention.

Also, the hybrid model can push higher throughput at a given quality level, meaning it generates tokens faster for the same kind of task, while the dense baseline slows down.

The right plot shows that, as context grows toward 1M, the dense Qwen3 version runs out of GPU memory, but the hybrid version still runs and keeps time per output token much lower.

The key architectural reason is that most layers stop using full softmax attention that needs a large key value cache for every past token, and instead use a cheaper hybrid or linear style mixer plus positional encoding changes like HyPE, so long context does not break.

This is the same general idea MiniCPM-SALA is selling: keep only a smaller fraction of heavier attention layers and make most layers cheaper, which is why they claim 1M token inference on a single RTX 5090.
Jan 14
DeepSeek's innovation level is really at another level.

Its new paper just uncovered a new U-shaped scaling law.

Shows that N-grams still matter. Instead of dropping them in favor of neural networks, they hybridize the two. This clears up the dimensionality problem and removes a big source of inefficiency in modern LLMs.

Uncovers a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram).

Right now, even “smart” LLMs waste a bunch of their early layers re-building common phrases and names from scratch, because they do not have a simple built-in “lookup table” feature.

Mixture-of-Experts already saves compute by only running a few expert blocks per token, but it still forces the model to spend compute to recall static stuff like named entities and formula-style text.

Engram is basically a giant memory table that gets queried using the last few tokens, so when the model sees a familiar short pattern it can fetch a stored vector quickly instead of rebuilding it through many layers.

They implement that query using hashed 2-gram and 3-gram patterns, which means the model always does the same small amount of lookup work per token even if the table is huge.

The big benefit is that if early layers stop burning time on “static reconstruction,” the rest of the network has more depth left for real reasoning, and that is why reasoning scores go up even though this sounds like “just memory.”

The long-context benefit is also solid, because offloading local phrase glue to memory frees attention to focus on far-away relationships, and Multi-Query Needle-in-a-Haystack goes from 84.2 to 97.0 in their matched comparison.

The system-level big deal is cost and scaling, because they show you can offload a 100B memory table to CPU memory and the throughput drop stays under 3%, so you can add a lot more “stored stuff” without needing to fit it all on GPU memory.
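
A hypothetical sketch in the spirit of that hashed lookup: the last 2-3 token ids are hashed into a fixed-size table and the stored vector is fetched in O(1), independent of table size. The table size, hashing scheme, and how the vector is merged back into the model are all illustrative assumptions, not DeepSeek's implementation.

```python
# Toy hashed n-gram memory: constant lookup work per token, however big the table.
import numpy as np

TABLE_SIZE = 1 << 16      # number of memory slots (toy value)
DIM = 64                  # width of each stored vector
rng = np.random.default_rng(0)
memory = rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32)

def ngram_slot(token_ids: tuple[int, ...]) -> int:
    """Hash the trailing token ids into a table slot (FNV-style mixing)."""
    h = 1469598103934665603
    for t in token_ids:
        h = (h ^ t) * 1099511628211 % (1 << 64)
    return h % TABLE_SIZE

def engram_lookup(context: list[int]) -> np.ndarray:
    """Fetch stored vectors for the trailing 2-gram and 3-gram and sum them."""
    v2 = memory[ngram_slot(tuple(context[-2:]))]
    v3 = memory[ngram_slot(tuple(context[-3:]))]
    return v2 + v3

tokens = [1012, 734, 88, 9021]
print(engram_lookup(tokens).shape)   # one cheap fetch instead of layers of recomputation
```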
🧩 The core problem

The paper splits language modeling into 2 jobs: deep reasoning that needs real computation, and local stereotyped patterns that are basically fast recall.

Transformers do not have a native lookup block, so they burn early attention and feed-forward layers to rebuild static stuff like multi-token entities and formulaic phrases.

That rebuild is expensive mainly because it eats sequential depth, meaning the model spends layers on trivia-like reconstruction before it even starts the harder reasoning steps.

Classical N-gram models already handle a lot of this local dependency work with cheap table access, so forcing a Transformer to relearn it through compute is a design mismatch.

Engram is their way of turning “lookup” into a first-class primitive that lives next to MoE, instead of being faked by extra neural layers.
Engram adds a huge hashed N-gram memory table that gets queried with a fixed amount of work per token, so early layers stop wasting compute rebuilding names and stock phrases.

They show the best results when about 20% to 25% of the sparse budget moves from experts into this memory, while total compute stays matched.

Engram hits 97.0 on Multi-Query Needle-in-a-Haystack, while the matched MoE baseline hits 84.2.
Jan 10
Anthropic has launched improved safety classifiers aimed at stopping AI jailbreaks.

The key idea is to add a cheap “early warning” safety check that runs all the time, and only turn on a stronger safety check when something looks suspicious, while also judging the user prompt and the model’s reply together so hidden or coded jailbreak requests do not slip through.

A new “Constitutional Classifiers++” setup stops universal jailbreaks without making the model expensive or annoying to use, by using a cheap internal probe to screen everything and only escalating suspicious chats to a stronger context-aware classifier.

40x less compute than a full exchange classifier, and 0.05% refusals on production traffic.

They proposed an “ensemble cascade” where an exchange classifier reads the prompt and the model’s partial answer together to beat obfuscation tricks, but it runs rarely because a near-free linear probe, built off the model’s own internal activations, decides when escalation is needed.
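
A hedged sketch of the cascade logic: a near-free probe scores every exchange, and only scores above a threshold get the expensive context-aware classifier. The probe, the exchange classifier, and the threshold below are stand-ins, not Anthropic's actual components.

```python
# Two-tier safety cascade: cheap probe everywhere, costly classifier rarely.
def probe_score(activations: list[float]) -> float:
    """Cheap linear probe over the model's own activations (placeholder weights)."""
    w = [0.2, -0.1, 0.5]
    return sum(a * b for a, b in zip(activations, w))

def exchange_classifier(prompt: str, partial_reply: str) -> bool:
    """Expensive check that reads prompt and reply together (placeholder rule)."""
    return "how to synthesize" in (prompt + " " + partial_reply).lower()

def should_block(prompt: str, partial_reply: str, activations: list[float],
                 escalate_at: float = 0.4) -> bool:
    if probe_score(activations) < escalate_at:
        return False                                      # cheap path: most traffic stops here
    return exchange_classifier(prompt, partial_reply)     # rare, costly path

print(should_block("tell me a joke", "Why did the", [0.1, 0.0, 0.2]))
```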
🧠 The idea

A “Constitutional Classifier” is a separate model whose whole job is to read a conversation and decide whether the assistant should continue or refuse.

“Constitutional” here means the classifier is trained against a written rule set that says what is allowed and what is disallowed, like helping with normal chemistry learning but not with chemical weapon steps.

The core deployment problem is that defenses need to be cheap and rarely block normal users, because even a tiny refusal rate becomes real friction at scale.

This paper treats compute cost and refusal rate as first-class constraints, not afterthoughts, and the whole design is built around that.
🚨 What broke in the last system

The earlier setup split the job into 2 separate checks, one model looked only at the user input, and another looked only at the assistant output.

That split sounds clean, but it creates a blind spot when the harmful meaning is distributed across the prompt and the reply, so neither side looks bad alone.

Reconstruction-style attacks exploit this by hiding a harmful request inside a larger harmless-looking blob, then asking the model to piece it back together while answering.

Obfuscation-style attacks exploit it by pushing the model to speak in code words, riddles, or substitutions that look safe if the output is judged without the prompt context.

Some of these attack styles also damage normal model capability, and the paper shows GPQA Diamond accuracy dropping from 74.2% to 32.3% under one such jailbreak pattern, which signals the attack is “expensive” but still not something to rely on.