Rohan Paul Profile picture
Aug 24, 2025 11 tweets 6 min read Read on X
MASSIVE claim in this paper 🫡

The top-most Universities from US, UK, EU, China, Canada, Singapore, Australia collaborated.

Will completely research paper writing.

They proved, AI can already draft proposals, run experiments, and write papers.

The authors built aiXiv, a new open-access platform where AI and humans can submit, review, and revise research in a closed-loop system.

The system uses multiple AI reviewers, retrieval-augmented feedback, and defenses against prompt injection to ensure that papers actually improve after review.

And the process worked: AI-generated proposals and papers get much better after iterative review, with acceptance rates jumping from near 0% to 45% for proposals and from 10% to 70% for papers.

🧵 Read on 👇Image
🧵2/n. Across real experiments it hits 77% proposal ranking accuracy, 81% paper ranking accuracy, blocks prompt‑injection with up to 87.9% accuracy, and pushes post‑revision acceptance for papers from 10% to 70%.

81% paper accuracy, 87.9% injection detection, papers 10%→70% after revision.Image
🧵3/n. This diagram shows aiXiv’s closed-loop system where AI and humans submit work, get automated reviews, revise, and then publish once quality clears the bar.

It means the platform is not a simple preprint dump, it is a workflow that forces measurable improvement each cycle.

Review agents score novelty, soundness, clarity, and feasibility using retrieval so feedback is grounded, and a prompt-injection detector screens malicious instructions before any model reads the file.

If the revised version looks better in pairwise checks, it moves forward, then a panel of LLMs votes, and 3 of 5 accepts trigger publication.

So the figure is saying aiXiv operationalizes end-to-end research, from idea to accepted paper, with guardrails and iteration built in.Image
🧵4/n. 🚧 Why this is needed

LLMs can already draft proposals, run experiments, and write papers, but journals resist AI authors and preprints miss screening, so strong AI‑generated research has nowhere credible to land.

This platform targets that gap by pairing automated review with structured revision so content quality is tracked and improved, not just posted.Image
Image
🧵5/n. ⚙️ The Core Concepts

An AI or human submits a proposal or paper, review agents score novelty, soundness, clarity, and feasibility, then return concrete fixes, the author revises, and the loop repeats until it clears the bar.

The loop is submission → automated review → revision → re‑evaluation → decision, which keeps pressure on actual improvements rather than one‑shot verdicts.

Each accepted item gets a DOI and explicit IP credit to the model developer and any initiating human, so attribution is clear from day one.

A public UI lets people like, comment, and discuss, which gives extra feedback signals to steer agent behavior.Image
🧵6/n. 🧾 How reviews work

Single Review Mode uses 1 reviewer agent to give targeted revisions over 4 axes, methodological quality, novelty, clarity, and feasibility, with grounded literature fetched by RAG so suggestions come with context.

Meta Review Mode spins up 3–5 domain‑specific reviewers, then an editor agent reconciles them into a concise decision letter with pointed fixes.

Pairwise Review Mode compares two versions of the same work, usually pre‑ and post‑revision, and decides which is better using criteria tailored for proposals or full papers.

Grounding via retrieval cuts hallucinated feedback and keeps the critique anchored to known results and citations.Image
🧵7/n. 🛡️ Defense against prompt‑injection

A 5‑stage pipeline inspects PDFs at text, layout, and semantic levels, so hidden instructions in white text, zero‑width characters, or multilingual tricks get surfaced before any model reads them.

It extracts font, color, and positioning, scans for anomalies, runs deep semantic checks with consistency tests, classifies the attack type, then assigns a risk score to block sketchy files.

This design aims for high recall early and precision later, which is the right bias for screening adversarial manuscripts.Image
🧵8/n. ✅ Publication decision

Five strong LLMs review independently, and a submission is accepted when 3 of 5 vote accept, which reduces any single‑model bias.

Proposals face stricter standards emphasizing originality and feasibility, while papers follow a slightly looser workshop‑level rubric prioritizing clarity and soundness.

Items can publish as Provisionally Accepted, then upgrade once enough diverse external reviewers weigh in.Image
🧵9/n. 📊 What the experiments say

Proposal ranking with RAG hits 77% on ICLR‑derived pairs, beating a 71% baseline reported in prior work.

Paper‑level ranking reaches 81%, which is solid given long contexts and messy drafts.

Prompt‑injection detection scores 84.8% on synthetic adversarials and 87.9% on suspicious real samples.

After agents review and authors revise, >90% of proposals and papers are preferred over the originals, and with a short response letter that climbs toward ~100%.

Majority voting mirrors that lift, proposals jump from 0% to 45.2% accepted on average, and papers jump from 10% to 70%.Image
🧵10/n. 🔌 Interfaces and ecosystem

An API plus Model Control Protocol lets heterogeneous agents plug in as authors, reviewers, and meta‑reviewers without glue code.

Accepted items get a DOI and explicit IP attribution, which matters for crediting both the human initiator and the model developer.

Community reactions, likes and comments, feed back as weak signals to help align agent behavior with evolving norms.Image
Paper –

Paper Title: "aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists"arxiv.org/abs/2508.15126

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Rohan Paul

Rohan Paul Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @rohanpaul_ai

Feb 12
An open-source 9B model with a 1M-token context and an Apache 2.0 license has just been released on Hugging Face. It’s designed to run on a single consumer-class GPU, such as the NVIDIA RTX 5090.

This model breaks the "Compute Wall" and the "Memory Wall," achieving 3.5× faster inference and significantly lower KV-cache overhead compared to dense baselines.

This is no longer an "either-or" choice between performance and efficiency.

How?
Full Attention mechanism's computational complexity grows quadratically with length, making edge-side long-text inference "slow and memory-intensive."

Solution: MiniCPM-SALA adopts a golden ratio of 75% Linear Attention + 25% Sparse Attention.

MiniCPM-SALA (9B) is OpenBMB’s long-context model aimed at running 1M to 2M tokens on a single GPU without the memory spikes and OOM failures common with dense full attention. The main idea is a sparse plus linear hybrid that keeps long-range recall accurate while keeping cost manageable as context grows.

- Architecturally, about 25% of layers use InfLLM-V2 style sparse attention for high-fidelity long-range retrieval, while about 75% use Lightning linear attention, so compute scales close to linearly with sequence length. Instead of a uniform interleave, the sparse layers are placed via a 1:3 layer-selection pattern.

- For positional handling and stability, SALA uses hybrid positional encoding (HyPE): RoPE stays in the linear layers but is removed in sparse layers to avoid long-range decay, and it adds QK-normalization plus output gating to improve stability and reduce attention-sink behavior.

- Training is done by converting a pretrained Transformer, not training from scratch. It starts from a MiniCPM-4.0 intermediate checkpoint trained on 7T tokens, then applies HALO conversion, keeping the 1st and last layers unconverted and initially training only the converted linear layers.

Conversion plus post-training totals about 2T tokens, framed as about a 75% cost reduction versus an 8T scratch run, with context ramping from 512 to 4K, then to 32K, 160K, and 520K, followed by SFT at 64K and 140K.

Reported results keep standard performance strong (76.53 average, HumanEval 95.12, AIME24 83.75, AIME25 78.33)

While improving long-context behavior (RULER 92.65 at 64K, 89.37 at 128K). It also reports single-GPU 1M-token inference where Qwen3-8B OOMs, 256K TTFT, improving from 180.8s to 51.6s, and RULER holding at 86.3 at 1M and 81.6 at 2M without YaRN.

Go to Hugging Face/GitHub to test the model capabilities yourself.Image
🧵 2. The diagram compares a standard Transformer attention block on the right with the “hybrid” replacement block on the left.

On the right, softmax attention needs to keep a big key value cache for every past token, so as the context gets huge, the GPU runs out of memory and also slows down.

On the left, most layers swap that attention for an RNN-style “mixer” that keeps a running state S_t, so the model carries a compressed summary forward instead of storing per-token history, which makes very long context much cheaper in memory and compute.

The numbered marks show small but important fixes they apply during their HALO conversion, mainly hybrid positional encoding (HyPE) plus a few stability tweaks so the hybrid layers behave like the original Transformer at short context but do not fall apart at long context.

MiniCPM-SALA applies the same core idea at scale, keeping only 25% heavier attention style layers and making 75% of layers use cheaper attention variants, and the project claims this makes 1M token inference practical on a single RTX 5090 because KV cache pressure drops hard.Image
🧵 3. “Hybridizing attention” can keep quality while cutting long context memory and latency.

MiniCPM-SALA is the productized version of that same idea

In the paper, the researchers take a dense Transformer family (Qwen3) and convert it into a hybrid model they call HypeNet using a distillation recipe called HALO (Hybrid Attention via Layer Optimization), then they show HypeNet keeps performance while using less memory and avoiding the long context slowdown and out-of-memory failure you see in dense attention.

Also, the hybrid model can push higher throughput at a given quality level, meaning it generates tokens faster for the same kind of task, while the dense baseline slows down.

The right plot shows that, as context grows toward 1M, the dense Qwen3 version runs out of GPU memory, but the hybrid version still runs and keeps time per output token much lower.

The key architectural reason is that most layers stop using full softmax attention that needs a large key value cache for every past token, and instead use a cheaper hybrid or linear style mixer plus positional encoding changes like HyPE, so long context does not break.

This is the same general idea MiniCPM-SALA is selling: keep only a smaller fraction of heavier attention layers and make most layers cheaper, which is why they claim 1M token inference on a single RTX 5090.Image
Read 9 tweets
Jan 14
DeepSeek's innovation level is really at another level.

Its new paper just uncovered a new U-shaped scaling law.

Shows that N-grams still matter. Instead of dropping them in favor of neural networks, they hybridize the 2. This clears up the dimensionality problem and removes a big source of inefficiency in modern LLMs.

Uncovers a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram).

Right now, even “smart” LLMs waste a bunch of their early layers re-building common phrases and names from scratch, because they do not have a simple built-in “lookup table” feature.

Mixture-of-Experts already saves compute by only running a few expert blocks per token, but it still forces the model to spend compute to recall static stuff like named entities and formula-style text.

Engram is basically a giant memory table that gets queried using the last few tokens, so when the model sees a familiar short pattern it can fetch a stored vector quickly instead of rebuilding it through many layers.

They implement that query using hashed 2-gram and 3-gram patterns, which means the model always does the same small amount of lookup work per token even if the table is huge.

The big benefit is that if early layers stop burning time on “static reconstruction,” the rest of the network has more depth left for real reasoning, and that is why reasoning scores go up even though this sounds like “just memory.”

The long-context benefit is also solid, because offloading local phrase glue to memory frees attention to focus on far-away relationships, and Multi-Query Needle-in-a-Haystack goes from 84.2 to 97.0 in their matched comparison.

The system-level big deal is cost and scaling, because they show you can offload a 100B memory table to CPU memory and the throughput drop stays under 3%, so you can add a lot more “stored stuff” without needing to fit it all on GPU memory.Image
🧩 The core problem

The paper splits language modeling into 2 jobs, deep reasoning that needs real computation, and local stereotyped patterns that are basically fast recall.

Transformers do not have a native lookup block, so they burn early attention and feed-forward layers to rebuild static stuff like multi-token entities and formulaic phrases.

That rebuild is expensive mainly because it eats sequential depth, meaning the model spends layers on trivia-like reconstruction before it even starts the harder reasoning steps.

Classical N-gram models already handle a lot of this local dependency work with cheap table access, so forcing a Transformer to relearn it through compute is a design mismatch.

Engram is their way of turning “lookup” into a first-class primitive that lives next to MoE, instead of being faked by extra neural layers.Image
Engram adds a huge hashed N-gram memory table that gets queried with a fixed amount of work per token, so early layers stop wasting compute rebuilding names and stock phrases.

They show the best results when about 20% to 25% of the sparse budget moves from experts into this memory, while total compute stays matched.

Engram hits 97.0 on Multi-Query Needle-in-a-Haystack, while the matched MoE baseline hits 84.2.Image
Read 9 tweets
Jan 10
Anthropic has launched improved safety classifiers aimed at stopping AI jailbreaks.

The key idea is to add a cheap “early warning” safety check that runs all the time, and only turn on a stronger safety check when something looks suspicious, while also judging the user prompt and the model’s reply together so hidden or coded jailbreak requests do not slip through.

A new “Constitutional Classifiers++” setup stops universal jailbreaks without making the model expensive or annoying to use, by using a cheap internal probe to screen everything and only escalating suspicious chats to a stronger context-aware classifier.

40x less compute than a full exchange classifier, and 0.05% refusals on production traffic.

They proposed an “ensemble cascade” where an exchange classifier reads the prompt and the model’s partial answer together to beat obfuscation tricks, but it runs rarely because a near-free linear probe, built off the model’s own internal activations, decides when escalation is needed.Image
Image
🧠 The idea

A “Constitutional Classifier” is a separate model whose whole job is to read a conversation and decide whether the assistant should continue or refuse.

“Constitutional” here means the classifier is trained against a written rule set that says what is allowed and what is disallowed, like helping with normal chemistry learning but not with chemical weapon steps.

The core deployment problem is that defenses need to be cheap and rarely block normal users, because even a tiny refusal rate becomes real friction at scale.

This paper treats compute cost and refusal rate as first-class constraints, not afterthoughts, and the whole design is built around that.Image
🚨 What broke in the last system

The earlier setup split the job into 2 separate checks, one model looked only at the user input, and another looked only at the assistant output.

That split sounds clean, but it creates a blind spot when the harmful meaning is distributed across the prompt and the reply, so neither side looks bad alone.

Reconstruction-style attacks exploit this by hiding a harmful request inside a larger harmless-looking blob, then asking the model to piece it back together while answering.

Obfuscation-style attacks exploit it by pushing the model to speak in code words, riddles, or substitutions that look safe if the output is judged without the prompt context.

Some of these attack styles also damage normal model capability, and the paper shows GPQA Diamond accuracy dropping from 74.2% to 32.3% under 1 such jailbreak pattern, which signals the attack is “expensive” but still not something to rely on.
Read 7 tweets
Jan 1
🚨 BREAKING: DeepSeek dropped a core Transformer architecture improvement.

A traditional transformer is basically a long stack of blocks, and each block has a “main work path” plus a “shortcut path” called the residual connection that carries the input around the block and adds it back at the end.

Each block in this original transformer architecture does some work (self attention or a small feed forward network), then it adds the block’s input back onto the block’s output, which is why people describe it as a “main path” plus a “shortcut path.”

Hyper-Connections is a drop-in change to that shortcut path, because instead of carrying 1 stream of activations through the stack, the model carries a small bundle of parallel streams, then it learns how to mix them before a block and after a block.

Standard Transformers pass information through 1 residual stream. Hyper-Connections turn that into n parallel streams, like n lanes on a highway. Small learned matrices decide how much of each lane should mix into the others at every layer.

In a normal residual connection, each layer takes the current hidden state, runs a transformation, then adds the original back, so information can flow forward without getting stuck.

In this new Hyper-Connections, the layer does not see just 1 hidden state, it sees a small bundle of them, and before the layer it learns how to mix that bundle into the input it will process.

So in a traditional transformer block, wherever you normally do “output equals input plus block(input),” Hyper-Connections turns that into “output bundle equals a learned mix of the input bundle plus the block applied to a learned mix,” so the shortcut becomes more flexible than a plain add.

After this learned layer, the "Hyper-Connections" mechanism again learns how to mix the transformed result back into the bundle, so different lanes can carry different kinds of information, and the model can route signal through the shortcut in a more flexible way.

The catch is that if those learned mixing weights are unconstrained, stacking many blocks can make signals gradually blow up or fade out, and training becomes unstable in big models.

This paper proposes mHC, which keeps Hyper-Connections but forces every mixing step to behave like a safe averaging operation, so the shortcut stays stable while the transformer still gets the extra flexibility from multiple lanes.

---

The paper shows this stays stable at 27B scale and beats both a baseline and unconstrained Hyper-Connections on common benchmarks.

HC can hit about 3000x residual amplification, mHC keeps it around 1.6x.Image
This image compares 3 ways to build the shortcut path that carries information around a layer in a transformer.

The left panel is the normal residual connection, where the model adds the layer output back to the original input so training stays steady as depth grows.

The middle panel is Hyper-Connections, where the model keeps several parallel shortcut streams and learns how to mix them before the layer, around the layer, and after the layer, which can help quality but can also make the shortcut accidentally amplify or shrink signals when many layers stack.

The right panel is mHC, which keeps the same Hyper-Connections idea but forces those mixing steps to stay in a constrained safe shape every time, so the shortcut behaves like a controlled blend and stays stable at large scale.Image
What “hyper-connection” means here.

You widen the residual from size C to n×C, treat it as n streams, and learn 3 tiny mixing pieces per layer.

One mixes the residual streams with each other, this is the crucial one. One gathers from the streams into the layer. One writes results back to the streams.

The paper’s contribution is to keep the first one in the safe “doubly stochastic” set, so it mixes without amplifying.Image
Read 10 tweets
Dec 25, 2025
A MASSIVE 303 page study from the very best Chinese Labs.

The paper explains how code focused language models are built, trained, and turned into software agents that help run parts of development.

These models read natural language instructions, like a bug report or feature request, and try to output working code that matches the intent.

The authors first walk through the training pipeline, from collecting and cleaning large code datasets to pretraining, meaning letting the model absorb coding patterns at scale.

They then describe supervised fine tuning and reinforcement learning, which are extra training stages that reward the model for following instructions, passing tests, and avoiding obvious mistakes.

On top of these models, the paper surveys software engineering agents, which wrap a model in a loop that reads issues, plans steps, edits files, runs tests, and retries when things fail.

Across the survey, they point out gaps like handling huge repositories, keeping generated code secure, and evaluating agents reliably, and they share practical tricks that current teams can reuse.Image
Overview of the evolution of code large language models (Code-LLMs) and related ecosystems from 2021 to 2025. Image
Evolution of programming development and research landscapes in AI-powered code generation. Image
Read 17 tweets
Dec 3, 2025
A MASSIVE 303 page study from the very best Chinese Labs.

The paper explains how code focused language models are built, trained, and turned into software agents that help run parts of development.

These models read natural language instructions, like a bug report or feature request, and try to output working code that matches the intent.

The authors first walk through the training pipeline, from collecting and cleaning large code datasets to pretraining, meaning letting the model absorb coding patterns at scale.

They then describe supervised fine tuning and reinforcement learning, which are extra training stages that reward the model for following instructions, passing tests, and avoiding obvious mistakes.

On top of these models, the paper surveys software engineering agents, which wrap a model in a loop that reads issues, plans steps, edits files, runs tests, and retries when things fail.

Across the survey, they point out gaps like handling huge repositories, keeping generated code secure, and evaluating agents reliably, and they share practical tricks that current teams can reuse.Image
Overview of the evolution of code large language models (Code-LLMs) and related ecosystems from 2021 to 2025. Image
Evolution of programming development and research landscapes in AI-powered code generation. Image
Read 17 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(