Top universities from the US, UK, EU, China, Canada, Singapore, and Australia collaborated on this one.
It could completely change research paper writing.
They show that AI can already draft proposals, run experiments, and write papers.
The authors built aiXiv, a new open-access platform where AI and humans can submit, review, and revise research in a closed-loop system.
The system uses multiple AI reviewers, retrieval-augmented feedback, and defenses against prompt injection to ensure that papers actually improve after review.
And the process worked: AI-generated proposals and papers get much better after iterative review, with acceptance rates jumping from near 0% to 45% for proposals and from 10% to 70% for papers.
🧵 Read on 👇
🧵2/n. Across real experiments it hits 77% proposal ranking accuracy, 81% paper ranking accuracy, blocks prompt‑injection with up to 87.9% accuracy, and pushes post‑revision acceptance for papers from 10% to 70%.
🧵3/n. This diagram shows aiXiv’s closed-loop system where AI and humans submit work, get automated reviews, revise, and then publish once quality clears the bar.
It means the platform is not a simple preprint dump, it is a workflow that forces measurable improvement each cycle.
Review agents score novelty, soundness, clarity, and feasibility using retrieval so feedback is grounded, and a prompt-injection detector screens malicious instructions before any model reads the file.
If the revised version looks better in pairwise checks, it moves forward, then a panel of LLMs votes, and 3 of 5 accepts trigger publication.
So the figure is saying aiXiv operationalizes end-to-end research, from idea to accepted paper, with guardrails and iteration built in.
🧵4/n. 🚧 Why this is needed
LLMs can already draft proposals, run experiments, and write papers, but journals resist AI authors and preprints miss screening, so strong AI‑generated research has nowhere credible to land.
This platform targets that gap by pairing automated review with structured revision so content quality is tracked and improved, not just posted.
🧵5/n. ⚙️ The Core Concepts
An AI or human submits a proposal or paper, review agents score novelty, soundness, clarity, and feasibility, then return concrete fixes, the author revises, and the loop repeats until it clears the bar.
The loop is submission → automated review → revision → re‑evaluation → decision, which keeps pressure on actual improvements rather than one‑shot verdicts.
Each accepted item gets a DOI and explicit IP credit to the model developer and any initiating human, so attribution is clear from day one.
A public UI lets people like, comment, and discuss, which gives extra feedback signals to steer agent behavior.
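As a rough sketch, the loop could be wired up like this in Python; `review_agents`, `author`, and `pairwise_better` are hypothetical helpers standing in for the paper's actual agents, not its implementation:

```python
def aixiv_loop(submission, review_agents, author, pairwise_better, max_rounds=3):
    # Closed loop: automated review -> revision -> re-evaluation, repeated until
    # the revision stops beating the previous version or the round budget runs out.
    current = submission
    for _ in range(max_rounds):
        reviews = [agent.review(current)           # scores + concrete fixes on
                   for agent in review_agents]     # novelty, soundness, clarity, feasibility
        revised = author.revise(current, reviews)
        if not pairwise_better(revised, current):  # re-evaluation gate: keep only real improvements
            break
        current = revised
    return current                                 # then hand off to the acceptance vote
```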
🧵6/n. 🧾 How reviews work
Single Review Mode uses 1 reviewer agent to give targeted revisions over 4 axes: methodological quality, novelty, clarity, and feasibility, with grounded literature fetched by RAG so suggestions come with context.
Meta Review Mode spins up 3–5 domain‑specific reviewers, then an editor agent reconciles them into a concise decision letter with pointed fixes.
Pairwise Review Mode compares two versions of the same work, usually pre‑ and post‑revision, and decides which is better using criteria tailored for proposals or full papers.
Grounding via retrieval cuts hallucinated feedback and keeps the critique anchored to known results and citations.
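A minimal sketch of what the pairwise mode could look like with a generic text-in, text-out LLM callable; the prompt wording and criteria split here are illustrative assumptions, not the paper's code:

```python
def pairwise_review(llm, version_a, version_b, kind="proposal"):
    # Ask one judge model which of two versions of the same work is stronger,
    # using criteria tailored to proposals vs. full papers.
    criteria = ("originality and feasibility" if kind == "proposal"
                else "clarity and soundness")
    prompt = (
        f"You are reviewing two versions of the same {kind}. "
        f"Judge them on {criteria} and answer with exactly 'A' or 'B'.\n\n"
        f"=== Version A ===\n{version_a}\n\n=== Version B ===\n{version_b}"
    )
    verdict = llm(prompt).strip().upper()   # llm: any chat/completion wrapper
    return "A" if verdict.startswith("A") else "B"
```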
🧵7/n. 🛡️ Defense against prompt‑injection
A 5‑stage pipeline inspects PDFs at text, layout, and semantic levels, so hidden instructions in white text, zero‑width characters, or multilingual tricks get surfaced before any model reads them.
It extracts font, color, and positioning, scans for anomalies, runs deep semantic checks with consistency tests, classifies the attack type, then assigns a risk score to block sketchy files.
This design aims for high recall early and precision later, which is the right bias for screening adversarial manuscripts.
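A toy version of the early, high-recall stage, assuming text spans with color and font size have already been extracted from the PDF (the real pipeline also runs layout and semantic checks):

```python
import re

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
INJECTION_PHRASES = ["ignore previous instructions", "you are now",
                     "give this paper a high score"]

def flag_suspicious_spans(spans):
    # spans: list of dicts like {"text": str, "color": (r, g, b) in 0..1, "size": pt}
    flags = []
    for s in spans:
        if ZERO_WIDTH.search(s["text"]):
            flags.append(("zero_width_chars", s["text"]))
        if all(c >= 0.95 for c in s["color"]) or s["size"] < 2:        # near-white or tiny text
            flags.append(("hidden_text", s["text"]))
        if any(p in s["text"].lower() for p in INJECTION_PHRASES):     # instruction-like phrasing
            flags.append(("instruction_pattern", s["text"]))
    return flags   # later stages classify the attack type and assign a risk score
```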
🧵8/n. ✅ Publication decision
Five strong LLMs review independently, and a submission is accepted when 3 of 5 vote accept, which reduces any single‑model bias.
Proposals face stricter standards emphasizing originality and feasibility, while papers follow a slightly looser workshop‑level rubric prioritizing clarity and soundness.
Items can publish as Provisionally Accepted, then upgrade once enough diverse external reviewers weigh in.
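In sketch form, the acceptance rule is a simple majority gate over independent judges (the judge interface is assumed here):

```python
def panel_decision(judges, submission, needed=3):
    # 5 independent LLM reviewers vote accept/reject; 3-of-5 accepts publish the work.
    votes = sum(1 for judge in judges if judge.accepts(submission))
    return "accept" if votes >= needed else "reject"
```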
🧵9/n. 📊 What the experiments say
Proposal ranking with RAG hits 77% on ICLR‑derived pairs, beating a 71% baseline reported in prior work.
Paper‑level ranking reaches 81%, which is solid given long contexts and messy drafts.
Prompt‑injection detection scores 84.8% on synthetic adversarials and 87.9% on suspicious real samples.
After agents review and authors revise, >90% of proposals and papers are preferred over the originals, and with a short response letter that climbs toward ~100%.
Majority voting mirrors that lift, proposals jump from 0% to 45.2% accepted on average, and papers jump from 10% to 70%.
🧵10/n. 🔌 Interfaces and ecosystem
An API plus Model Context Protocol (MCP) lets heterogeneous agents plug in as authors, reviewers, and meta‑reviewers without glue code.
Accepted items get a DOI and explicit IP attribution, which matters for crediting both the human initiator and the model developer.
Community reactions, likes and comments, feed back as weak signals to help align agent behavior with evolving norms.
Paper Title: "aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists"
arxiv.org/abs/2508.15126
This is that original MIT report that said 95% of AI pilots fail and spooked investors across the US stock market.
The report says most companies are stuck because 95% of GenAI pilots produce zero ROI, while a small 5% win by using systems that learn, plug into real workflows, and improve with use.
Teams keep buying or building static tools that demo well but cannot remember context, adapt, or fit daily operations, and this report maps exactly how the few winners do it differently.
🧪 How they ran the study
They combined a review of 300+ public implementations with 52 structured interviews and 153 senior‑leader surveys conducted from January to June 2025, which gives the patterns below real footing.
🧵 Read on 👇
The big split they call the GenAI Divide is simple, 95% of organizations get nothing from GenAI pilots while a tiny 5% extract millions, and the driver is not the model itself but whether the system can learn, remember, and fit the workflow.
The steep drop from pilots to production for task-specific GenAI tools reveals the GenAI divide
The first method to achieve 99.9% on AIME 2025 with open-source models! 🤯
DeepConf uses a model’s own token confidence to keep only its strongest reasoning, hitting 99.9% on AIME 2025 with GPT-OSS-120B while cutting tokens by up to 84.7% compared to standard parallel thinking.
Most systems still lean on self-consistency with majority voting, which lifts accuracy but hits diminishing returns and burns a lot of tokens.
🧠 The key idea
DeepConf is a test-time method that scores the model’s reasoning locally for confidence, filters weak traces, and often improves accuracy with fewer tokens without any extra training or tuning.
🧱 Why majority voting hits a wall
Parallel thinking samples many chains and votes, accuracy grows slowly as samples rise so compute scales linearly and the benefit flattens, which is exactly the pain DeepConf targets.
🔎 The confidence signals
Token confidence is the negative mean log probability of the top k candidates at each step, which gives a direct signal of how sure the model is at that moment.
Group confidence averages token confidence over a sliding window so local dips are visible without noise from the whole trace.
Tail confidence averages the last chunk of tokens because the ending steps decide the final answer and are where good traces often slip.
Bottom 10% group confidence looks at the worst parts of a trace, which is a strong indicator that the overall reasoning is shaky.
Lowest group confidence picks the single weakest window along a trace, which turns out to be a clean gate for dropping that trace early.
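Here is a minimal NumPy sketch of those signals computed from the per-step top-k log-probabilities; the window and tail sizes are placeholders, not the paper's exact settings:

```python
import numpy as np

def token_confidence(topk_logprobs):
    # topk_logprobs: (T, k) log-probs of the top-k candidates at each generation step
    return -topk_logprobs.mean(axis=1)                      # higher = more confident

def group_confidence(tok_conf, window=128):
    # sliding-window mean so local dips stand out without whole-trace noise
    return np.array([tok_conf[max(0, t - window + 1): t + 1].mean()
                     for t in range(len(tok_conf))])

def tail_confidence(tok_conf, tail=256):
    return tok_conf[-tail:].mean()                          # ending steps decide the answer

def bottom10_group_confidence(grp_conf):
    k = max(1, int(0.1 * len(grp_conf)))
    return np.sort(grp_conf)[:k].mean()                     # average of the worst 10% of windows

def lowest_group_confidence(grp_conf):
    return grp_conf.min()                                   # single weakest window = drop gate
```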
✅ Bottom line
DeepConf is a plug-in test-time compression recipe that filters or halts weak reasoning in place, so teams get higher accuracy and a big token cut without retraining or new hyperparameters.
🧮 Offline mode, smarter voting
DeepConf ranks traces by a confidence score and does confidence-weighted majority voting after optionally keeping only the top 10% or the top 90% by confidence.
With 512 traces, GPT-OSS-120B reaches 99.9% on AIME 2025 using tail or lowest-group confidence with filtering, compared to 97.0% for plain voting and 91.8% for pass@1.
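A small sketch of that confidence-weighted vote with top-fraction filtering, assuming each trace already carries a final answer and a confidence score:

```python
from collections import defaultdict

def deepconf_offline_vote(traces, keep_frac=0.1):
    # traces: list of (answer, confidence); confidence = e.g. lowest group confidence
    kept = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = kept[: max(1, int(keep_frac * len(kept)))]   # keep only the most confident slice
    ballots = defaultdict(float)
    for answer, conf in kept:
        ballots[answer] += conf                         # weight each vote by its confidence
    return max(ballots, key=ballots.get)
```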
⚡ Online mode, early stop while generating
A short warmup of 16 traces sets a stopping threshold s from the confidence distribution for the current problem.
During live generation, a trace stops the moment its lowest group confidence falls below s, so weak lines of thought do not waste tokens.
An adaptive sampling loop adds traces until the consensus is high enough, or a set budget like 512 is reached.
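Roughly, the online loop could look like this; `generate_trace` is an assumed helper that returns (answer, lowest_group_confidence) and aborts mid-generation when given a stopping threshold:

```python
def deepconf_online(generate_trace, n_warmup=16, budget=512, consensus=0.95):
    # Warmup traces run unfiltered to calibrate the stopping bar for this problem.
    warmup = [generate_trace(stop_threshold=None) for _ in range(n_warmup)]
    confs = sorted(c for _, c in warmup)
    s = confs[len(confs) // 10]                        # low percentile as the bar (assumption)
    answers = [a for a, _ in warmup]
    for _ in range(budget - n_warmup):
        answer, _ = generate_trace(stop_threshold=s)   # trace halts if confidence dips below s
        if answer is not None:                         # only surviving traces get to vote
            answers.append(answer)
        top = max(set(answers), key=answers.count)
        if answers.count(top) / len(answers) >= consensus:
            break                                      # stop sampling once consensus is high
    return max(set(answers), key=answers.count)
```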
In short, package stable context up front, give exact instructions and examples, restate the current ask, let the model reason, and demand a strict output format.
🧵 Read on 👇
🧵2/n Start with task context. Tell the model who it is, what domain it is in, and what outcome matters. In the demo, the first try misread the images as a skiing incident. Adding “you are assisting a Swedish car-insurance claims adjuster” fixed that because it anchored the model in the right world and goal.
🧵3/n Add tone context. Specify how to behave, for example “be factual, be confident only when evidence is clear, say you are unsure if you cannot tell.” This reduces guessing and aligns the model’s attitude with the task. The presenters explicitly ask the model not to invent details and to avoid a verdict unless it is sure.
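A hedged template of that ordering as a Python string constant; the section numbering, wording, and claim ID are illustrative, not the presenters' exact prompt:

```python
PROMPT_TEMPLATE = """\
# 1. Task context (stable, reusable across requests)
You are assisting a Swedish car-insurance claims adjuster reviewing accident photos.

# 2. Tone context
Be factual. Only state conclusions the evidence clearly supports, and say you are unsure otherwise.

# 3. Detailed instructions and examples
Describe the visible damage and its likely cause, e.g. "rear bumper dented, consistent with a low-speed rear-end collision".

# 4. Restate the current ask
For claim {claim_id}, decide whether the attached photos match the incident described in the claim.

# 5. Reasoning, then strict output format
Think step by step first, then answer with three labeled fields: verdict, confidence, notes.
"""

# usage (claim ID is hypothetical): PROMPT_TEMPLATE.format(claim_id="SE-2025-0142")
```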
A small Qwen2.5 model is fine-tuned to think over retrieved documents, so a single lean setup can answer domain questions on resource-constrained local hardware.
Using summarised NHS pages, retrieval hits the right condition among top‑5 in 76% of queries, and the fine‑tuned model predicts the exact condition correctly 56% of the time, close to larger frontier models.
The whole pipeline is built for private deployments, so teams can run it without sending data to external APIs.
🔒 The problem they tackle
Many teams cannot ship prompts or data outside their network, especially in health and government, so cloud LLM endpoints are off the table.
They aim for a single lean model that can read retrieved evidence and reason over it, all running locally, so answers stay grounded and private.
The target setting is messy queries over a closed corpus, where retrieval constrains facts and the reasoning step interprets symptoms and next actions.
🧩 The pipeline in this paper
The system indexes a corpus, retrieves the most relevant pieces for each query, then generates an answer that reasons over those pieces.
They use a classic retriever plus generator design, with retrieval first then reasoning, which fits decision tasks better than free‑form answering.
The chat flow lets a conversational agent decide when to call retrieval, then passes the retrieved context to the reasoning model to produce the answer.
🧵 Read on 👇
🧲 The retriever at work
Documents are split into overlapping chunks and embedded with a sentence transformer, then stored in a vector database for fast similarity search.
They use sentence-transformers all‑mpnet‑base‑v2, which maps text into a 768‑dimensional space with a max sequence of 384 tokens, and a Chroma store with L2 similarity.
If any chunk from a document makes the top‑k, the pipeline feeds the full original document to the LLM, so the model sees full context around the hit.
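A compact sketch of that retriever with sentence-transformers and Chroma; the chunk sizes and collection name are assumptions, while the model name and L2 similarity come from the paper:

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")        # 768-dim vectors
client = chromadb.Client()
col = client.create_collection(name="nhs_pages", metadata={"hnsw:space": "l2"})  # L2 similarity

def index_corpus(docs, chunk_words=200, overlap=40):
    # docs: {doc_id: full_text}; overlapping word-level chunks, each embedded and stored
    for doc_id, text in docs.items():
        words = text.split()
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words - overlap)]
        col.add(ids=[f"{doc_id}-{j}" for j in range(len(chunks))],
                documents=chunks,
                embeddings=embedder.encode(chunks).tolist(),
                metadatas=[{"doc_id": doc_id} for _ in chunks])

def retrieve_full_docs(query, docs, k=5):
    hits = col.query(query_embeddings=embedder.encode([query]).tolist(), n_results=k)
    # if any chunk of a document lands in the top-k, hand the *full* document to the LLM
    doc_ids = {m["doc_id"] for m in hits["metadatas"][0]}
    return [docs[d] for d in doc_ids]
```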
The image below shows the whole training loop for their lean, retrieval-augmented reasoning setup.
It starts with a private knowledge base of about 1,000 NHS condition pages. GPT-4o generates about 2,000 synthetic patient queries from those pages, so they have realistic questions tied to known answers.
For each query, a retriever pulls the top 5 likely documents. DeepSeek-R1 reads those documents and the query, then produces a final label plus a step-by-step reasoning trace. That bundle becomes one training example.
They then fine-tune Qwen-32B-Instruct on this data and distill it into a smaller t0-1 reasoning model. The result is a compact model that learns to reason over retrieved evidence from the approved corpus, so it can run locally and stay grounded.
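In sketch form, each training example bundles the synthetic query, the retrieved evidence, and the teacher's reasoning; the helper names and output keys here are assumptions:

```python
def build_training_example(query, retriever, teacher_llm, k=5):
    # retriever returns the top-k candidate documents for the synthetic patient query;
    # the teacher (DeepSeek-R1 in the paper) reads them and emits a reasoning trace
    # plus a final condition label, which becomes the fine-tuning target.
    evidence = retriever(query, k=k)
    teacher_out = teacher_llm(query=query, documents=evidence)  # {"reasoning": ..., "label": ...}
    return {
        "prompt": {"query": query, "documents": evidence},
        "completion": teacher_out["reasoning"] + "\nAnswer: " + teacher_out["label"],
    }
```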
Absolutely beautiful and exhaustive 82-page survey paper on Efficient Architectures for Large Language Models.
It maps the ways to make LLMs cheaper, longer-context, and near real-time.
Transformers compare every token with every other token, so if text is 2x longer, the work is about 4x. That burns memory because past keys and values are stored for every attention head, and it drags latency during long chats or reasoning loops.
The survey groups fixes into 4 buckets. Linear sequence models redo the math so cost grows with length, not length squared.
They include linear attention, recurrent networks that carry a small state, and state space models like Mamba, which track history with a running summary, so no big cache.
Sparse attention keeps the Transformer idea but only connects important pairs. Most tokens look locally, a few tokens act as global anchors, and some methods route tokens to the right places. You get large savings without throwing away core behavior.
Efficient full attention keeps exact attention but makes it hardware friendly. Input output aware kernels such as FlashAttention cut reads and writes, and multi-query or grouped-query attention lets many heads share 1 key-value set, cutting cache and bandwidth.
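For the grouped-query idea, a minimal PyTorch sketch (shapes only, no masking or positional encodings); this illustrates the KV sharing, not any specific model's implementation:

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (B, H, T, D) query heads; k, v: (B, G, T, D) shared key/value groups, H % G == 0
    B, H, T, D = q.shape
    G = k.shape[1]
    k = k.repeat_interleave(H // G, dim=1)           # each KV group serves H/G query heads
    v = v.repeat_interleave(H // G, dim=1)
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5    # (B, H, T, T) attention logits
    return scores.softmax(dim=-1) @ v                # (B, H, T, D) attended values
```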
Sparse Mixture of Experts adds conditional compute. Only a few experts run per token, so capacity grows without paying full cost each step, and memory tricks compress, quantize, or prune the cache to stretch context.
The theme is simple, move less data. Methods that cut memory traffic tend to win on modern GPUs, which enables longer context, faster training, and lower serving cost.
This figure is a roadmap of how to make LLMs faster and cheaper from input tokens to output tokens.
The center shows Efficient Sequence Modeling. One path makes sequence cost scale linearly using things like linear attention, linear recurrent networks, and state space models, plus test-time-training variants and unified linear sequence models.
Another path saves work by using sparse attention so the model only looks at the most useful token pairs.
A third path keeps full attention but makes it cheaper with input-output aware scheduling, grouped attention, mixtures of different attention types, and quantization.
Below that sits Sparse Mixture-of-Experts. The model grows capacity by keeping many experts but routes each token to only a few, so compute per token stays low. Different routing rules, expert designs, and conversion tricks live here.
To the right are Hybrid Architectures. These mix building blocks across layers or inside a layer to hit better speed and accuracy tradeoffs.
Next is Diffusion LLM. This family targets non-autoregressive generation so many tokens can be produced in parallel, with methods to connect back to standard autoregressive decoding and to extend into multimodal settings.
The final column highlights reach beyond text, showing where these efficiency ideas apply to vision, audio, and multimodal tasks.
How can we break through the Transformer’s efficiency ceiling? Is costly "intelligence" our only path forward?
LEANN: The Tiniest Vector Database that Democratizes Personal AI with Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index
Researchers from UC Berkeley, CUHK, Amazon Web Services, and UC Davis have developed LEANN, a storage-efficient ANN search index optimized for resource-limited personal devices.
RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast, accurate, and 100% private RAG application on your personal device.