Latest Twitter Threads by @rohanpaul_ai on Thread Reader App

Oct 15 • 7 tweets • 5 min read

BIG success for LLMs in financial trading & decision making.

New Stanford + Univ California study proves a 4B financial-domain model, Trading-R1, writes clear analyst theses and turns them into profitable trades.

It's trained on 100K cases over 18 months across 14 tickers, and its backtests show better risk-adjusted returns with smaller drawdowns.

The problem it tackles is simple: quant models are hard to read, and general LLMs write nice text that does not translate into disciplined trades.

The solution starts by forcing a strict thesis format, with separate sections for market data, fundamentals, and sentiment, and every claim must point to evidence from the given context.

Then it learns decisions by mapping outcomes into 5 labels, strong buy, buy, hold, sell, strong sell, using returns that are normalized by volatility over several horizons.
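A minimal sketch of that labeling step, to make the idea concrete. The horizons, the volatility window, the square-root scaling, and the threshold values are all assumptions for illustration, not the paper's settings.

```python
import statistics

LABELS = ["strong sell", "sell", "hold", "buy", "strong buy"]

def vol_normalized_score(prices, t, horizons=(3, 7, 15), vol_window=20):
    """Average forward return over several horizons, each divided by recent volatility.

    Requires t >= vol_window and t + max(horizons) < len(prices).
    """
    # Daily returns over the trailing window ending at day t.
    daily = [prices[i] / prices[i - 1] - 1.0 for i in range(t - vol_window + 1, t + 1)]
    vol = statistics.pstdev(daily)
    if vol == 0:
        return 0.0  # no measurable volatility: treat as neutral
    scores = []
    for h in horizons:
        forward_return = prices[t + h] / prices[t] - 1.0
        # Assumed scaling: daily volatility grows with the square root of the horizon.
        scores.append(forward_return / (vol * h ** 0.5))
    return sum(scores) / len(scores)

def to_label(score, cuts=(-1.0, -0.3, 0.3, 1.0)):
    """Map a score to one of 5 labels using hypothetical symmetric thresholds."""
    for label, cut in zip(LABELS, cuts):
        if score < cut:
            return label
    return LABELS[-1]
```

Dividing by volatility means a 2% move in a calm stock can earn a stronger label than a 2% move in a jumpy one, which is the point of normalizing.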

For training, it first copies high-quality reasoning distilled from stronger black-box models using supervised fine-tuning, then it improves with a reinforcement method called group relative policy optimization.

In held-out tests on NVDA, AAPL, AMZN, META, MSFT, and SPY, the combined approach beats small and large baselines on Sharpe and max drawdown, and the authors position it as research support, not high-frequency automation.

🧵 Read on 👇

🧵2/n. The 3 steps used to train Trading-R1.

The first step is Structure. The model is taught how to write a thesis in a clear format. It must separate parts like market trends, company fundamentals, and sentiment, and it has to place each claim in the right section.

The second step is Claims. Here the model learns that any claim it makes must be supported by evidence. For example, if it says revenue is growing, it must back that with a source or number provided in the context.

The third step is Decision. The model turns the structured thesis into an actual trading action. It predicts outcomes like strong buy, buy, hold, sell, or strong sell. Its prediction is checked against the true outcome, and it gets rewards or penalties depending on accuracy.

Each step first uses supervised fine-tuning, which means training on examples with correct answers, and then reinforcement fine-tuning, which means refining the model by giving rewards when it produces better outputs.

Finally, all stages are combined, producing Trading-R1, a model that can both write well-structured financial reasoning and map that reasoning into actual trading decisions.

Oct 12 • 11 tweets • 5 min read

"The Impact of Artificial Intelligence on Human Thought"

A big 132 page report.

AI is shifting real thinking work onto external systems, which boosts convenience but can weaken the effort that builds understanding and judgment.

The paper frames this pattern through cognitive offloading and cognitive load theory, then tracks it into social effects like standardized language, biased information flows, and manipulation tactics that target human psychology.

It says use AI to cut noise and routine steps, keep humans doing the heavy mental lifting, and add controls because personalization, deepfakes, and opaque models can steer choices at scale.

🧵 Read on 👇

🧵2/n. ⚙️ The Core Concepts

Cognitive load theory says working memory is limited, so AI helps when it reduces extraneous load and hurts when it replaces the germane load needed to build skill.

In plain terms, let tools clean up the interface and fetch data, but keep people doing the analysis, explanation, and sense‑making.

Oct 10 • 4 tweets • 2 min read

Rude prompts to LLMs consistently lead to better results than polite ones 🤯

The authors found that very polite and polite tones reduced accuracy, while neutral, rude, and very rude tones improved it.

Statistical tests confirmed that the differences were significant, not random, across repeated runs.

The top score reported was 84.8% for very rude prompts and the lowest was 80.8% for very polite.

They compared their results with earlier studies and noted that older models (like GPT-3.5 and Llama-2) behaved differently, but GPT-4-based models like ChatGPT-4o show this clear reversal where harsh tone works better.

----

Paper – arxiv.org/abs/2510.04950

Paper Title: "Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)"

Average accuracy and range across 10 runs for five different tones

Oct 9 • 12 tweets • 7 min read

This is one of THE BRILLIANT papers with a BIG claim. 👏

Giving an LLM just 78 carefully chosen, full workflow examples makes it perform better at real agent tasks than training it with 10,000 synthetic samples.

"Dramatically outperforms SOTA models: Kimi-K2-Instruct, DeepSeek-V3.1, Qwen3-235B-A22B-Instruct and GLM-4.5. " on AgencyBench (LIMI at 73.5%)

The big deal is that quality and completeness of examples matter way more than raw data scale when teaching models how to act like agents instead of just talk.

They name the Agency Efficiency Principle, which says useful autonomy comes from a few high quality demonstrations of full workflows, not from raw scale.

The core message is strategic curation over scale for agents that plan, use tools, and finish work.

🧵 Read on 👇

🧵2/n. In summary, how LIMI (Less Is More for Intelligent Agency) can score so high with just 78 examples.

1. Each example is very dense
Instead of short one-line prompts, each example is a full workflow. It contains planning steps, tool calls, human feedback, corrections, and the final solution. That means 1 example teaches the model dozens of small but connected behaviors.

2. The tasks are carefully chosen
They didn’t just collect random problems. They picked tasks from real coding and research workflows that force the model to show agency: breaking down problems, tracking state, and fixing mistakes. These skills generalize to many other tasks.

3. Complete trajectories, not fragments
The dataset logs the entire process from the first thought to the final answer. This is like showing the model not only the answer key but the full worked-out solution, so it can copy the reasoning pattern, not just the result.

4. Less noise, more signal
Large datasets often have lots of filler or synthetic tasks that don’t push real agent skills. LIMI avoids that by strict quality control, so almost every token in the dataset contributes to useful learning.

5. Scale of information per token
Because each trajectory is huge (tens of thousands of tokens), the model effectively sees way more “learning signal” than the raw count of 78 samples suggests. The richness of one trajectory can outweigh hundreds of shallow synthetic prompts.

Oct 5 • 9 tweets • 5 min read

Absolutely classic @GoogleResearch paper on In-Context-Learning by LLMs.

Shows the mechanisms of how LLMs learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change.

💡The mechanism they reveal for in-context-learning.

When the model reads a few examples in your prompt, it figures out a pattern (like a small rule or function). Instead of permanently changing its stored weights, it forms a temporary adjustment that captures this pattern. That adjustment can be written mathematically as a rank-1 matrix, meaning it only adds one simple direction of change to the existing weights.

This rank-1 update is “low-rank”, so it is very cheap and compact. But it still lets the model shift its behavior to fit the examples in the prompt. Once the prompt is gone, that temporary rank-1 tweak also disappears.

So, in simple terms:
The paper shows that in-context learning happens because the model internally applies a temporary rank-1 (very simple) weight update based on your examples, instead of permanently retraining itself.

---

That behavior looks impossible if learning always means gradient descent.

The authors ask whether the transformer’s own math hides an update inside the forward pass.

They show that each prompt token writes a rank 1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a 1‑step finetune.

Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt.

---

Shows that the attention part can take what it found in your prompt and package it into a tiny “instruction” that, for this 1 forward pass, acts exactly like a small temporary change to the MLP’s weights.

Nothing is saved to disk, yet the block behaves as if the MLP just got a low-rank tweak computed from your examples. Remove the prompt, the tweak disappears, the saved weights stay the same.

As the model reads your examples token by token, it keeps refining that temporary tweak. Each new token nudges the MLP a bit more toward the rule implied by your examples, similar to taking small gradient steps, again only for this pass.

When the examples have done their job, those nudges shrink toward 0, which is what you want when the pattern has been “locked in” for the current answer.

🧵 Read on 👇

🧵2/n. ⚙️ The Core Idea

They call any layer that can read a separate context plus a query a “contextual layer”.

Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.

For that block, the context acts exactly like a rank 1 additive patch on the first weight matrix, no matter what shape the attention takes.
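A small numeric check of that claim, as a sketch. It assumes the first MLP layer is a plain matrix W, and it uses random vectors as stand-ins for the attention output with and without the context. The patch formula below is one construction that makes the two computations agree exactly; the variable names are illustrative, not the paper's notation.

```python
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rank1_patch(W, a_query, a_context):
    """Build dW = (W @ delta) a_query^T / ||a_query||^2, where delta = a_context - a_query.

    Every row of dW is a multiple of a_query, so dW has rank 1.
    """
    delta = [c - q for c, q in zip(a_context, a_query)]
    w_delta = matvec(W, delta)
    norm_sq = sum(q * q for q in a_query)
    return [[wd * q / norm_sq for q in a_query] for wd in w_delta]

random.seed(0)
d_in, d_out = 4, 3
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a_query = [random.gauss(0, 1) for _ in range(d_in)]    # stand-in: attention output for the query alone
a_context = [random.gauss(0, 1) for _ in range(d_in)]  # stand-in: attention output with the prompt examples

dW = rank1_patch(W, a_query, a_context)
W_patched = [[w + d for w, d in zip(row_w, row_d)] for row_w, row_d in zip(W, dW)]

with_context = matvec(W, a_context)      # frozen weights, context present
with_patch = matvec(W_patched, a_query)  # patched weights, context removed
```

The two result vectors match to floating-point precision, and dW is rebuilt from scratch for every prompt, which is why nothing persists once the prompt is gone.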

Oct 3 • 5 tweets • 4 min read

🚫 This @Microsoft paper brings really bad news for medical AI models. Exposes some serious flaws.

AI models just aren’t ready yet for reliable medical reasoning. 🤯

The paper finds that medical AI models pass tests by exploiting patterns in the data, not by actually combining medical text with images in a reliable way.

While medical AI models look good on benchmarks, in reality they cannot handle real medical reasoning.

The key findings are that models overuse shortcuts, break under small changes, and produce unfaithful reasoning.

This makes the medical AI model's benchmark results misleading if someone assumes a high score means the model is ready for real medical use.

---

The specific key findings from this paper 👇

- Models keep strong accuracy even when images are removed, even on questions that require vision, which signals shortcut use over real understanding.

- Scores stay above the 20% guess rate without images, so text patterns alone often drive the answers.

- Shuffling answer order changes predictions a lot, which exposes position and format bias rather than robust reasoning.

- Replacing a distractor with “Unknown” does not stop many models from guessing, instead of abstaining when evidence is missing.

- Swapping in a lookalike image that matches a wrong option makes accuracy collapse, which shows vision is not integrated with text.

- Chain of thought often sounds confident while citing features that are not present, which means the explanations are unfaithful.

- Audits reveal 3 failure modes, incorrect logic with correct answers, hallucinated perception, and visual reasoning with faulty grounding.

- Gains on popular visual question answering do not transfer to report generation, which is closer to real clinical work.

- Clinician reviews show benchmarks measure very different skills, so a single leaderboard number misleads on readiness.

- Once shortcut strategies are disrupted, true comprehension is far weaker than the headline scores suggest.

- Most models refuse to abstain without the image, which is unsafe behavior for medical use.

- The authors push for a robustness score and explicit reasoning audits, which signals current evaluations are not enough.
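To make the answer-shuffling stress test from the list above concrete, here is a minimal sketch. The `predict(question, options)` interface and the two toy predictors are hypothetical stand-ins, not the paper's evaluation harness.

```python
import random

def shuffle_consistency(predict, items, trials=20, seed=0):
    """Fraction of shuffled presentations where the model picks the same option text.

    `predict(question, options)` returns the index of the chosen option.
    """
    rng = random.Random(seed)
    same, total = 0, 0
    for question, options in items:
        baseline = options[predict(question, options)]
        for _ in range(trials):
            shuffled = options[:]
            rng.shuffle(shuffled)
            same += shuffled[predict(question, shuffled)] == baseline
            total += 1
    return same / total

def position_biased(question, options):
    """Toy model with pure position bias: always picks the first option."""
    return 0

def content_based(question, options):
    """Toy model whose choice depends only on option content, not on order."""
    return options.index(min(options))
```

A model that reads the content scores 1.0 here, while a position-biased one drifts toward the 20% guess rate on five-option questions. The same loop shape works for the image-removal test: call the model with and without the image and compare accuracy.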

🧵 Read on 👇

🧵2/n. The below figure tells us that high scores on medical benchmarks can mislead, because stress tests reveal that current models often rely on shallow tricks and cannot be trusted for reliable medical reasoning.

The first part highlights 3 hidden fragilities: hallucinated perception, shortcut behavior, and faulty reasoning.

The second part compares benchmark accuracy with robustness scores, and while accuracy looks high, robustness drops sharply, which means models get brittle under small changes.

The heatmap shows how stress tests like removing images, shuffling answers, or replacing distractors reveal specific failure patterns in each model.

The example at the bottom shows that a model can still give the right answer even without seeing the image, which is a shortcut, or it can make up a detailed explanation that mentions things not actually in the image, which is fabricated reasoning.

Sep 28 • 8 tweets • 5 min read

🔥 Meta reveals a massive inefficiency in AI’s reasoning process and gives a solution.

Large language models keep redoing the same work inside long chains of thought.

For example, when adding fractions with different denominators, the model often re-explains finding a common denominator step by step instead of just using a common denominator behavior.

In quadratic equations, it re-explains the discriminant logic or completes the square again instead of calling a solve quadratic behavior.

In unit conversion, it spells out inches to centimeters again instead of applying a unit conversion behavior.

🛑 The problem with this approach is that when the model re-explains a routine, it spends many tokens on boilerplate steps that are identical across problems, which is wasted budget.

So this paper teaches the model to compress those recurring steps into small named behaviors that it can recall later or even learn into its weights.

A behavior compresses that routine into a short name plus instruction like a tiny macro that the model can reference.

At inference, a small list of relevant behaviors is given to the model, or is already internalized by training, so the model can say which behavior it is using and skip the long re-derivation.

Because it points to a named behavior, the output needs fewer tokens, and the saved tokens go to the new parts of the question.

Behavior-conditioned fine-tuning teaches the weights to trigger those routines from the question alone, so even without retrieval the model tends to use the right shortcut.

Compute shifts from many output tokens to a few input hints and weight activations which is cheaper in most serving stacks and usually faster too.

Accuracy can improve because the model follows a tested routine instead of improvising a fresh multi step derivation that may drift.

🧵 Read on 👇

🧵2/n. ⚙️ The Core Concepts

A behavior is a short name plus instruction for a reusable move, like inclusion exclusion or turning words into equations.

The behavior handbook is a store of how-to steps, which is procedural memory, unlike RAG that stores facts.

The authors frame the goal as remember how to reason, not just what to conclude, which matches the engineer’s point that remembering how to think beats thinking longer.
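A toy version of the handbook idea, as a sketch. The behavior names, the keyword-based retrieval, and the prompt wording are all made up for illustration; the paper's retrieval and prompt format may differ.

```python
# Hypothetical handbook: behavior name -> short instruction.
HANDBOOK = {
    "common_denominator": "Rewrite fractions over the least common multiple of the denominators, then add numerators.",
    "solve_quadratic": "Use the discriminant b^2 - 4ac and the quadratic formula.",
    "unit_conversion": "Multiply by the conversion factor and carry the units through.",
}

# Hypothetical retrieval keys; a real system would likely use embeddings.
KEYWORDS = {
    "common_denominator": ("fraction", "denominator"),
    "solve_quadratic": ("quadratic", "x^2"),
    "unit_conversion": ("inches", "centimeters", "convert"),
}

def retrieve_behaviors(question):
    """Return the names of behaviors whose keywords appear in the question."""
    q = question.lower()
    return [name for name, words in KEYWORDS.items() if any(w in q for w in words)]

def build_prompt(question):
    """Prepend the relevant behaviors so the model can cite them by name."""
    names = retrieve_behaviors(question)
    lines = [f"- {name}: {HANDBOOK[name]}" for name in names]
    hints = "\n".join(lines) if lines else "- (none)"
    return f"Behaviors you may cite by name:\n{hints}\n\nQuestion: {question}\nAnswer:"
```

The cost moves to a few extra input tokens per question, and the output can say "using unit_conversion" instead of re-deriving the routine.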

Sep 28 • 16 tweets • 7 min read

One of the best papers of the past week.

The big takeaway: scaling up model size doesn’t just make models smarter in terms of knowledge, it makes them last longer on multi-step tasks, which is what really matters for agents.

Shows that small models can usually do one step perfectly, but when you ask them to keep going for many steps, they fall apart quickly.

Even if they never miss on the first step, their accuracy drops fast as the task gets longer.

Large models, on the other hand, stay reliable across many more steps, even though the basic task itself doesn’t require extra knowledge or reasoning.

The paper says this is not because big models "know more," but because they are better at consistently executing without drifting into errors.

The paper names a failure mode called self-conditioning, where seeing earlier mistakes causes more mistakes, and they show that with thinking steps GPT-5 runs 1000+ steps in one go while others are far lower.

🧵 Read on 👇

🧵2/n. 🧠 The idea

The work separates planning from execution, then shows that even when the plan and the needed knowledge are handed to the model, reliability drops as the task gets longer, which makes small accuracy gains suddenly matter a lot.
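A quick back-of-the-envelope calculation shows why small accuracy gains matter so much. It assumes every step succeeds independently with the same probability, which is a simplification: the self-conditioning effect means real models do worse than this once errors appear.

```python
import math

def task_success(step_accuracy, steps):
    """Probability of finishing `steps` steps with no error, assuming independent steps."""
    return step_accuracy ** steps

def horizon_length(step_accuracy, target=0.5):
    """Longest task length whose success probability is still at least `target`."""
    return math.floor(math.log(target) / math.log(step_accuracy))

for p in (0.99, 0.995, 0.999):
    print(p, horizon_length(p), round(task_success(p, 100), 3))
```

Under this simple model, going from 99% to 99.9% per-step accuracy stretches the 50%-success horizon from 68 steps to 692 steps, roughly a 10x longer task for a gain that looks tiny on a single-step benchmark.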

Sep 25 • 13 tweets • 10 min read

🚨 BAD news for Medical AI models.

MASSIVE revelations from this @Microsoft paper.

🤯 Current medical AI models may look good on standard medical benchmarks but those scores do not mean the models can handle real medical reasoning.

The key point is that many models pass tests by exploiting patterns in the data, not by actually combining medical text with images in a reliable way.

The key findings are that models overuse shortcuts, break under small changes, and produce unfaithful reasoning.

This makes the medical AI model's benchmark results misleading if someone assumes a high score means the model is ready for real medical use.

---

The specific key findings from this paper 👇

- Models keep strong accuracy even when images are removed, even on questions that require vision, which signals shortcut use over real understanding.

- Scores stay above the 20% guess rate without images, so text patterns alone often drive the answers.

- Shuffling answer order changes predictions a lot, which exposes position and format bias rather than robust reasoning.

- Replacing a distractor with “Unknown” does not stop many models from guessing, instead of abstaining when evidence is missing.

- Swapping in a lookalike image that matches a wrong option makes accuracy collapse, which shows vision is not integrated with text.

- Chain of thought often sounds confident while citing features that are not present, which means the explanations are unfaithful.

- Audits reveal 3 failure modes, incorrect logic with correct answers, hallucinated perception, and visual reasoning with faulty grounding.

- Gains on popular visual question answering do not transfer to report generation, which is closer to real clinical work.

- Clinician reviews show benchmarks measure very different skills, so a single leaderboard number misleads on readiness.

- Once shortcut strategies are disrupted, true comprehension is far weaker than the headline scores suggest.

- Most models refuse to abstain without the image, which is unsafe behavior for medical use.

- The authors push for a robustness score and explicit reasoning audits, which signals current evaluations are not enough.

🧵 Read on 👇

Sep 24 • 6 tweets • 4 min read

This is such a brilliant paper.

If this spreads, new research won’t just be something you read, it’ll be something you can use immediately.

It will lower barriers, save huge amounts of time, and could make science much more reliable and connected.

Normally, a paper is just a PDF plus maybe some code, and if you want to use it you have to install dependencies, debug environments, and figure out parameters. That’s hard and often stops people from ever using the method.

Paper2Agent skips all that. It automatically converts a paper into an interactive AI agent. You can talk to it in plain language, and it will actually run the real code, with the right data and setup, and give you results. No setup or manual fixing needed.

They proved it works on heavy-duty cases like AlphaGenome, TISSUE, and Scanpy, and the agents reproduced the original paper’s results with 100% accuracy, even on brand new queries.

⚙️ The Core Concepts

The framework represents each paper as an MCP server that bundles executable tools, static resources, and step-by-step prompts, then any LLM agent can call those tools with plain language to run the paper’s method.

This shifts research output from a passive document to an interactive system that demonstrates, applies, and adapts the paper’s ideas on demand.

It automates environment setup, extracts tools from the repo and tutorials, and tests them until outputs match the originals.

🧵 Read on 👇

🧵2/n. 🧩 Why MCP

Model Context Protocol gives a standard way to expose functions and data with clear inputs and outputs, so agents can call them reliably without custom glue code.

Paper2Agent uses this to encode datasets, code paths, and multi step workflows so the paper becomes addressable, composable, and easy to query.
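To illustrate the "functions with clear inputs and outputs" idea, here is a plain-Python toy. It does not use the MCP SDK or the MCP wire format; the registry, the request shape, and the example tool are all hypothetical.

```python
import json

TOOLS = {}

def tool(description, params):
    """Register a function with a description and a simple parameter schema."""
    def wrap(fn):
        TOOLS[fn.__name__] = {"fn": fn, "description": description, "params": params}
        return fn
    return wrap

@tool("Normalize counts so they sum to 1.", {"counts": "list of numbers"})
def normalize_counts(counts):
    total = sum(counts)
    return [c / total for c in counts]

def list_tools():
    """What an agent would read to decide which tool to call."""
    return json.dumps(
        {name: {"description": t["description"], "params": t["params"]} for name, t in TOOLS.items()},
        indent=2,
    )

def call_tool(request_json):
    """Dispatch a JSON request of the form {"tool": ..., "arguments": {...}}."""
    request = json.loads(request_json)
    return TOOLS[request["tool"]]["fn"](**request["arguments"])
```

The agent only ever sees the tool list and sends structured requests, so the paper's code can sit behind the interface without custom glue per paper.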

Sep 24 • 12 tweets • 7 min read

Brilliant @GoogleDeepMind paper, a major advancement in embedding-based search.

Most regular search systems return only exact or near matches. They miss the farther-up categories that still matter: the bigger parent categories your query belongs to.

This paper’s simple 2-step training pulls those in reliably, lifting far-away match accuracy from 19% to 76%.

That is, long-distance recall jumps from 19% to 76% on WordNet with low-dimensional embeddings.

Long-distance recall is the % of the far-away relevant items that your search actually returns.

“Far-away” means items several steps up the category tree from your query, like “Footwear” for “Kid’s sandals”.

You compute it by looking only at those distant ancestors, counting how many should be returned, counting how many you actually returned, then doing hits divided by should-have.

If there are 10 such ancestors and your system returns 7, long-distance recall is 70%.
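The same calculation as a short sketch. The cutoff for what counts as "far" (`min_distance=2` here) is an assumption; the right cutoff depends on the hierarchy.

```python
def long_distance_recall(ancestors_by_distance, retrieved, min_distance=2):
    """Recall restricted to ancestors at least `min_distance` steps above the query.

    `ancestors_by_distance` maps each relevant ancestor to its number of steps from the query.
    Returns None when the query has no far ancestors.
    """
    far = {node for node, dist in ancestors_by_distance.items() if dist >= min_distance}
    if not far:
        return None
    hits = far & set(retrieved)
    return len(hits) / len(far)
```

With 10 far ancestors and 7 of them in the returned list, the function gives 0.7, matching the example above. Near ancestors and unrelated results in the returned list do not affect the number.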

⚙️ The Core Concepts

Hierarchical retrieval expects a query to bring back its own node and all more general ancestors, which is asymmetric, so the same concept must embed differently on the query side and the document side.

Euclidean geometry creates tension across queries, yet an asymmetric dual-encoder can resolve it with careful scoring.

The running example is “Kid’s sandals,” where “Sandals” is relevant to that query, but the reverse is not, which motivates asymmetric scoring.

🧠 The idea of this paper.

Dual encoders can solve hierarchical retrieval when query and document embeddings are asymmetric and the needed dimension grows gently with hierarchy depth and log of catalog size.

A simple schedule, pretrain on regular pairs then finetune on long-distance pairs, fixes misses on far ancestors without hurting close matches.

🧵 Read on 👇

🧵2/n. 🧩 Quick outline

The task is to retrieve the exact node plus all more general ancestors, so relevance is one-way.

They formalize the setup, train with a softmax loss over in-batch negatives, and score by recall.

They prove feasible Euclidean embeddings exist with a dimension that scales with depth and log of size.

Synthetic trees show learned encoders work at much smaller dimensions than the constructive bound.

Tiny dimensions fail mainly on far ancestors, the “lost-in-the-long-distance” effect.

Up-weighting far pairs hurts near pairs, so rebalancing alone fails.

Pretrain then finetune only on far pairs lifts all distance slices with early stopping to protect near pairs.

On WordNet and Amazon ESCI, the recipe gives clear recall gains across slices.
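For readers who have not seen a softmax loss over in-batch negatives, a minimal sketch follows. The two vector lists stand for the outputs of separate query-side and document-side encoders (the asymmetry the thread describes), and dot-product scoring is an assumption made here for simplicity.

```python
import math

def in_batch_softmax_loss(query_vecs, doc_vecs):
    """Mean cross-entropy where doc i is the positive for query i and the other docs are negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)
```

In the 2-step schedule, the same loss would be used twice: first on batches of regular (query, relevant document) pairs, then on batches made only of far-ancestor pairs, with early stopping so near matches are not hurt.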

Sep 23 • 9 tweets • 5 min read

Tencent just took a big step beyond GRPO by introducing Single-stream Policy Optimization (SPO)

An approach that fixes GRPO’s wasted compute from degenerate groups and its constant group synchronization stalls, making training both faster and more stable.

🧠 The idea

SPO trains with 1 response per prompt, keeps a persistent baseline per prompt, and normalizes advantages across the batch, which stabilizes learning and cuts waste.

This removes degenerate groups that give 0 signal and avoids group synchronization stalls in distributed runs.

On math reasoning with Qwen3-8B, SPO improves accuracy and learns more smoothly than GRPO.

Sensational fact: 4.35× throughput speedup in an agentic simulation and +3.4 pp maj@32 over GRPO.

🧵 Read on 👇

🧵2/n. 🧩 The problem with group-based training

Group-based methods sample many responses per prompt to compute a relative baseline, but when every response in a group is all correct or all wrong the advantages become 0 and the step gives no gradient.

Heuristics like dynamic sampling try to force a non-zero advantage, but they add complexity and keep a synchronization barrier that slows large-scale training.
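A small sketch contrasts the two setups. The group function shows the zero-signal case; the tracker is a simple exponential moving average chosen for illustration, so treat it as a stand-in and not as SPO's actual baseline estimator.

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # degenerate group: all same reward, no signal
    return [(r - mean) / std for r in rewards]

class BaselineTracker:
    """Hypothetical per-prompt running baseline (exponential moving average)."""

    def __init__(self, init=0.5, rate=0.1):
        self.values = {}
        self.init = init
        self.rate = rate

    def advantage(self, prompt_id, reward):
        """Return reward minus the stored baseline, then update the baseline."""
        baseline = self.values.get(prompt_id, self.init)
        self.values[prompt_id] = baseline + self.rate * (reward - baseline)
        return reward - baseline

def normalize_batch(advantages):
    """Normalize single-response advantages across the whole batch."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / std if std > 0 else 0.0 for a in advantages]
```

A group of four correct answers yields four zeros, so those samples are wasted. With one response per prompt and a baseline that persists across steps, each response still gets a non-zero advantage until the baseline catches up, and no worker has to wait for a full group to finish.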

Sep 19 • 7 tweets • 5 min read

LLM for financial trading/decision making.

A 4B financial-domain model, Trading-R1, that writes clear analyst theses and turns them into trades.

It's trained on 100K cases over 18 months across 14 tickers, and its backtests show better risk-adjusted returns with smaller drawdowns.

The problem it tackles is simple: quant models are hard to read, and general LLMs write nice text that does not translate into disciplined trades.

The solution starts by forcing a strict thesis format, with separate sections for market data, fundamentals, and sentiment, and every claim must point to evidence from the given context.

Then it learns decisions by mapping outcomes into 5 labels, strong buy, buy, hold, sell, strong sell, using returns that are normalized by volatility over several horizons.

For training, it first copies high-quality reasoning distilled from stronger black-box models using supervised fine-tuning, then it improves with a reinforcement method called group relative policy optimization.

In held-out tests on NVDA, AAPL, AMZN, META, MSFT, and SPY, the combined approach beats small and large baselines on Sharpe and max drawdown, and the authors position it as research support, not high-frequency automation.

🧵 Read on 👇

Sep 18 • 12 tweets • 7 min read

👨‍🔧 The DeepSeek R1 Nature paper’s supplementary notes are a goldmine across 83 solid pages.

They cover everything from training data and hyperparameters to why the base model matters.

Reinforcement learning, not just supervised fine-tuning, is what pushes DeepSeek‑R1 to generate long, reflective reasoning that actually fixes its own mistakes.

They train with Group Relative Policy Optimization, drop the value model, manage divergence to a moving reference, and let the model scale test‑time thinking to crack harder problems.

🧵 Read on 👇

🧵2/n. 🔁 GRPO, not PPO

Group Relative Policy Optimization samples a small group of answers for each prompt, scores them, normalizes those scores within the group, and updates the policy toward the better ones while skipping a separate value model.

They control drift with an unbiased Kullback–Leibler estimate against a reference policy and periodically refresh that reference, which avoids over‑penalizing long responses and cuts memory and compute.

On the same backbone, Proximal Policy Optimization needed careful lambda tuning to approach GRPO on math, which made GRPO the lower‑friction choice in practice.
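The two GRPO ingredients described above can be sketched in a few lines. This is a simplified per-sample view under stated assumptions: real training applies these per token inside a clipped policy-gradient objective, and the `eps` constant is an illustrative choice.

```python
import math
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def kl_estimate(logp_policy, logp_ref):
    """Estimate ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta.

    Averaged over samples drawn from the policy, this is an unbiased estimate of
    KL(pi_theta || pi_ref), and each individual term is non-negative.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0
```

Because the baseline is just the group mean, no value network has to be trained or held in memory, which is where the savings over PPO come from.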

Sep 18 • 5 tweets • 4 min read

🇨🇳 DeepSeek-R1 was published in Nature yesterday as the cover article for their BRILLIANT latest research.

They show that pure Reinforcement Learning with answer-only rewards can grow real reasoning skills, no human step-by-step traces required.

So you can completely skip human reasoning traces and still get SOTA reasoning via pure RL.

It’s such a powerful revelation because, instead of forcing the model to copy human reasoning steps, it only rewards getting the final answer right, which gives the model freedom to invent its own reasoning strategies that can actually go beyond human examples.

Earlier methods capped models at what humans could demonstrate, but this breaks that ceiling and lets reasoning emerge naturally.

Those skills include self-checking, verification, and changing strategy mid-solution, and they beat supervised baselines on tasks where answers can be checked.

Models trained this way also pass those patterns down to smaller models through distillation.

AIME 2024 pass@1 jumps from 15.6% to 77.9%, and hits 86.7% with self-consistency.

⚙️ The Core Concepts

The paper replaces human-labelled reasoning traces with answer-graded RL, so the model only gets a reward when its final answer matches ground truth, which frees it to search its own reasoning style.

The result is longer thoughts with built-in reflection, verification, and trying backups when stuck, which are exactly the skills needed for math, coding, and STEM problems where correctness is checkable.

This matters because supervised traces cap the model at human patterns, while answer-graded RL lets it discover non-human routes that still land on correct answers.

🧪 How R1-Zero is trained

R1-Zero starts from DeepSeek-V3 Base and uses GRPO, a group-based variant of PPO that compares several sampled answers per question, then pushes the policy toward the higher-reward ones while staying close to a reference model.

Training enforces a simple output structure with separate thinking and final answer sections, and uses rule-based rewards for accuracy and formatting, avoiding neural reward models that are easy to game at scale.

This minimal setup is intentional, because fewer constraints make it easier to observe what reasoning behaviours show up on their own.
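A rule-based reward of this kind can be as small as a regex and a string comparison. The tag names, the exact-match check, and the reward values below are assumptions for illustration; the thread only says the output has separate thinking and final answer sections.

```python
import re

# Assumed output template: a thinking section followed by a final answer section.
PATTERN = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL)

def rule_based_reward(output, ground_truth, format_bonus=0.1):
    """Return 0 for malformed output, else the format bonus plus 1 if the answer matches."""
    match = PATTERN.match(output.strip())
    if match is None:
        return 0.0
    answer = match.group(2).strip()
    accuracy = 1.0 if answer == ground_truth.strip() else 0.0
    return accuracy + format_bonus
```

Nothing in the reward looks at the content of the thinking section, so the model is free to fill it however it likes, and there is no learned reward model to exploit.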

Sep 16 • 11 tweets • 5 min read

Sep 15 • 8 tweets • 6 min read

Powerful new discoveries in this paper for autonomous software design.🎯

Will completely shift the way software and AI programs are written.

1/ Tau is in the process of constructing the next wave of AI.

Tau Language lets you write a spec of what a program should and shouldn’t do, and its logical engine automatically constructs a program mathematically guaranteed to meet your spec, removing manual implementation.

The most time-consuming aspect of software dev used to be writing correct code; now, it's about conveying intent accurately in specifications and getting correct-by-construction software.

This foundation is also the subject of a U.S. patent that covers using such temporal logics and Boolean‑algebraic theories for safe AI and a software‑spec logic, which matches the design.

2/ How this is different from today

With Tau, you directly state properties of the program like a formalization of “never send private data over the network”, and it produces a provably correct implementation that satisfies them.

This breaks away from today’s coding in which you write how and what a program should do at each step. And unlike Tau, in code you can’t say what the program should never do, you test and hope you covered edge cases.

In the Tau Language, programs, inputs, and outputs can be sentences in the Tau language itself, which is the first logic ever that can consistently refer to its own sentences.

Why LLMs Fall Short:

People expect deterministic and correct output from probabilistic tools, which can’t be trusted to be reliable. Imagine the disastrous results if an airplane manufacturer decided to use code generated by LLMs. How many of you would take that flight?

Gen AI's probabilistic nature creates entropy precisely where complex systems need precision and reliability.

The V-model dev model is the standard for critical products developed for the medical industry.

The deeper you are into your V-model product development cycle, the more it costs to fix a defect.

🧩 What makes Tau Language a huge breakthrough:

Tau’s Founder @ohadasor made several novel inventions, which together work as a masterpiece in theoretical computer science.

Tau Language straddles a fine line, retaining decidability while being expressive enough to write specs of complex systems in their entirety, where other decidable formal languages simply aren’t strong enough.

Let’s dive deeper into Tau Language’s novel research:

🧵 1/n

🧵 2/n. 🧩 NSO (Nullary Second Order), the self‑referential core

NSO abstracts sentences so aggressively that each sentence becomes just a Boolean algebra element, which lets the logic speak about sentences in the same language without running into the usual truth paradoxes.

Because it deals with countable atomless Boolean algebras, you keep decidability, and even NSO[C] is decidable iff C is decidable, so extensions stay manageable.

This relies on the Lindenbaum–Tarski algebra view of logics, which collects sentences into equivalence classes under logical equivalence, turning syntax into algebra and letting the rest be handled with ordinary Boolean operations.

Sep 14 • 16 tweets • 7 min read

Sep 14 • 5 tweets • 3 min read

🧠 🧩 🚦 📉 ⚙️ University of Sheffield argues LLM hallucinations are mathematically inevitable.

And using confidence thresholds the way OpenAI proposes would cut hallucinations but break consumer UX and spike costs.

The core claim is that next-token generation stacks errors across a sentence, so even with perfect data the total mistake rate grows.

A language model builds a sentence word by word. At each step, it picks the next word it thinks is most likely. If it makes one small mistake early on, that mistake affects the words that come after it. The sentence then drifts further from the correct answer.

Now compare that to a yes/no question. The model only has to pick between two options: “yes” or “no.” There is just one decision, so fewer chances for error.

A "yes/no question" is like a baseline: one single prediction, no chain of dependencies. But a sentence is a long chain of predictions, and each link in the chain can go wrong.

This is why the study says the error rate for full sentences will always be at least 2 times higher than for simple yes/no answers: in sentences, errors can accumulate word by word instead of being contained in a single decision.
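A rough way to see the compounding effect is sketched below. It assumes every token is chosen independently with the same error rate, which is a simplification of mine and not the study's derivation, and it does not reproduce the study's 2x bound. The function name `sequence_error_rate` is hypothetical.

```python
def sequence_error_rate(per_token_error: float, n_tokens: int) -> float:
    """Chance that at least one of n independent token choices is wrong.

    Assumes each token has the same, independent error probability.
    """
    if not 0.0 <= per_token_error <= 1.0:
        raise ValueError("per_token_error must be between 0 and 1")
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return 1.0 - (1.0 - per_token_error) ** n_tokens


# One decision (a yes/no answer) versus a 20-token sentence,
# both with an illustrative 1% error rate per decision.
single = sequence_error_rate(0.01, 1)
sentence = sequence_error_rate(0.01, 20)
print(f"single decision: {single:.3f}, 20-token sentence: {sentence:.3f}")
```

Under these assumptions the chance of at least one slip grows quickly with length, which is the intuition behind treating the single yes/no decision as the baseline.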

In plain terms, incentives today still favor fast, cheap, confident replies over slower, cautious, correct ones, so hallucinations will stick around.

2/n. Rarer facts are even more prone to hallucinations.

When a model is trained, it learns facts from the data it sees. Some facts show up many times during training, so the model gets strong evidence for them. Other facts only appear once or twice, so the model has weak evidence for them.

The study gives an example with birthdays. Suppose 20% of the people in the training set only have their birthday mentioned once. For those people, the model basically has just a single memory of the fact. That is too little for the model to reliably recall it later.

As a result, when you ask about those birthdays, the model will likely get at least 20% of them wrong — because those facts were too rare in its training data.
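The birthday example boils down to counting how many facts appear exactly once in the training data. A minimal sketch of that count follows; the function name `singleton_rate` and the toy data are hypothetical, and this is only the counting step, not the study's proof.

```python
from collections import Counter


def singleton_rate(mentions):
    """Fraction of distinct facts that appear exactly once in the data."""
    counts = Counter(mentions)
    if not counts:
        return 0.0
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)


# Toy training set: each entry is one mention of a person's birthday.
# Four people are mentioned twice, one person only once.
mentions = ["ana", "ana", "bo", "bo", "cy", "cy", "di", "di", "eve"]
print(singleton_rate(mentions))
```

Here 1 of 5 people has a single mention, so the singleton rate is 20%, matching the proportion used in the example above.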

Sep 12 • 6 tweets • 2 min read

Congrats to @CreaoAI for hitting #1 on Product Hunt (Sept 11) 🚀

Just used it myself, and it was quite a smooth experience.

CREAO is an AI Agent that builds full-stack mini-SaaS from one sentence.

One sentence in → frontend + backend + data layer out.

They are building a platform to provide the critical interface for people to build apps where humans and AI agents can collaborate seamlessly.

So its entire infrastructure is engineered with an "AI-native first" philosophy.

🧵1/n.

🧵2/n. ⚡ All-in-one build.

CREAO gave me a deployable product — frontend, backend, database together.

#1 on Product Hunt (Sept 11).

Sep 11 • 5 tweets • 3 min read

🇨🇳China unveils world's first brain-like AI Model SpikingBrain1.0

Up to 100X faster while being trained on less than 2% of the data typically required.

Designed to mimic human brain functionality, it uses much less energy. A new paradigm in efficiency and hardware independence.

Marks a significant shift from current AI architectures.

Unlike models such as GPT and LLaMA, which use attention mechanisms to process all input in parallel, SpikingBrain1.0 employs localized attention, focusing only on the most relevant recent context.

Potential Applications:

- Real-time, low-power environments
- Autonomous drones and edge computing
- Wearable devices requiring efficient processing
- Scenarios where energy consumption is critical

This project is part of a larger scientific pursuit of neuromorphic computing, which aims to replicate the remarkable efficiency of the human brain, which operates on only about 20 watts of power.

---

arxiv.org/abs/2509.05276

🧠 The idea behind SpikingBrain's human-brain-inspired linear and hybrid-linear LLM architecture.

- SpikingBrain replaces most quadratic attention with linear and local attention, mixes in selective full attention where it matters, and adds an adaptive spiking activation so the model computes only on meaningful events.

- It proves the whole recipe works at scale by training and serving on MetaX C550 GPUs, which are non‑NVIDIA devices, without giving up quality on common benchmarks.

- The headline efficiencies come from 3 levers working together, linear attention for compressed memory, MoE for token-wise sparsity, and spiking for micro-level sparsity.
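To illustrate why swapping full attention for local attention matters, the sketch below counts how many query-key pairs each scheme scores. It covers only the sliding-window (local) part; linear attention, MoE, and spiking are not modeled. The window size and the function names are hypothetical and are not taken from the SpikingBrain paper.

```python
def causal_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full causal attention: grows as n squared."""
    return n_tokens * (n_tokens + 1) // 2


def local_pairs(n_tokens: int, window: int) -> int:
    """Pairs scored when each token attends only to its last `window` tokens."""
    return sum(min(i + 1, window) for i in range(n_tokens))


def local_mask(n_tokens: int, window: int):
    """Boolean mask: row i may attend to column j if j is within the window."""
    return [[0 <= i - j < window for j in range(n_tokens)] for i in range(n_tokens)]


# Small mask so the banded shape is visible.
for row in local_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))

# Cost comparison for a longer sequence with an illustrative window of 8.
print(causal_pairs(1000), local_pairs(1000, 8))
```

With a fixed window, the number of scored pairs grows roughly linearly with sequence length instead of quadratically, which is the kind of saving the first bullet describes.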
