KL divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in data, and its most important metric is called entropy.
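As a quick refresher, here are the standard textbook definitions (general facts, not tied to any specific paper below):

```latex
% Entropy of a discrete distribution p: the average information content.
H(p) = -\sum_{x} p(x)\,\log p(x)

% KL divergence: the extra bits paid for modeling p with an approximation q.
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log \frac{p(x)}{q(x)}
```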
DeepSeek's pace of innovation is on another level.
Its latest paper uncovers a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram).
It shows that N-grams still matter. Instead of dropping them in favor of neural networks, DeepSeek hybridizes the two, which sidesteps the dimensionality problem of classical N-grams and removes a big source of inefficiency in modern LLMs.
Right now, even “smart” LLMs waste a bunch of their early layers re-building common phrases and names from scratch, because they do not have a simple built-in “lookup table” feature.
Mixture-of-Experts already saves compute by only running a few expert blocks per token, but it still forces the model to spend compute to recall static stuff like named entities and formula-style text.
Engram is basically a giant memory table that gets queried using the last few tokens, so when the model sees a familiar short pattern it can fetch a stored vector quickly instead of rebuilding it through many layers.
They implement that query using hashed 2-gram and 3-gram patterns, which means the model always does the same small amount of lookup work per token even if the table is huge.
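A minimal sketch of that lookup idea (the table size, hash scheme, and merge gate here are illustrative assumptions, not DeepSeek's exact design): hash the last 2 or 3 token ids into a fixed-size embedding table, fetch the stored vector, and add it into the hidden state, so the lookup cost per token is constant no matter how big the table grows.

```python
import torch
import torch.nn as nn

class NGramEngram(nn.Module):
    """Sketch of a hashed N-gram memory with constant-time lookup per token.
    Sizes, the hash function, and the gated merge are assumptions for
    illustration only."""

    def __init__(self, num_slots: int = 200_000, d_model: int = 512):
        super().__init__()
        self.num_slots = num_slots
        self.table = nn.Embedding(num_slots, d_model)  # the "memory table"
        self.gate = nn.Linear(d_model, d_model)        # learned gate before merging

    def hash_ngram(self, token_ids: torch.Tensor, n: int) -> torch.Tensor:
        # token_ids: (batch, seq). Hash the last n token ids at each position
        # into [0, num_slots) with a simple multiplicative rolling hash.
        h = torch.zeros_like(token_ids)
        for k in range(n):
            shifted = torch.roll(token_ids, shifts=k, dims=1)
            shifted[:, :k] = 0  # positions without a full n-gram map to slot 0
            h = (h * 1_000_003 + shifted) % self.num_slots
        return h

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # Fetch stored vectors for hashed 2-grams and 3-grams, gate, and add.
        mem = self.table(self.hash_ngram(token_ids, 2)) + self.table(self.hash_ngram(token_ids, 3))
        return hidden + torch.sigmoid(self.gate(hidden)) * mem
```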
The big benefit is that if early layers stop burning time on “static reconstruction,” the rest of the network has more depth left for real reasoning, and that is why reasoning scores go up even though this sounds like “just memory.”
The long-context benefit is also solid, because offloading local phrase glue to memory frees attention to focus on far-away relationships, and Multi-Query Needle-in-a-Haystack goes from 84.2 to 97.0 in their matched comparison.
The system-level big deal is cost and scaling, because they show you can offload a 100B memory table to CPU memory and the throughput drop stays under 3%, so you can add a lot more “stored stuff” without needing to fit it all on GPU memory.
🧩 The core problem
The paper splits language modeling into 2 jobs: deep reasoning that needs real computation, and local stereotyped patterns that are basically fast recall.
Transformers do not have a native lookup block, so they burn early attention and feed-forward layers to rebuild static stuff like multi-token entities and formulaic phrases.
That rebuild is expensive mainly because it eats sequential depth, meaning the model spends layers on trivia-like reconstruction before it even starts the harder reasoning steps.
Classical N-gram models already handle a lot of this local dependency work with cheap table access, so forcing a Transformer to relearn it through compute is a design mismatch.
Engram is their way of turning “lookup” into a first-class primitive that lives next to MoE, instead of being faked by extra neural layers.
Engram adds a huge hashed N-gram memory table that gets queried with a fixed amount of work per token, so early layers stop wasting compute rebuilding names and stock phrases.
They show the best results when about 20% to 25% of the sparse budget moves from experts into this memory, while total compute stays matched.
Engram hits 97.0 on Multi-Query Needle-in-a-Haystack, while the matched MoE baseline hits 84.2.
Anthropic has launched improved safety classifiers aimed at stopping AI jailbreaks.
The key idea is to add a cheap “early warning” safety check that runs all the time and only turn on a stronger safety check when something looks suspicious. The stronger check judges the user prompt and the model’s reply together, so hidden or coded jailbreak requests do not slip through.
A new “Constitutional Classifiers++” setup stops universal jailbreaks without making the model expensive or annoying to use, by using a cheap internal probe to screen everything and only escalating suspicious chats to a stronger context-aware classifier.
40x less compute than a full exchange classifier, and a 0.05% refusal rate on production traffic.
They proposed an “ensemble cascade” where an exchange classifier reads the prompt and the model’s partial answer together to beat obfuscation tricks, but it runs rarely because a near-free linear probe, built off the model’s own internal activations, decides when escalation is needed.
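A minimal sketch of that cascade logic (the probe, classifier, and threshold here are hypothetical stand-ins, not Anthropic's actual components):

```python
def moderate_exchange(prompt, partial_reply, activations,
                      probe, exchange_classifier, threshold=0.5):
    """Sketch of a probe-gated cascade: a near-free linear probe over the
    model's internal activations screens every exchange, and only suspicious
    ones are escalated to the expensive context-aware exchange classifier.
    `probe`, `exchange_classifier`, and `threshold` are illustrative."""
    # Stage 1: cheap linear probe on activations, run on every exchange.
    suspicion = probe.score(activations)  # e.g. a dot product plus a sigmoid
    if suspicion < threshold:
        return "allow"  # the vast majority of traffic stops here

    # Stage 2: the stronger classifier reads prompt and partial reply together,
    # so harm split across the two sides (reconstruction, code words) is visible.
    verdict = exchange_classifier.judge(prompt=prompt, reply=partial_reply)
    return "block" if verdict.is_harmful else "allow"
```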
🧠 The idea
A “Constitutional Classifier” is a separate model whose whole job is to read a conversation and decide whether the assistant should continue or refuse.
“Constitutional” here means the classifier is trained against a written rule set that says what is allowed and what is disallowed, like helping with normal chemistry learning but not with chemical weapon steps.
The core deployment problem is that defenses need to be cheap and rarely block normal users, because even a tiny refusal rate becomes real friction at scale.
This paper treats compute cost and refusal rate as first-class constraints, not afterthoughts, and the whole design is built around that.
🚨 What broke in the last system
The earlier setup split the job into 2 separate checks: one model looked only at the user input, and another looked only at the assistant output.
That split sounds clean, but it creates a blind spot when the harmful meaning is distributed across the prompt and the reply, so neither side looks bad alone.
Reconstruction-style attacks exploit this by hiding a harmful request inside a larger harmless-looking blob, then asking the model to piece it back together while answering.
Obfuscation-style attacks exploit it by pushing the model to speak in code words, riddles, or substitutions that look safe if the output is judged without the prompt context.
Some of these attack styles also damage normal model capability, and the paper shows GPQA Diamond accuracy dropping from 74.2% to 32.3% under 1 such jailbreak pattern, which signals the attack is “expensive” for the attacker, but that side effect is not something to rely on as a defense.
🚨 BREAKING: DeepSeek dropped a core Transformer architecture improvement.
A traditional transformer is basically a long stack of blocks, and each block has a “main work path” plus a “shortcut path” called the residual connection that carries the input around the block and adds it back at the end.
Each block in this original transformer architecture does some work (self attention or a small feed forward network), then it adds the block’s input back onto the block’s output, which is why people describe it as a “main path” plus a “shortcut path.”
Hyper-Connections is a drop-in change to that shortcut path, because instead of carrying 1 stream of activations through the stack, the model carries a small bundle of parallel streams, then it learns how to mix them before a block and after a block.
Standard Transformers pass information through 1 residual stream. Hyper-Connections turn that into n parallel streams, like n lanes on a highway. Small learned matrices decide how much of each lane should mix into the others at every layer.
In a normal residual connection, each layer takes the current hidden state, runs a transformation, then adds the original back, so information can flow forward without getting stuck.
With Hyper-Connections, the layer does not see just 1 hidden state; it sees a small bundle of them, and before the layer runs the model learns how to mix that bundle into the input it will process.
So in a traditional transformer block, wherever you normally do “output equals input plus block(input),” Hyper-Connections turns that into “output bundle equals a learned mix of the input bundle plus the block applied to a learned mix,” so the shortcut becomes more flexible than a plain add.
After the layer runs, the Hyper-Connections mechanism learns a second mix that writes the transformed result back into the bundle, so different lanes can carry different kinds of information and the model can route signal through the shortcut more flexibly.
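A minimal sketch of that bundle-of-streams residual (shapes, initialization, and mixer names are my assumptions; the actual papers use more careful parameterizations):

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Sketch of Hyper-Connections around one transformer sub-block.
    Instead of `x = x + block(x)` on a single residual stream, we keep
    n streams and learn three small mixers: stream-to-stream, streams-to-block
    input, and block output-to-streams."""

    def __init__(self, block: nn.Module, n_streams: int = 4):
        super().__init__()
        self.block = block
        # Mixes the n residual streams with each other (the crucial one).
        self.stream_mix = nn.Parameter(torch.eye(n_streams))
        # Gathers the n streams into the single input the block actually sees.
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        # Writes the block's output back into each of the n streams.
        self.write = nn.Parameter(torch.ones(n_streams))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, d_model)
        mixed = torch.einsum("ij,jbtd->ibtd", self.stream_mix, streams)
        block_in = torch.einsum("j,jbtd->btd", self.read, mixed)
        block_out = self.block(block_in)
        # Each stream keeps its carried-over mix plus its share of the output.
        return mixed + self.write[:, None, None, None] * block_out[None]
```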
The catch is that if those learned mixing weights are unconstrained, stacking many blocks can make signals gradually blow up or fade out, and training becomes unstable in big models.
This paper proposes mHC, which keeps Hyper-Connections but forces every mixing step to behave like a safe averaging operation, so the shortcut stays stable while the transformer still gets the extra flexibility from multiple lanes.
---
The paper shows this stays stable at 27B scale and beats both a baseline and unconstrained Hyper-Connections on common benchmarks.
HC can hit about 3000x residual amplification; mHC keeps it around 1.6x.
This image compares 3 ways to build the shortcut path that carries information around a layer in a transformer.
The left panel is the normal residual connection, where the model adds the layer output back to the original input so training stays steady as depth grows.
The middle panel is Hyper-Connections, where the model keeps several parallel shortcut streams and learns how to mix them before the layer, around the layer, and after the layer, which can help quality but can also make the shortcut accidentally amplify or shrink signals when many layers stack.
The right panel is mHC, which keeps the same Hyper-Connections idea but forces those mixing steps to stay in a constrained safe shape every time, so the shortcut behaves like a controlled blend and stays stable at large scale.
What “hyper-connection” means here.
You widen the residual from size C to n×C, treat it as n streams, and learn 3 tiny mixing pieces per layer.
One mixes the residual streams with each other (this is the crucial one). One gathers from the streams into the layer's input. One writes the layer's results back to the streams.
The paper’s contribution is to keep the first one in the safe “doubly stochastic” set, so it mixes without amplifying.
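A minimal sketch of how a mixing matrix can be kept approximately doubly stochastic, here via standard Sinkhorn-style row/column normalization; whether mHC uses exactly this parameterization is an assumption on my part.

```python
import torch

def sinkhorn_doubly_stochastic(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Map an n x n matrix of unconstrained logits to an (approximately)
    doubly stochastic matrix: nonnegative entries, every row and every column
    summing to 1. Such a matrix can only average the streams, never amplify
    them, which is the stability property being targeted. The paper's exact
    projection may differ; this is one standard way to do it."""
    mat = torch.softmax(logits, dim=-1)  # nonnegative, rows sum to 1
    for _ in range(n_iters):
        mat = mat / mat.sum(dim=0, keepdim=True)  # normalize columns
        mat = mat / mat.sum(dim=1, keepdim=True)  # normalize rows
    return mat

# Example: a 4-stream mixer whose rows and columns each sum to roughly 1.
mix = sinkhorn_doubly_stochastic(torch.randn(4, 4))
print(mix.sum(dim=0), mix.sum(dim=1))  # both close to tensor([1., 1., 1., 1.])
```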
A MASSIVE 303 page study from the very best Chinese Labs.
The paper explains how code focused language models are built, trained, and turned into software agents that help run parts of development.
These models read natural language instructions, like a bug report or feature request, and try to output working code that matches the intent.
The authors first walk through the training pipeline, from collecting and cleaning large code datasets to pretraining, meaning letting the model absorb coding patterns at scale.
They then describe supervised fine tuning and reinforcement learning, which are extra training stages that reward the model for following instructions, passing tests, and avoiding obvious mistakes.
On top of these models, the paper surveys software engineering agents, which wrap a model in a loop that reads issues, plans steps, edits files, runs tests, and retries when things fail.
Across the survey, they point out gaps like handling huge repositories, keeping generated code secure, and evaluating agents reliably, and they share practical tricks that current teams can reuse.
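A minimal sketch of the read-plan-edit-test-retry loop those agents run (the `model.generate` interface, test command, and patch helper are hypothetical placeholders, not a specific framework's API):

```python
import subprocess

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and return (passed, log).
    The pytest command is a placeholder for the repo's own test runner."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def apply_patch(patch: str) -> None:
    """Placeholder: write the diff to disk and try to apply it with git."""
    with open("agent_patch.diff", "w") as f:
        f.write(patch)
    subprocess.run(["git", "apply", "agent_patch.diff"], check=False)

def software_agent(issue: str, model, max_attempts: int = 5) -> bool:
    """Sketch of an issue-to-patch agent loop: plan, edit, run tests, retry.
    `model` is assumed to be any LLM wrapper with a `.generate(prompt)` method."""
    plan = model.generate(f"Issue:\n{issue}\n\nWrite a step-by-step fix plan.")
    feedback = ""
    for _ in range(max_attempts):
        patch = model.generate(
            f"Issue:\n{issue}\nPlan:\n{plan}\nPrevious test output:\n{feedback}\n"
            "Produce a unified diff that fixes the issue."
        )
        apply_patch(patch)              # edit files in the working tree
        passed, feedback = run_tests()  # ground the loop in real test results
        if passed:
            return True                 # tests pass, the agent is done
    return False                        # give up after max_attempts
```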
Overview of the evolution of code large language models (Code-LLMs) and related ecosystems from 2021 to 2025.
Evolution of programming development and research landscapes in AI-powered code generation.
Agents, robots, and us: Skill partnerships in the age of AI
- Today’s technologies could theoretically automate more than half of current US work hours. This reflects how profoundly work may change.
- By 2030, about $2.9 trillion of economic value could be unlocked in the United States.
- Demand for AI fluency—the ability to use and manage AI tools—has grown 7X in two years, faster than for any other skill in US job postings. The surge is visible across industries and likely marks the beginning of much bigger changes ahead.
Two-thirds of US work hours require only nonphysical capabilities.