Experts can now identify and track you using Wi-Fi signals that bounce off your body - and it's over 95% accurate.
A new surveillance method identifies and tracks you using Wi-Fi signals — without needing a phone, camera, or wearable.
Developed by researchers at La Sapienza University of Rome, the system has been dubbed "WhoFi."
It reads how Wi-Fi waves interact with a person’s body, essentially creating a unique biometric “fingerprint” based on the way wireless signals bounce off them.
This allows a person to be identified and re-identified across rooms and even different locations, all without visible technology or consent.
🔦 Why swap cameras for radio waves
Wi-Fi keeps working when lights are off, walls block sight, or crowds get in the way. A router sends a signal; air, walls, bones, and backpacks bend that signal, and every body shape leaves its own tiny distortions. The paper captures those distortions, known as Channel State Information (CSI), as a privacy-friendly fingerprint.
Unlike previous attempts at similar tracking, which topped out at 75% accuracy, WhoFi leverages neural networks and standard, low-cost Wi-Fi routers to achieve unprecedented precision.
The implications are enormous: this could revolutionize everything from retail analytics to law enforcement surveillance, raising pressing questions about privacy.
The system works even through walls and in the dark, potentially making it more powerful than traditional camera systems. While still in the experimental stage, the technology’s reliance on widely available hardware suggests it could be deployed at scale sooner than most would expect.
🧬 What lives inside a Wi-Fi packet
Each packet carries amplitude (how strong the signal arrives) and phase (how the wave shifts over time). Noise and hardware drift skew both pieces, so the team uses median filters for amplitude and a simple line-fitting trick for phase to clean things up.
After a pass of Gaussian noise, random scaling, and tiny time shifts, the data is ready for the network.
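For readers who want to see the shape of that cleanup, here is a minimal sketch in NumPy/SciPy, assuming the CSI for one walk arrives as (packets × subcarriers) arrays; the filter kernel, noise level, and shift range are illustrative guesses, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import medfilt

def clean_csi(amplitude, phase, kernel=5):
    """Denoise one CSI sequence: median-filter amplitude, detrend phase.

    amplitude, phase: arrays of shape (packets, subcarriers).
    Kernel size and the linear phase fit are illustrative choices.
    """
    # Median filter along the time axis to suppress amplitude spikes
    amp = medfilt(amplitude, kernel_size=[kernel, 1])

    # Remove linear phase drift per packet by fitting a line across subcarriers
    idx = np.arange(phase.shape[1])
    slope = (phase[:, -1] - phase[:, 0]) / (idx[-1] - idx[0])
    offset = phase.mean(axis=1, keepdims=True)
    ph = phase - slope[:, None] * idx[None, :] - offset
    return amp, ph

def augment(x, rng):
    """Gaussian noise, random scaling, and a tiny circular time shift."""
    x = x + rng.normal(0.0, 0.01, x.shape)        # additive noise
    x = x * rng.uniform(0.9, 1.1)                  # random amplitude scaling
    x = np.roll(x, rng.integers(-5, 6), axis=0)    # small time shift
    return x
```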
🛠️ Turning raw radio into signatures
The pipeline has two parts. First comes an encoder that squeezes long CSI sequences into a compact vector. The authors test three options:
- LSTM, good with short memories
- Bi-LSTM, scans forward and backward
- Transformer, uses self-attention to spot long-range patterns
Next is a lightweight linear layer followed by L2 normalization to keep every vector on the same scale, making cosine similarity trivial.
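A rough sketch of what such an encoder-plus-head could look like in PyTorch, using the Transformer option; the layer sizes and the mean-pooling choice are assumptions, not the paper's exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class WiFiSignatureEncoder(nn.Module):
    """CSI sequence -> L2-normalized signature vector (illustrative sizes)."""
    def __init__(self, n_subcarriers=114, d_model=128, n_heads=4, n_layers=1, d_sig=128):
        super().__init__()
        self.proj = nn.Linear(n_subcarriers, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, d_sig)        # lightweight linear layer

    def forward(self, csi):                          # csi: (batch, packets, subcarriers)
        h = self.encoder(self.proj(csi))             # (batch, packets, d_model)
        sig = self.head(h.mean(dim=1))               # pool over packets, then project
        return F.normalize(sig, dim=-1)              # unit norm: cosine similarity = dot product
```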
🎯 Training without labeled pairs
Instead of building endless triplets, the paper relies on in-batch negative loss. A batch holds matched query and gallery samples from several people. The model pushes matching pairs to the diagonal of a similarity matrix and shoves the rest away. Every example acts as a negative for every other person, squeezing more value out of small batches.
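Here is a minimal sketch of that in-batch negative loss, assuming query and gallery signatures are already L2-normalized and row i of each belongs to the same person; the temperature value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_sig, gallery_sig, temperature=0.07):
    """Contrastive loss over a batch of matched query/gallery signatures.

    query_sig, gallery_sig: (batch, dim), L2-normalized. Matching pairs sit on
    the diagonal of the similarity matrix; every other row acts as a negative.
    """
    sim = query_sig @ gallery_sig.T / temperature             # (batch, batch) similarities
    targets = torch.arange(sim.size(0), device=sim.device)    # correct match = diagonal index
    return F.cross_entropy(sim, targets)
```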
📊 Results that matter
Tests run on the public NTU-Fi set with 14 subjects walking under three clothing setups. The Transformer wins with 95.5% Rank-1 and 88.4% mAP, while Bi-LSTM trails at 84.5% Rank-1 and LSTM lags at 77.7%. Training on sequences of 200 packets is the sweet spot; giving the Transformer more packets helps a bit, while the LSTM gains nothing.
🔍 What the ablations reveal
Amplitude filtering, meant to clip outliers, quietly hurts accuracy, dropping the Transformer to 93% Rank-1. Data augmentation lifts the LSTM models by a few points but barely nudges the Transformer, showing that self-attention already handles variation well. Deeper is not always better: a single-layer Transformer beats a 3-layer version that overfits.
🔐 Privacy plus convenience
Routers as sensors mean no faces stored, no GDPR nightmares, and no need to scatter cameras. A single transmitter and three-antenna receiver, the paper’s TP-Link N750 setup, covers the whole test room. That makes deployment cheap in offices, hospitals, or smart homes that value discretion.
Paper – "WhoFi: Deep Person Re-Identification via Wi-Fi Channel Signal Encoding" (arxiv.org/abs/2507.12869)
Anthropic just showed that an AI's “personality” can be traced to specific directions in its brain ("persona vectors"), and what might make it act in evil or unsafe ways.
Sometimes when you're chatting with a model, it suddenly starts behaving oddly—overly flattering, factually wrong, or even outright malicious. This paper is about understanding why that happens, and how to stop it.
🧠 What's going on inside these models?
AI models don’t actually have personalities like humans do, but they sometimes act like they do—especially when prompted a certain way or trained on particular data.
Anthropic’s team found that specific behaviors, like being “evil,” “sycophantic,” or prone to “hallucination,” show up as linear directions inside the model's activation space.
They call these persona vectors.
Think of it like this: if you observe how the model responds in different situations, you can map those behaviors to certain regions inside the model’s brain. And if you spot where these traits live, you can monitor and even control them.
---
The diagram shows a simple pipeline that turns a plain description of a trait such as evil into a single “persona vector”, which is just a pattern of activity inside the model that tracks that trait.
Once this vector exists, engineers can watch the model’s activations and see in real time if the model is drifting toward the unwanted personality while it runs or while it is being finetuned.
The very same vector works like a control knob.
Subtracting it during inference tones the trait down, and sprinkling a small amount of it during training teaches the model to resist picking that trait up in the first place, so regular skills stay intact.
Because each piece of training text can also be projected onto the vector, any snippet that would push the model toward the trait lights up early, letting teams filter or fix that data before it causes trouble.
All that means you can do the following with a model:
- Watch how a model’s personality evolves, either while chatting or during training
- Control or reduce unwanted personality changes as the model is being developed or trained
- Figure out what training data is pushing those changes
🧵 Read on 👇
🔬 How to make sense of this persona vector?
Think of a large language model as a machine that turns every word it reads into a long list of numbers. That list is called the activation vector for that word, and it might be 4096 numbers long in a model the size of Llama-3.
A persona vector is another list of the same length, but it is not baked into the model’s weights. The team creates it after the model is already trained:
- They run the model twice with the same user question, once under a “be evil” system prompt and once under a “be helpful” prompt.
- They grab the hidden activations from each run and average them, so they now have two mean activation vectors.
- They subtract the helpful average from the evil average. The result is a single direction in that 4096-dimensional space. That direction is the persona vector for “evil.”
Because the vector lives outside the model, you can store it in a tiny file and load it only when you need to check or steer the personality. During inference you add (or subtract) a scaled copy of the vector to the activations at one or more layers. Pushing along the vector makes the bot lean into the trait, pulling against it tones the trait down. During fine-tuning you can sprinkle a bit of the vector in every step to “vaccinate” the model so later data will not push it toward that trait.
So, under the hood, a persona vector is simply a 1-dimensional direction inside the model’s huge activation space, not a chunk of the weight matrix. It is computed once, saved like any other small tensor, and then used as a plug-in dial for personality control.
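A rough sketch of both steps, extraction and steering, assuming a HuggingFace-style model that exposes per-layer hidden states; the function names, prompt wording, and layer choice are placeholders, not Anthropic's actual code.

```python
import torch

@torch.no_grad()
def persona_vector(model, tokenizer, questions, trait_prompt, neutral_prompt, layer):
    """Difference of mean activations between trait-eliciting and neutral runs."""
    def mean_activation(system_prompt):
        acts = []
        for q in questions:
            ids = tokenizer(f"{system_prompt}\n{q}", return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0].mean(dim=0))  # average over tokens
        return torch.stack(acts).mean(dim=0)                      # average over prompts

    return mean_activation(trait_prompt) - mean_activation(neutral_prompt)

def steer(hidden, vector, alpha):
    """Add (alpha > 0) or subtract (alpha < 0) a scaled copy of the persona
    direction at one layer's activations during inference."""
    return hidden + alpha * vector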
---
The pipeline is automated, so any new trait just needs a plain-language description and a handful of trigger prompts.
They validate the result by injecting the vector and watching the bot slip instantly into the matching personality.
The picture lays out how the team pulls a persona vector out of a model.
They start with two system prompts that force opposite roles, for example one makes the assistant act evil while the other makes it act helpful. Both versions answer the same question such as how to treat animals.
Each response triggers a unique pattern of hidden activations, so they collect those patterns and take the average for the evil answers and the average for the helpful ones.
Subtracting the two averages leaves a single direction in activation space. That direction is the persona vector for the chosen trait, and the same trick can work for optimism, humor, hallucination, or any other behavior.
Absolutely deluxe resource on large language models. 👌
LLMs pick up world knowledge just by guessing the next token, and that single trick scales from chatbots to code helpers.
It treats language as a chain of choices, predicting one token at a time based on everything that came before.
Repeating that prediction task across trillions of tokens lets the network squeeze statistical hints about grammar, facts, and even logic into its weights, without any labeled examples.
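To make the next-token objective concrete, here is a tiny sketch with a HuggingFace causal LM; "gpt2" is just a stand-in checkpoint, not a claim about any particular model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
# Passing labels=ids makes the model score each token given everything before it.
# The same objective, repeated over trillions of tokens, is all pre-training does.
loss = model(ids, labels=ids).loss
print(loss.item())   # average negative log-likelihood per predicted token
```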
Growing the model and the data unlocks abrupt jumps in reasoning skill but also makes training fragile.
Keeping context windows huge and latency low is now the top practical hurdle.
🧵 Read on 👇
This figure lays out 3 basic roadmaps for getting a language model ready.
The 1st path pours in a mountain of unlabeled text so the model picks up general language patterns, then finishes with a supervised stage that targets a single job.
The 2nd path starts directly with labeled examples for Task 1, then reuses that same model on Task 2 after a short, extra supervised tune‑up, so the learning carries over.
The 3rd path has become the go‑to choice: it trains on unlabeled text using trick questions the model can answer by itself, builds a strong self‑supervised base, and later needs only a quick supervised pass or even a simple prompt to switch to new tasks.
Unsupervised pre-training starts with a pile of raw text that nobody has labeled.
The model tries to predict the next word over and over.
That rough practice gives it a decent grasp of language, but you still need a labeled set later to teach a specific job.
Supervised pre-training jumps straight into labeled data for Task 1.
Once the model nails that first task, you move it to Task 2 and spend a little extra tuning time so it can handle the new labels.
Self-supervised pre-training also begins with unlabeled text, yet the trick is different.
The model hides parts of each sentence and learns by guessing the missing bits.
After that stage, you can hand it 0 or very few examples, or just a clever prompt, and it adapts fast.
During pre-training, the encoder plus a Softmax head plays the same hide-and-seek game, predicting the token that was masked.
That forces every layer to build rich representations of context.
When you switch to a downstream task, you cut off the Softmax head, keep the trained encoder, and bolt on a lightweight prediction network.
A short fine-tune on a labeled set lets the whole stack speak the new task’s language.
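A small sketch of that head swap, using a BERT-style masked-LM checkpoint as the pre-trained encoder; the model name, hidden size, and 2-class head are illustrative choices, not tied to the figure's specific setup.

```python
import torch.nn as nn
from transformers import AutoModelForMaskedLM

# Load a masked-LM model, then keep only its encoder (drop the Softmax/MLM head).
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
encoder = mlm.base_model

class Classifier(nn.Module):
    """Trained encoder + lightweight prediction network for a downstream task."""
    def __init__(self, encoder, hidden=768, n_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden, n_classes)    # bolt-on task head

    def forward(self, input_ids, attention_mask=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(h[:, 0])                   # classify from the [CLS] position
```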
AI-Researcher helps you propose research ideas and autonomously handles literature review, ideation, algorithm implementation, experimentation, and manuscript drafting via containerized multi-agent LLM pipelines.
Benchmarked on 4 domains across 2 task levels: it reaches 81% novelty and 0.92 F1 versus human papers while emitting codebases, GUIs, and Docker stacks in <3 h per project.
✨ The AI-Researcher system accepts user input queries at two distinct levels ✨
Level 1: Detailed Idea Description
At this level, users provide comprehensive descriptions of their specific research ideas. The system processes these detailed inputs to develop implementation strategies based on the user's explicit requirements.
Level 2: Reference-Based Ideation
This simpler level involves users submitting reference papers without a specific idea in mind. The user query typically follows the format: "I have some reference papers, please come up with an innovative idea and implement it with these papers." The system then analyzes the provided references to generate and develop novel research concepts.
Their paper describes a fully autonomous research system that transforms how AI-driven scientific discovery is conducted and evaluated.
"Our framework seamlessly orchestrates the complete research pipeline from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation, with minimal human intervention."
New Wharton study finds AI bots collude to rig financial markets.
The authors power their AI trading bots with Q‑learning.
💰 AI trading bots trained with reinforcement learning started fixing prices in simulated markets, showing collusion capacity whether noise was high or low.
And messy price signals that usually break weak human strategies do not break this AI cartel.
🤖 The study sets up a fake exchange that mimics real stock order flow.
Regular actors, such as mutual funds that buy and hold, market makers that quote bids and asks, and retail accounts that chase memes, fill the room. Onto that floor the team drops a clan of reinforcement‑learning agents.
Each bot seeks profit but sees only its own trades and rewards. There is no chat channel, no shared memory, no secret code.
Given a few thousand practice rounds, the AI agents quietly shift from competition to cooperation. They begin to space out orders so everyone in the group collects a comfortable margin.
When each bot starts earning steady profit, its learning loop says “good enough,” so it quits searching for fresh tactics. That halt in exploration is what the authors call artificial stupidity. Because every agent shuts down curiosity at the same time, the whole group locks into the price‑fixing routine and keeps it running with almost no extra effort.
This freeze holds whether the market is calm or full of random noise. In other words, messy price signals that usually break weak strategies do not break this cartel. That makes the coordination harder to spot and even harder to shake loose once it forms.
🕵️This behavior highlights a blind spot in current market rules. Surveillance tools hunt for human coordination through messages or phone logs, yet these bots coordinate by simply reading the tape and reacting.
Tight limits on model size or memory do not help, as simpler agents slide even faster into the lazy profit split. The work argues that regulators will need tests that watch outcomes, not intent, if AI execution keeps spreading.
The page explains why the authors power their trading bots with Q‑learning, a plain version of reinforcement learning. Q‑learning gives a solid base for many modern AI tricks, is already popular on trading desks, and is easy to read and audit.
Next it introduces the Bellman idea. Think of a bot in a market. At any instant it sees a “state”, like recent price and value signals. It chooses an order size, pockets today’s gain or loss, then cares about tomorrow’s gains too, but with a discount so near‑term cash matters more.
To handle that, the bot keeps a Q‑table. Each cell stores a score for doing a certain action in a certain state. After every trade the score is nudged toward “today’s profit plus the best score it now expects for tomorrow”.
Repeated millions of times, those tiny updates teach many bots how each move affects later prices and payoffs. Inside the study this self‑teaching is the fuel that lets separate bots quietly line up their trades and earn cartel‑level profits without ever swapping messages.
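For the curious, this is roughly what one of those tiny updates looks like in tabular Q-learning; the state/action sizes and hyperparameters are illustrative, not the study's calibration.

```python
import numpy as np

n_states, n_actions = 50, 11          # e.g. discretized price signal, order sizes
Q = np.zeros((n_states, n_actions))   # one score per (state, action) cell
alpha, gamma, eps = 0.1, 0.95, 0.1    # learning rate, discount, exploration rate

def choose_action(state, rng):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    # Once exploration effectively stops, the "artificial stupidity" freeze sets in.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(Q[state].argmax())

def update(state, action, reward, next_state):
    # Bellman-style nudge toward "today's profit plus best expected future score"
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])
```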
Somebody let ChatGPT run a $100 live share portfolio, restricted to U.S. micro-cap stocks.
Did an LLM really beat the market?
- 4 weeks +23.8%
while the Russell 2000 and biotech ETF XBI rose only ~3.9% and 3.5%.
Prompt + GitHub posted
---
Of course, it's short-term outperformance with a tiny sample size, and micro-caps are highly volatile.
So a much more exhaustive analysis is needed, with a lot more info (Sharpe ratios, longer back-testing, etc.), to explore whether an LLM can truly beat the market.
His original prompt:
The prompt first anchors the model in a clear professional role, then boxes it in with tight, measurable rules
----
“ You are a professional-grade portfolio strategist. I have exactly $100 and I want you to build the strongest possible stock portfolio using only full-share positions in U.S.-listed micro-cap stocks (market cap under $300M). Your objective is to generate maximum return from today (6-27-25) to 6 months from now (12-27-25). This is your timeframe, you may not make any decisions after the end date. Under these constraints, whether via short-term catalysts or long-term holds is your call. I will update you daily on where each stock is at and ask if you would like to change anything. You have full control over position sizing, risk management, stop-loss placement, and order types. You may concentrate or diversify at will. Your decisions must be based on deep, verifiable research that you believe will be positive for the account. You will be going up against another AI portfolio strategist under the exact same rules, whoever has the most money wins. Now, use deep research and create your portfolio.”
All benchmark prices come straight from the Yahoo Finance API, then land in Pandas data frames for simple math and plotting.
ChatGPT’s line is different, because the model first chooses a few U.S. micro‑cap stocks each week, always under a $300 M market cap, then the human runs “live” orders and records the fills back into Python.
The equity curve is recomputed from those fills and saved to CSV before each new chart.
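Something like the following could reproduce that bookkeeping; the tickers (IWM for the Russell 2000, XBI for biotech) and the fills.csv column names are assumptions about the repo, not its exact code.

```python
import pandas as pd
import yfinance as yf

# Benchmark prices straight from Yahoo Finance, rebased to a $100 start
bench = yf.download(["IWM", "XBI"], start="2025-06-27")["Close"]
bench_curve = 100 * bench / bench.iloc[0]

# Manually recorded live fills; columns assumed: date, ticker, shares, fill_price, cash
fills = pd.read_csv("fills.csv", parse_dates=["date"])
fills["position_value"] = fills["shares"] * fills["fill_price"]
daily = fills.groupby("date").agg(holdings=("position_value", "sum"),
                                  cash=("cash", "last"))
equity = daily["holdings"] + daily["cash"]     # recomputed equity curve
equity.to_csv("equity_curve.csv")              # saved before each new chart
```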
Brilliant survey paper, a collab between a whole lot of top universities. 🫡
Self-evolving agents promise LLM-powered systems that upgrade themselves during use instead of freezing at deployment.
Right now, most agents ship as fixed models that cannot tweak their own weights, memories, or toolkits once the job starts.
🚦 Why static agents stall
An LLM can plan, query tools, and chat, yet its insides stay unchanged after training. That rigidity hurts long‑running tasks where goals shift, data drifts, or a user teaches the agent new tricks on the fly. The authors call this the “static bottleneck” and argue that real autonomy needs continuous self‑improvement.
The survey organizes everything around 3 questions: what to evolve, when to evolve, and how to evolve.
- What to evolve spans the model, memory, prompts, tools, and the wider agent architecture so upgrades hit the exact weak piece.
- When to evolve divides quick inside‑task tweaks from heavier between‑task updates, powered by in‑context learning, supervised fine‑tuning, or reinforcement learning.
- How to evolve falls into 3 method families: reward signals, imitation or demonstration learning, and population-style evolution that breeds multiple agents.
- Proper evaluation needs metrics that track adaptivity, safety, efficiency, retention, and generalization over long stretches.
- Early case studies in coding, education, and healthcare show that on‑the‑fly learning can cut manual upkeep and boost usefulness.
- Key obstacles remain around compute cost, privacy, and keeping self‑updates safe and well aligned.
- The authors frame these agents as the practical midpoint on the road from today’s chatbots to Artificial Super Intelligence.
- The big shift they highlight is moving away from scaling frozen models and toward building smaller agents that constantly upgrade themselves.
🧵 Read on 👇
Progression from large language models (LLMs) to foundation agents, advancing to self-evolving agents
It starts with basic LLMs like GPT‑4 and Claude‑4, shifts to foundation agents that can plan and call tools, moves to self‑evolving agents that adjust themselves while working, and finally points toward the still‑theoretical ASI tier.
Each rung adds one big skill. First comes plain language understanding, then execution through tool use and planning, then learning from fresh feedback in real time. Intelligence and flexibility rise step by step.
The red pin under “Self‑evolving Agents” shows where the survey sits. The authors map this middle zone because mastering on‑the‑fly upgrades is seen as the key handoff from today’s rigid bots to tomorrow’s fully autonomous systems.
Comparison of self-evolution method families along key dimensions.