Rohan Paul
May 28, 2022 · 6 tweets
Kullback-Leibler (KL) Divergence - A Thread

It measures how one probability distribution diverges from a second, expected probability distribution.

#DataScience #Statistics #DeepLearning #ComputerVision #100DaysOfMLCode #Python #programming #ArtificialIntelligence #Data
KL Divergence has its origins in information theory, whose primary goal is to quantify how much information is in data. The most important metric in information theory is entropy.

#DataScience #Statistics #DeepLearning #ComputerVision #100DaysOfMLCode
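
A minimal NumPy sketch of entropy and KL divergence (the example distributions below are my own, not from the thread):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); asymmetric and >= 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])    # observed distribution
q = np.array([0.8, 0.15, 0.05])  # expected / reference distribution
print(entropy(p))                # ~0.94 nats
print(kl_divergence(p, q))       # ~1.34, q is a poor model of p
print(kl_divergence(p, p))       # 0.0, a distribution never diverges from itself
```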


More from @rohanpaul_ai

Sep 14
One of the best papers of the recent week.

The big takeaway: scaling up model size doesn’t just make models smarter in terms of knowledge, it makes them last longer on multi-step tasks, which is what really matters for agents.

Shows that small models can usually do one step perfectly, but when you ask them to keep going for many steps, they fall apart quickly.

Even if they never miss on the first step, their accuracy drops fast as the task gets longer.

Large models, on the other hand, stay reliable across many more steps, even though the basic task itself doesn’t require extra knowledge or reasoning.

The paper says this is not because big models "know more," but because they are better at consistently executing without drifting into errors.

The paper names a failure mode called self-conditioning, where seeing earlier mistakes causes more mistakes, and they show that with thinking steps GPT-5 runs 1000+ steps in one go while others are far lower.

🧵 Read on 👇
🧵2/n. 🧠 The idea

The work separates planning from execution, then shows that even when the plan and the needed knowledge are handed to the model, reliability drops as the task gets longer, which makes small accuracy gains suddenly matter a lot.
🧵3/n. Even a tiny accuracy boost at the single-step level leads to exponential growth in how long a model can reliably execute a full task.

This is why scaling up models is still worth it, even if short benchmarks look like progress is stalling.

On the left, you see that step accuracy (how often the model gets each small move right) is almost flat, barely improving with newer models.

That looks like diminishing returns, because each release is only slightly better at a single step.

But on the right, when you extend that tiny step improvement across many steps in a row, the gains explode.

Task length (how long a model can keep going without failing) jumps from almost nothing to thousands of steps.
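
A back-of-the-envelope sketch of why near-flat step accuracy still buys exponentially longer horizons, assuming independent steps (the paper's self-conditioning effect makes the real decay worse than this):

```python
import math

def horizon(step_acc, target=0.5):
    """Longest n for which P(all n steps correct) = step_acc**n stays above
    `target`, under the simplifying assumption that steps are independent."""
    return math.floor(math.log(target) / math.log(step_acc))

for p in [0.99, 0.995, 0.999]:
    print(f"step accuracy {p:.3f} -> ~{horizon(p)} steps at 50% task success")
# 0.990 -> ~68 steps
# 0.995 -> ~138 steps
# 0.999 -> ~692 steps
```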
Sep 14
🧠 🧩 🚦 📉 ⚙️ University of Sheffield argues LLM hallucinations are mathematically inevitable.

And using confidence thresholds the way OpenAI proposes would cut hallucinations but break consumer UX and spike costs.

The core claim is that next-token generation stacks errors across a sentence, so even with perfect data the total mistake rate grows.

A language model builds a sentence word by word. At each step, it picks the next word it thinks is most likely. If it makes one small mistake early on, that mistake affects the words that come after it. The sentence then drifts further from the correct answer.

Now compare that to a yes/no question. The model only has to pick between two options: “yes” or “no.” There is just one decision, so fewer chances for error.

An "yes/no question" is like a baseline: one single prediction, no chain of dependencies. But a sentence is a long chain of predictions, and each link in the chain can go wrong.

This is why the study says the error rate for full sentences will always be at least 2 times higher than for simple yes/no answers. Because in sentences, errors can accumulate word by word, instead of being contained in a single decision.

In plain terms, incentives today still favor fast, cheap, confident replies over slower, cautious, correct ones, so hallucinations will stick around.
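
A toy calculation of the compounding argument above (the 2% per-prediction error rate and 20-token length are my own illustrative numbers, not the paper's):

```python
# A yes/no answer is one prediction, a sentence is a chain of predictions.
per_token_error = 0.02          # assumed error rate of a single prediction
sentence_length = 20            # assumed number of dependent predictions

p_binary_wrong = per_token_error
p_sentence_wrong = 1 - (1 - per_token_error) ** sentence_length

print(f"yes/no error:   {p_binary_wrong:.1%}")    # 2.0%
print(f"sentence error: {p_sentence_wrong:.1%}")  # ~33.2%, far more than 2x
```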
2/n. Rarer facts are even more prone to hallucinations.

When a model is trained, it learns facts from the data it sees. Some facts show up many times during training, so the model gets strong evidence for them. Other facts only appear once or twice, so the model has weak evidence for them.

The study gives an example with birthdays. Suppose 20% of the people in the training set only have their birthday mentioned once. For those people, the model basically has just a single memory of the fact. That is too little for the model to reliably recall it later.

As a result, when you ask about those birthdays, the model will likely get at least 20% of them wrong, because those facts were too rare in its training data.
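
A quick simulation of that birthday example (the recall assumptions are mine; the 20% singleton share is the figure quoted above):

```python
import random

random.seed(0)
n_people = 10_000
singleton_share = 0.20            # 20% of birthdays appear exactly once in training

wrong = 0
for _ in range(n_people):
    if random.random() < singleton_share:
        # Seen once: assume recall is essentially a guess among 365 days.
        wrong += random.randrange(365) != 0
    # Facts seen many times are optimistically assumed to be recalled perfectly.

print(f"error rate ≈ {wrong / n_people:.1%}")  # ≈ 20%, the singleton-share floor
```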
3/n. They also audit benchmarks and find 9 out of 10 use binary grading that gives 0 for “I do not know,” which mathematically rewards confident guessing over honest uncertainty.

OpenAI’s fix is to make the model answer only when its internal confidence clears a bar, for example 75%, which reduces false claims by encouraging abstention when unsure.

The tradeoff is user experience, since a realistic threshold could make the assistant say “I do not know” on roughly 30% of everyday queries.

There is also a compute tax, because estimating calibrated confidence and exploring alternatives means more forward passes and routing, so per-query cost climbs at consumer scale.

The calculus shifts in high-stakes settings, where the cost of a wrong answer beats the extra compute, so confidence gating and active questioning make sense.
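
A small expected-score sketch of why 0/1 grading pushes models to guess. The 75% threshold quoted above is the paper's example; the penalty value below is my own illustrative choice:

```python
def expected_score_guess(p_correct):
    return p_correct * 1 + (1 - p_correct) * 0   # wrong answers cost nothing

def expected_score_abstain():
    return 0.0                                    # "I do not know" scores 0

p = 0.30   # the model is only 30% sure
print(expected_score_guess(p), expected_score_abstain())  # 0.3 > 0.0 -> guessing wins

# A rule that penalizes confident errors flips the incentive:
def expected_score_penalized(p_correct, penalty=3):
    return p_correct - (1 - p_correct) * penalty

print(expected_score_penalized(p))  # ≈ -1.8 -> better to abstain when unsure
```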
Sep 12
Congrats to @CreaoAI for hitting #1 on Product Hunt (Sept 11) 🚀

Just used it myself, and it was quite a smooth experience.

CREAO is an AI Agent that builds full-stack mini-SaaS from one sentence.

One sentence in → frontend + backend + data layer out.

They are building a platform to provide the critical interface for people to build apps where humans and AI agents can collaborate seamlessly.

So its entire infrastructure is engineered with an "AI-native first" philosophy.

🧵1/n.
🧵2/n. ⚡ All-in-one build.

CREAO gave me a deployable product — frontend, backend, database together.

#1 on Product Hunt (Sept 11).
🧵3/n. 🔌 I connected real-time APIs via the MCP marketplace.

In my view, this is where AI tooling has to go: not isolation, but live data + integrations.
Sep 11
🇨🇳China unveils world's first brain-like AI Model SpikingBrain1.0

Upto 100X faster while being trained on less than 2% of the data typically required.

Designed to mimic human brain functionality, uses much less energy. A new paradigm in efficiency and hardware independence.

Marks a significant shift from current AI architectures.

Unlike models such as GPT and LLaMA, which use attention mechanisms to process all input in parallel, SpikingBrain1.0 employs localized attention, focusing only on the most relevant recent context.

Potential Applications:

- Real-time, low-power environments
- Autonomous drones and edge computing
- Wearable devices requiring efficient processing
- Scenarios where energy consumption is critical

This project is part of a larger scientific pursuit of neuromorphic computing, which aims to replicate the remarkable efficiency of the human brain, an organ that operates on only about 20 watts of power.

---

arxiv.org/abs/2509.05276
🧠 The idea: human-brain-inspired linear and hybrid-linear LLMs, the basis of the SpikingBrain architecture.

- SpikingBrain replaces most quadratic attention with linear and local attention, mixes in selective full attention where it matters, and adds an adaptive spiking activation so the model computes only on meaningful events.

- It proves the whole recipe works at scale by training and serving on MetaX C550 GPUs, which are non‑NVIDIA devices, without giving up quality on common benchmarks.

- The headline efficiencies come from 3 levers working together: linear attention for compressed memory, MoE for token-wise sparsity, and spiking for micro-level sparsity.
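
A rough sketch of the local-attention idea from the bullets above (the window size and masking details are illustrative, not the paper's exact kernel):

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """Causal sliding-window mask: token i may attend only to tokens
    in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = local_attention_mask(seq_len=8, window=3)
print(mask.astype(int))
# Each row has at most 3 ones, so attention cost grows roughly linearly with
# sequence length instead of quadratically as with full causal attention.
```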
🛠️ Training without starting from scratch

They do a conversion‑based continual pre‑training, not full pre‑training, by remapping QKV weights from a Transformer checkpoint into linear and local attention, then training for ~150B tokens across 8k, 32k, and 128k contexts.

Because the converted attention maps stay close to the original softmax map, the model converges quickly and avoids the ~10T token budgets seen in many scratch runs; the ~150B tokens used here is <2% of a typical budget.

Post‑training then adds instruction following and reasoning in 3 short stages without harming the base capabilities.
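
A hypothetical PyTorch sketch of the conversion idea: copy a pretrained fused QKV projection into a linear-attention layer, then continue pre-training. The module name, the elu+1 feature map, and the non-causal formulation are my assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttentionFromCheckpoint(nn.Module):
    def __init__(self, pretrained_qkv: nn.Linear):
        super().__init__()
        self.qkv = nn.Linear(pretrained_qkv.in_features,
                             pretrained_qkv.out_features, bias=False)
        self.qkv.weight.data.copy_(pretrained_qkv.weight.data)  # remap QKV weights

    def forward(self, x):                              # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1              # positive feature map
        kv = torch.einsum("bsd,bse->bde", k, v)        # compressed memory, linear in seq
        norm = torch.einsum("bsd,bd->bs", q, k.sum(dim=1)) + 1e-6
        return torch.einsum("bsd,bde->bse", q, kv) / norm.unsqueeze(-1)
```

Starting from weights that approximate the original softmax attention map is what lets continual pre-training converge on ~150B tokens instead of a full from-scratch run.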
Sep 11
Fantastic paper from ByteDance 👏

Shows how to train LLM agents to finish long, multi step tasks by letting them act in real environments with reinforcement learning.

Across 27 tasks, the trained agents rival or beat top proprietary models.

Most agents are trained on single turn data, so they fail when a job needs many decisions with noisy feedback.

AgentGym-RL splits the system into separate parts, the environments, the agent loop, and training, so each can improve on its own.

It supports mainstream algorithms and realistic tasks, and the agent learns by acting, seeing results, and adjusting across different settings.

The key method, ScalingInter-RL, starts with short interactions to master basics, then slowly allows longer runs so the agent can explore and plan.

This staged horizon schedule stabilizes learning, prevents pointless loops, and encourages planning, reflection, and recovery after mistakes.

A 7B model trained with this setup matches or beats much larger open models and competes well with strong commercial ones.

They also find that putting more compute into training and test time interaction, like more steps or samples, often helps more than adding parameters.
How the AgentGym-RL framework works.

At the center is the LLM agent. It takes an instruction, interacts with an environment for several turns, and then produces actions. Each action changes the environment, and the environment sends feedback back to the agent. This cycle repeats many times.

The environment itself is handled by a server that can simulate different types of tasks. These include web browsing, searching, coding, playing games, doing science tasks, or controlling embodied agents. The environment client manages the interaction and communicates through standard protocols.

Every full cycle of actions and observations is called a trajectory. These trajectories are collected and then used to update the agent’s policy with reinforcement learning algorithms like PPO, GRPO, RLOO, or REINFORCE++.

The framework is modular. The environment, the agent, and the training part are separated. This makes it flexible, easy to extend, and suitable for many types of realistic tasks.

The diagram highlights how the agent learns not by memorizing answers, but by trying actions, getting feedback, and improving its decision making across different domains.
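
A generic sketch of that interaction loop (the agent/environment interface below is my own pseudo-API, not AgentGym-RL's actual one):

```python
def collect_trajectory(agent, env, instruction, max_turns=30):
    """One episode of the loop: observe, act, get feedback, repeat."""
    obs = env.reset(instruction)
    trajectory = []
    for _ in range(max_turns):
        action = agent.act(obs)                     # LLM decides the next action
        next_obs, reward, done = env.step(action)   # environment sends feedback
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory

# Many such trajectories are then used to update the policy with an RL
# algorithm such as PPO, GRPO, RLOO, or REINFORCE++.
```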
The idea behind ScalingInter-RL, the training method used in the paper.

If an agent is trained with only short interactions, it learns to handle easy tasks but fails on harder ones. If it is trained with very long interactions from the start, it wastes effort, falls into repeated mistakes, or even collapses and performs poorly.

ScalingInter-RL solves this by gradually increasing the number of interaction steps during training. At first, the agent works with short horizons to master the basics and build reliable skills.

Then, the horizon is expanded in stages, letting the agent explore more, refine its behavior, and learn how to recover from errors.

By the final stages, the agent can manage long, complex tasks because it has grown its abilities step by step instead of being overloaded too early. This staged process makes training stable and produces stronger agents.
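
A minimal sketch of a staged-horizon schedule in the spirit of ScalingInter-RL (the stage boundaries and horizon values are illustrative, not the paper's):

```python
def horizon_schedule(training_step, stages=((0, 5), (2000, 15), (6000, 40))):
    """Return the max number of interaction turns allowed at this training step."""
    allowed = stages[0][1]
    for start, horizon in stages:
        if training_step >= start:
            allowed = horizon
    return allowed

for step in [0, 1000, 3000, 8000]:
    print(step, horizon_schedule(step))  # 5, 5, 15, 40
```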
Sep 9
📢 Another brilliant piece of research just dropped from @GoogleResearch - a major advancement toward a systematic way to generate expert-level scientific software automatically.

An LLM plus tree search turns scientific coding into a score driven search engine.

This work builds an LLM + Tree Search loop that writes and improves scientific code by chasing a single measurable score for each task.

The key idea is to treat coding for scientific tasks as a scorable search problem.

That means every candidate program can be judged by a simple numeric score, like how well it predicts, forecasts, or integrates data. Once you have a clear score, you can let an LLM rewrite code again and again, run the code in a sandbox, and use tree search to keep the best branches while discarding weaker ones.

With compact research ideas injected into the prompt, the system reaches expert level and beats strong baselines across biology, epidemiology, geospatial, neuroscience, time series, and numerical methods.

Training speed: less than 2 hours on 1 T4 vs 36 hours on 16 A100s.

In bioinformatics, it came up with 40 new approaches for single-cell data analysis that beat the best human-designed methods on a public benchmark.

In epidemiology, it built 14 models that set state-of-the-art results for predicting COVID-19 hospitalizations.

🧵 Read on 👇
🧵2/n. ⚙️ The Core Concepts

Empirical software is code built to maximize a quality score on observed data, and any task that fits this framing becomes a scorable task.

This view turns software creation into a measurable search problem, because every candidate program is judged by the same numeric target.

This framing also explains why the method can travel across domains, since only the scoring function changes.
🧵3/n. This figure breaks down how the system works.

The top-left part shows the workflow. A scorable problem and some research ideas are given to an LLM, which then generates code. That code is run in a sandbox to get a quality score. Tree search is used to decide which code branches to keep improving, balancing exploration of new ideas with exploitation of ones that already look promising.

On the right, different ways of feeding research ideas into the system are shown. Ideas can come from experts writing direct instructions, from scientific papers that are summarized, from recombining prior methods, or from LLM-powered deep research. These sources make the search more informed and help the model produce stronger, more competitive solutions.

So overall, the loop of tree search plus targeted research ideas turns an LLM from a one-shot code generator into a system that steadily climbs toward expert-level performance.
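
A hedged sketch of that score-driven loop; llm_rewrite and run_and_score below stand in for the components the paper describes and are not real APIs:

```python
import random

def tree_search(seed_program, llm_rewrite, run_and_score, iterations=100, beam=5):
    """Keep a small frontier of (score, program) candidates, expand one at a
    time with an LLM rewrite, score it in a sandbox, and prune to the best."""
    frontier = [(run_and_score(seed_program), seed_program)]
    for _ in range(iterations):
        _, program = random.choice(frontier)          # pick a branch to expand
        child = llm_rewrite(program)                  # LLM proposes an improved version
        frontier.append((run_and_score(child), child))
        frontier.sort(key=lambda t: t[0], reverse=True)
        frontier = frontier[:beam]                    # keep only the promising branches
    return frontier[0]                                # best (score, program) found
```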