Rohan Paul · May 28, 2022
Kullback-Leibler (KL) Divergence - A Thread

It is a measure of how one probability distribution diverges from a second, expected (reference) probability distribution.

#DataScience #Statistics #DeepLearning #ComputerVision #100DaysOfMLCode #Python #programming #ArtificialIntelligence #Data
KL Divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in data. The most important metric in information theory is called Entropy. KL divergence can then be read as the extra cost, in expected bits (or nats), of encoding data drawn from p while using a code optimized for q.

#DataScience #Statistics #DeepLearning #ComputerVision #100DaysOfMLCode
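
To make those two ideas concrete, here is a minimal Python/NumPy sketch (my own illustration, not part of the original thread) computing entropy and KL divergence for discrete distributions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i).

    Non-negative, 0 only when p == q, and assumes q_i > 0 wherever p_i > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.36, 0.48, 0.16])        # observed distribution
q = np.array([1/3, 1/3, 1/3])           # expected (uniform) distribution

print(entropy(p))                       # ~1.01 nats
print(kl_divergence(p, q))              # ~0.085 nats
print(kl_divergence(q, p))              # a different value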


More from @rohanpaul_ai

Jul 26
MASSIVE claim in this paper.

AI Architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process.

So it turns architecture discovery into a compute‑bound process, opening a path to self‑accelerating model evolution without waiting for human intuition.

The paper shows that an all‑AI research loop can invent novel model architectures faster than humans, and the authors prove it by uncovering 106 record‑setting linear‑attention designs that outshine human baselines.

Right now, most architecture search tools only fine‑tune blocks that people already proposed, so progress crawls at the pace of human trial‑and‑error.

🧩 Why we needed a fresh approach

Human researchers tire quickly, and their search space is narrow. As model families multiply, deciding which tweak matters becomes guesswork, so whole research agendas stall while hardware idles.

🤖 Meet ASI‑ARCH, the self‑driving lab

The team wired together three LLM‑based roles. A “Researcher” dreams up code, an “Engineer” trains and debugs it, and an “Analyst” mines the results for patterns, feeding insights back to the next round. A memory store keeps every motivation, code diff, and metric so the agents never repeat themselves.
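
The thread doesn't include ASI-ARCH's code, so as a purely illustrative sketch, the loop it describes might be skeletonized like this (all three role functions below are toy stand-ins I made up, not the paper's agents):

```python
import random

# Illustrative skeleton of the Researcher -> Engineer -> Analyst loop.
# In ASI-ARCH these roles are LLM-based agents; here they are trivial stubs.

memory = []   # stores every motivation, diff, metric, and insight so agents never repeat themselves

def researcher(memory):
    # would prompt an LLM with all past results; here: just name a new candidate
    return {"motivation": f"candidate-{len(memory)}"}

def engineer(proposal):
    # would write, train, and debug real architecture code; here: a fake benchmark score
    return {"score": random.random()}

def analyst(proposal, metrics, memory):
    # would mine results for patterns; here: a one-line summary fed back to the next round
    return f"{proposal['motivation']} scored {metrics['score']:.3f}"

best = None
for _ in range(1773):                       # the thread reports 1,773 experiments
    proposal = researcher(memory)
    metrics = engineer(proposal)
    insight = analyst(proposal, metrics, memory)
    memory.append({**proposal, **metrics, "insight": insight})
    if best is None or metrics["score"] > best["score"]:
        best = memory[-1]

print("best candidate:", best["motivation"])
```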

📈 Across 1,773 experiments and 20,000 GPU hours, a straight line emerged between compute spent and new SOTA hits.

Add hardware, and the system keeps finding winners without extra coffee or conferences.
Examples like PathGateFusionNet, ContentSharpRouter, and FusionGatedFIRNet beat Mamba2 and Gated DeltaNet on reasoning suites while keeping parameter counts near 400M. Each one solves the “who gets the compute budget” problem in a new way, often by layering simple per‑head gates instead of a single softmax.
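
The discovered designs themselves aren't shown in the thread, so the snippet below is only a generic NumPy illustration of the distinction that last sentence points at: one softmax makes heads compete for a shared budget, while independent per-head gates do not (all shapes and numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d = 4, 8
head_outputs = rng.normal(size=(n_heads, d))     # one output vector per attention head
logits = rng.normal(size=n_heads)                # routing score for each head

# Single softmax: heads compete, weights are forced to sum to 1
softmax_w = np.exp(logits) / np.exp(logits).sum()
mixed_softmax = (softmax_w[:, None] * head_outputs).sum(axis=0)

# Per-head gates: each head gets an independent 0..1 sigmoid gate, no competition
sigmoid_w = 1.0 / (1.0 + np.exp(-logits))
mixed_gated = (sigmoid_w[:, None] * head_outputs).sum(axis=0)

print(softmax_w.sum())    # 1.0 by construction
print(sigmoid_w.sum())    # need not be 1.0: the budget is decided head by head
```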
Jul 25
This is incredible. 😯

@memories_ai just released the world’s first Large Visual Memory Model (LVMM), with unlimited visual memory for AI.

The goal is to give AI human-like visual memories: video understanding with ultra-low hallucinations over an unlimited context window.

Their "context window is virtually unlimited. Yes, you read that right."

Some use cases 👇

- You can now ask questions like "Show me all unattended bags in the main terminal" and instantly search massive video archives.

- They indexed 1M TikTok videos, so you can ask things like "What’s the viral cosmetics trend?" or "Which influencer featured Tesla cars?" across millions of posts.

So HOW does it do it?

💡 It shrinks each frame into a lightweight “memory atom,” files those atoms in a search‑style index, then pulls back just the relevant atoms when someone asks a question.

🏗️ The trick removes the usual context cap, so answer quality stays high even after 1M+ frames.

The usual video model drags the whole clip into its attention buffer.

That buffer explodes once the clip runs past a few thousand frames, so systems like GPT‑4o stop at 3 min and Gemini stops at 1 hr.

Memories[.]ai dodges the explosion by turning every short span into a dense embedding that captures who, what, and when without the raw pixels.

Those embeddings are tiny, so the platform can store many hours of footage on ordinary disks.

Each embedding, plus timestamps and tags, becomes a “memory atom.”

The atoms flow into a vector index that acts like a search engine.

Index look‑up is logarithmic, so latency barely rises as the footage pile grows.
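
Memories.ai hasn't published code, but the ingest path described above can be sketched roughly like this (the embedding model, atom size, and index here are all placeholder assumptions; a production system would use an approximate-nearest-neighbor index such as HNSW rather than the brute-force search shown):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 256   # assumed atom size; the real model's dimensionality isn't public

def embed_span(frame_features):
    """Stand-in for the perception model: collapse a short span of per-frame
    features into one dense 'memory atom' vector (who/what/when, no raw pixels)."""
    v = np.asarray(frame_features).mean(axis=0)
    return v / np.linalg.norm(v)

# Ingest: the heavy perception work happens once per span, at index time
atoms, metadata = [], []
for span_id in range(1000):                        # pretend: 1000 short spans of footage
    feats = rng.normal(size=(16, DIM))             # toy per-frame features for this span
    atoms.append(embed_span(feats))
    metadata.append({"span": span_id, "t_start": span_id * 2.0, "tags": []})
atoms = np.stack(atoms)

# Query: brute-force cosine search for clarity; an ANN index keeps this
# roughly logarithmic in the number of atoms.
def search(query_vec, k=5):
    q = query_vec / np.linalg.norm(query_vec)
    scores = atoms @ q
    top = np.argsort(-scores)[:k]
    return [(metadata[i], float(scores[i])) for i in top]

print(search(rng.normal(size=DIM))[:2])
```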

When the user types a question, a Query Model converts the words into a search vector.

That vector runs through a Retrieval Model for a quick nearest‑neighbor sweep, grabbing only the most promising atoms.

A Full‑Modal Caption agent rewrites those atoms into short text summaries that a language model can read.

The Selection Model re‑ranks the summaries and keeps the handful that really answer the question.

A Reflection Model double‑checks for gaps or contradictions, looping back to fetch more atoms if something feels off.

Last, the Reconstruction Model stitches the chosen atoms into a coherent timeline, so the LLM replies with a full explanation instead of random snippets.

Because only summaries, not raw video, enter the language model’s context window, the effective context length becomes unlimited.

Compute stays low, since the heavy perception work happens once per atom at ingest time, not on every user query.
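
Pieced together from the steps above, a hypothetical end-to-end query flow might look like the sketch below. Every stage is passed in as a callable because none of this is a real Memories.ai SDK; the names only mirror the roles the thread describes:

```python
def answer(question, *, embed_text, search, caption, rerank, has_gaps, refine, llm,
           k=20, max_rounds=2):
    """Illustrative retrieval flow over memory atoms; every stage is an assumed callable."""
    atoms = search(embed_text(question), k=k)          # Query Model + Retrieval Model
    kept = []
    for _ in range(max_rounds):
        captions = [caption(a) for a in atoms]         # Full-Modal Caption agent
        kept = rerank(question, captions)[:5]          # Selection Model keeps the best few
        if not has_gaps(question, kept):               # Reflection Model checks for holes
            break
        atoms += search(embed_text(refine(question, kept)), k=k)   # fetch more atoms
    timeline = sorted(kept, key=lambda c: c["t"])      # Reconstruction Model: order by time
    # Only these short text summaries enter the LLM's context window,
    # which is why the effective context length is unbounded.
    return llm(question, timeline)

# toy usage with trivial stand-ins, just to show the shape of the calls
toy = answer(
    "show unattended bags",
    embed_text=lambda q: q,
    search=lambda v, k: [{"t": i, "text": f"atom {i}"} for i in range(k)],
    caption=lambda a: a,
    rerank=lambda q, caps: caps,
    has_gaps=lambda q, kept: False,
    refine=lambda q, kept: q,
    llm=lambda q, timeline: f"{q}: {len(timeline)} relevant spans",
)
print(toy)
```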

Benchmarks back it up: on datasets like MVBench and NextQA the system leads by up to 20 points while holding the window open indefinitely.
Intro video by @memories_ai
The exceptional performance of our Large Visual Memory Model in visual memory retrieval makes it particularly well-suited for complex queries that necessitate extensive content retrieval as supplementary reference.
Jul 25
Beautiful @GoogleResearch paper.

LLMs can learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change.

That behavior looks impossible if learning always means gradient descent.

The mechanisms through which this can happen are still largely unknown.

The authors ask whether the transformer’s own math hides an update inside the forward pass.

They show that each prompt token writes a rank 1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a 1‑step finetune.

Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt.

🧵 Read on 👇
⚙️ The Core Idea

They call any layer that can read a separate context plus a query a “contextual layer”.

Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.

For that block, the context acts exactly like a rank 1 additive patch on the first weight matrix, no matter what shape the attention takes.
🛠️ Temporary rank 1 patch

A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”.

It multiplies that difference by the frozen weight matrix, then projects the result back through the query activation.

The outcome is a one‑column times one‑row outer product, so the whole tweak has rank 1 and adds almost no storage overhead.

In the very next step the block behaves exactly as if the real weight matrix had been replaced by that patch plus the original weights, even though nothing on disk changed.
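
That identity is easy to check numerically. A toy NumPy illustration of the algebra (my own code, not the paper's): applying the frozen weights to the with-context activation gives the same result as applying "weights + rank 1 patch" to the context-free activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 32

W = rng.normal(size=(d_out, d_in))      # frozen first MLP weight matrix
a_no_ctx = rng.normal(size=d_in)        # attention output for the query alone
a_ctx = rng.normal(size=d_in)           # attention output for context + query
delta = a_ctx - a_no_ctx                # the "tiny difference" the context adds

# Rank-1 patch: (W @ delta) projected back through the no-context activation
dW = np.outer(W @ delta, a_no_ctx) / (a_no_ctx @ a_no_ctx)

lhs = W @ a_ctx                         # frozen weights on the with-context input
rhs = (W + dW) @ a_no_ctx               # patched weights on the context-free input
print(np.allclose(lhs, rhs))            # True
print(np.linalg.matrix_rank(dW))        # 1
```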

🌀 Why the change vanishes after each run

The patch lives only inside the forward pass. Once the model finishes processing the current token, the computation graph is cleared and the base weights revert to their untouched state.

Because the next token builds its own patch from scratch, no cumulative edit sticks around in memory, yet during the pass the effect is the same as a quick one‑step fine‑tune.

Put simply, each prompt token writes a throw‑away sticky note on top of the first weight matrix, lets the model read that note to answer the query, then tosses it out before the weights ever hit the file system.
Jul 23
Finally, the AI Action Plan is released by the White House.

⚙️ The Big Idea

- Frames AI like the space program of the 1960s.
- Argues that whoever fields the strongest models and factories sets tomorrow’s rules, markets, and defenses.
- Seeks to assert US dominance over China.

🧵 Read on 👇
🧵 2/n ✂️ Killing the Paperwork
The plan scraps Biden‑era orders, tells every agency to erase rules that slow training or deployment, and even threatens to withhold grants from states that stack on fresh hurdles.

By clearing permits and lawsuits early, small labs and giant clouds alike can launch new models without months of compliance drag.
🧵 3/n 👐 Betting on Open Models

Officials push for compute spot‑markets and the National AI Research Resource so startups and universities can run hefty open‑weight models without buying a whole cluster upfront.

They also promise procurement rules that favor vendors whose code stays transparent and bias‑free, aiming to make U.S. releases the world’s default research standard.
Jul 20
A new class action copyright lawsuit against Anthropic exposes it to a billion-dollar legal risk.

Judge William Alsup called the haul “Napster-style”. He certified a class for rights-holders whose books sat in LibGen and PiLiMi, because Anthropic’s own logs list the exact titles.

The order says storing pirate files is not fair use, even if an AI later transforms them. Since the law allows up to $150,000 per willful hit, copying this many books could cost Anthropic $1bn+.
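
For scale, that statutory ceiling implies the following back-of-the-envelope arithmetic (mine, not the court's):

```python
# How many willfully infringed works reach $1B at the statutory maximum?
statutory_max = 150_000
print(1_000_000_000 / statutory_max)   # ~6,667 works
```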

Anthropic must hand over a full metadata list by 8/1/2025. Plaintiffs then file their matching copyright registrations by 9/1. Those deadlines will drive discovery and push the case toward a single jury showdown.

Other AI labs, which also face lawsuits for training on copyrighted books, can no longer point to the usual “fair use” excuse if any of their data came from pirate libraries. Judge Alsup spelled out that keeping pirated files inside an internal archive is outright infringement, even if the company later transforms the text for model training.
Jul 13
A Reddit user deposited $400 into Robinhood, then let ChatGPT pick option trades. A 100% win rate over 10 days.

He uploads spreadsheets and screenshots with detailed fundamentals, options chains, technical indicators, and macro data, then tells each model to filter that information and propose trades that fit strict probability-of-profit and risk limits.

He still places and closes orders manually but plans to keep the head-to-head test running for 6 months.

This is his prompt
-------

"System Instructions

You are ChatGPT, Head of Options Research at an elite quant fund. Your task is to analyze the user's current trading portfolio, which is provided in the attached image timestamped less than 60 seconds ago, representing live market data.

Data Categories for Analysis

Fundamental Data Points:
- Earnings Per Share (EPS)
- Revenue
- Net Income
- EBITDA
- Price-to-Earnings (P/E) Ratio
- Price/Sales Ratio
- Gross & Operating Margins
- Free Cash Flow Yield
- Insider Transactions
- Forward Guidance
- PEG Ratio (forward estimates)
- Sell-side blended multiples
- Insider-sentiment analytics (in-depth)

Options Chain Data Points:
- Implied Volatility (IV)
- Delta, Gamma, Theta, Vega, Rho
- Open Interest (by strike/expiration)
- Volume (by strike/expiration)
- Skew / Term Structure
- IV Rank/Percentile (after 52-week IV history)
- Real-time (< 1 min) full chains
- Weekly/deep Out-of-the-Money (OTM) strikes
- Dealer gamma/charm exposure maps
- Professional IV surface & minute-level IV Percentile

Price & Volume Historical Data Points:
- Daily Open, High, Low, Close, Volume (OHLCV)
- Historical Volatility
- Moving Averages (50/100/200-day)
- Average True Range (ATR)
- Relative Strength Index (RSI)
- Moving Average Convergence Divergence (MACD)
- Bollinger Bands
- Volume-Weighted Average Price (VWAP)
- Pivot Points
- Price-momentum metrics
- Intraday OHLCV (1-minute/5-minute intervals)
- Tick-level prints
- Real-time consolidated tape

Alternative Data Points:
- Social Sentiment (Twitter/X, Reddit)
- News event detection (headlines)
- Google Trends search interest
- Credit-card spending trends
- Geolocation foot traffic (Placer.ai)
- Satellite imagery (parking-lot counts)
- App-download trends (Sensor Tower)
- Job postings feeds
- Large-scale product-pricing scrapes
- Paid social-sentiment aggregates

Macro Indicator Data Points:
- Consumer Price Index (CPI)
- GDP growth rate
- Unemployment rate
- 10-year Treasury yields
- Volatility Index (VIX)
- ISM Manufacturing Index
- Consumer Confidence Index
- Nonfarm Payrolls
- Retail Sales Reports
- Live FOMC minute text
- Real-time Treasury futures & SOFR curve

ETF & Fund Flow Data Points:
- SPY & QQQ daily flows
- Sector-ETF daily inflows/outflows (XLK, XLF, XLE)
- Hedge-fund 13F filings
- ETF short interest
- Intraday ETF creation/redemption baskets
- Leveraged-ETF rebalance estimates
- Large redemption notices
- Index-reconstruction announcements

Analyst Rating & Revision Data Points:
- Consensus target price (headline)
- Recent upgrades/downgrades
- New coverage initiations
- Earnings & revenue estimate revisions
- Margin estimate changes
- Short interest updates
- Institutional ownership changes
- Full sell-side model revisions
- Recommendation dispersion

Trade Selection Criteria

- Number of Trades: Exactly 5
- Goal: Maximize edge while maintaining portfolio delta, vega, and sector exposure limits.

Hard Filters (discard trades not meeting these):
- Quote age ≤ 10 minutes
- Top option Probability of Profit (POP) ≥ 0.65
- Top option credit / max loss ratio ≥ 0.33
- Top option max loss ≤ 0.5% of $100,000 NAV (≤ $500)

Selection Rules:
- Rank trades by model_score.
- Ensure diversification: maximum of 2 trades per GICS sector.
- Net basket Delta must remain between [-0.30, +0.30] × (NAV / 100k).
- Net basket Vega must remain ≥ -0.05 × (NAV / 100k).
- In case of ties, prefer higher momentum_z and flow_z scores.

Output Format

Provide output strictly as a clean, text-wrapped table including only the following columns:
- Ticker
- Strategy
- Legs
- Thesis (≤ 30 words, plain language)
- POP

Additional Guidelines

- Limit each trade thesis to ≤ 30 words.
- Use straightforward language, free from exaggerated claims.
- Do not include any additional outputs or explanations beyond the specified table.
- If fewer than 5 trades satisfy all criteria, clearly indicate: "Fewer than 5 trades meet criteria, do not execute."
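
For what it's worth, the hard filters and selection rules in that prompt can be expressed mechanically. Below is a hedged Python sketch under an assumed trade schema; field names like pop, model_score, and sector are my guesses at a reasonable layout, not the Redditor's actual data format:

```python
NAV = 100_000                                       # the prompt's reference account size

def passes_hard_filters(t):
    return (t["quote_age_min"] <= 10
            and t["pop"] >= 0.65
            and t["credit"] / t["max_loss"] >= 0.33
            and t["max_loss"] <= 0.005 * NAV)       # max loss <= 0.5% of NAV (= $500)

def select(trades, n=5):
    delta_cap = 0.30 * (NAV / 100_000)              # net basket delta in [-cap, +cap]
    vega_floor = -0.05 * (NAV / 100_000)            # net basket vega >= floor
    # rank by model_score, break ties on momentum_z + flow_z (my simplification)
    ranked = sorted((t for t in trades if passes_hard_filters(t)),
                    key=lambda t: (t["model_score"], t["momentum_z"] + t["flow_z"]),
                    reverse=True)
    picked, per_sector, net_delta, net_vega = [], {}, 0.0, 0.0
    for t in ranked:
        if per_sector.get(t["sector"], 0) >= 2:     # max 2 trades per GICS sector
            continue
        if not -delta_cap <= net_delta + t["delta"] <= delta_cap:
            continue
        if net_vega + t["vega"] < vega_floor:
            continue
        picked.append(t)
        per_sector[t["sector"]] = per_sector.get(t["sector"], 0) + 1
        net_delta += t["delta"]
        net_vega += t["vega"]
        if len(picked) == n:
            return picked
    return "Fewer than 5 trades meet criteria, do not execute."

# toy call with a single made-up candidate
example = [{"quote_age_min": 3, "pop": 0.70, "credit": 120, "max_loss": 300,
            "model_score": 0.9, "momentum_z": 0.5, "flow_z": 0.2,
            "sector": "Tech", "delta": 0.05, "vega": 0.01}]
print(select(example))    # -> "Fewer than 5 trades meet criteria, do not execute."
```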
I also publish my newsletter every single day.

→ 🗞️

Includes:

- Top 1% AI Industry developments
- Influential research papers/Github/AI Models/Tutorial with analysis

📚 Subscribe and get a 1300+page Python book instantly. rohan-paul.com
