🌍 Compiling in real-time, the race towards AGI.
🗞️ Don't miss my daily top 1% AI analysis newsletter directly to your inbox 👉 https://t.co/6LBxO8215l
4 subscribers
Aug 3 • 9 tweets • 5 min read
This paper will change how we think about LLM inferencing. 🔥
An alternative to Chain-Of-Thought, instead nspired by how the human brain utilizes distinct systems for slow, deliberate planning and fast, intuitive computation.
Gives us super fast reasoning vs SOTA LLMs with just 1,000 training examples and a 27mn param model.
Unbelievable how a tiny model from a tiny lab of Tsinghua + deep thinking, gets 40% on ARC-AGI and tear into complex sudoku and maze puzzles. 🤯
It removes token-by-token chain-of-thought generation.
Instead, Hierarchical Reasoning Model's (HRM) parallel processing allows for what Wang (Founder and CEO of Sapient Intelligence) estimates could be a “100x speedup in task completion time.”
This means lower inference latency and the ability to run powerful reasoning on edge devices.
📢 There are 3 efficiency techniques
a. Single forward pass reasoning: HRM performs all reasoning inside its hidden states and emits an answer in 1 network pass, while CoT-style LLMs build a long text trace first.
b. Constant-memory training: By replacing back-propagation-through-time with a 1-step gradient, training memory stays at O(1) instead of O(T), which shortens training iterations and improves GPU utilisation.
c. Adaptive Computation Time (ACT): ACT version averages only about one third of the compute steps of a fixed-depth baseline yet keeps the same accuracy, so inference cost per example drops roughly 2-3X, not 100X.
🔧 Final Takeaway
HRM hints that swapping endless layers for a small hierarchy plus cheap recurrence can give LLM‑level reasoning at Raspberry‑Pi costs. It also scales at inference: raise the ACT cap, and accuracy climbs further with no retraining.
🧵 Read on 👇
🧵 2/n Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs.
For example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% accuracy).
Aug 2 • 8 tweets • 3 min read
Experts can now identify and track you using Wi-Fi signals that bounce off your body - and its over 95% accurate.
A new surveillance method identifies and tracks you using Wi-Fi signals — without needing a phone, camera, or wearable.
Developed by researchers at La Sapienza University of Rome, the system has been dubbed "WhoFi."
Ir reads how Wi-Fi waves interact with a person’s body, essentially creating a unique biometric “fingerprint” based on the way wireless signals bounce off them.
This allows a person to be identified and re-identified across rooms and even different locations, all without visible technology or consent.
🔦 Why swap cameras for radio waves
Wi-Fi keeps working when lights are off, walls block sight, or crowds get in the way. A router sends a signal, the air, walls, bones, and backpacks bend that signal, and every body shape leaves its own tiny distortions. The paper grabs those distortions, known as Channel State Information, as a privacy-friendly fingerprint.
Unlike previous attempts at similar tracking, which topped out at 75% accuracy, WhoFi leverages neural networks and standard, low-cost Wi-Fi routers to achieve unprecedented precision.
The implications are enormous: this could revolutionize everything from retail analytics to law enforcement surveillance, raising pressing questions about privacy.
The system works even through walls and in the dark, potentially making it more powerful than traditional camera systems. While still in the experimental stage, the technology’s reliance on widely available hardware suggests it could be deployed at scale sooner than most would expect.
🧬 What lives inside a Wi-Fi packet
Each packet carries amplitude, how strong the signal arrives, and phase, how the wave shifts over time. Noise and hardware drift skew both pieces, so the team uses median filters for amplitude and a simple line-fitting trick for phase to clean things up.
After a pass of Gaussian noise, random scaling, and tiny time shifts, the data is ready for the network.
Aug 2 • 8 tweets • 6 min read
Anthropic just showed that an AI's “personality” can be traced to specific directions in its brain ("Persona vectors"), and shows what might make it act in evil or unsafe ways.
Sometimes when you're chatting with a model, it suddenly starts behaving oddly—overly flattering, factually wrong, or even outright malicious. This paper is about understanding why that happens, and how to stop it.
🧠 What's going on inside these models?
AI models don’t actually have personalities like humans do, but they sometimes act like they do—especially when prompted a certain way or trained on particular data.
Anthropic’s team found that specific behaviors, like being “evil,” “sycophantic,” or prone to “hallucination,” show up as linear directions inside the model's activation space.
They call these persona vectors.
Think of it like this: if you observe how the model responds in different situations, you can map those behaviors to certain regions inside the model’s brain. And if you spot where these traits live, you can monitor and even control them.
---
The diagram shows a simple pipeline that turns a plain description of a trait such as evil into a single “persona vector”, which is just a pattern of activity inside the model that tracks that trait.
Once this vector exists, engineers can watch the model’s activations and see in real time if the model is drifting toward the unwanted personality while it runs or while it is being finetuned.
The very same vector works like a control knob.
Subtracting it during inference tones the trait down, and sprinkling a small amount of it during training teaches the model to resist picking that trait up in the first place, so regular skills stay intact.
Because each piece of training text can also be projected onto the vector, any snippet that would push the model toward the trait lights up early, letting teams filter or fix that data before it causes trouble.
Al that means, you can control the following of a model
- Watch how a model’s personality evolves, either while chatting or during training
- Control or reduce unwanted personality changes as the model is being developed or trained
- Figure out what training data is pushing those changes
🧵 Read on 👇
🔬 How to make sense of this persona vector?
Think of a large language model as a machine that turns every word it reads into a long list of numbers. That list is called the activation vector for that word, and it might be 4096 numbers long in a model the size of Llama-3.
A persona vector is another list of the same length, but it is not baked into the model’s weights. The team creates it after the model is already trained:
They run the model twice with the same user question, once under a “be evil” system prompt and once under a “be helpful” prompt.
They grab the hidden activations from each run and average them, so they now have two mean activation vectors.
They subtract the helpful average from the evil average. The result is a single direction in that 4096-dimensional space. That direction is the persona vector for “evil.”
Because the vector lives outside the model, you can store it in a tiny file and load it only when you need to check or steer the personality. During inference you add (or subtract) a scaled copy of the vector to the activations at one or more layers. Pushing along the vector makes the bot lean into the trait, pulling against it tones the trait down. During fine-tuning you can sprinkle a bit of the vector in every step to “vaccinate” the model so later data will not push it toward that trait.
So, under the hood, a persona vector is simply a 1-dimensional direction inside the model’s huge activation space, not a chunk of the weight matrix. It is computed once, saved like any other small tensor, and then used as a plug-in dial for personality control.
---
The pipeline is automated, so any new trait just needs a plain-language description and a handful of trigger prompts.
They validate the result by injecting the vector and watching the bot slip instantly into the matching personality.
Aug 2 • 10 tweets • 6 min read
Absolutely deluxe resource on large language models. 👌
LLMs pick up world knowledge just by guessing the next token, and that single trick scales from chatbots to code helpers.
It treats language as a chain of choices, predicting one token at a time based on everything that came before.
Repeating that prediction task across trillions of tokens lets the network squeeze statistical hints about grammar, facts, and even logic into its weights, without any labeled examples.
Growing the model and the data unlocks abrupt jumps in reasoning skill but also makes training fragile.
Keeping context windows huge and latency low is now the top practical hurdle.
🧵 Read on 👇
This figure lays out 3 basic roadmaps for getting a language model ready.
The 1st path pours in a mountain of unlabeled text so the model picks up general language patterns, then finishes with a supervised stage that targets a single job.
The 2nd path starts directly with labeled examples for Task 1, then reuses that same model on Task 2 after a short, extra supervised tune‑up, so the learning carries over.
The 3rd path has become the go‑to choice: it trains on unlabeled text using trick questions the model can answer by itself, builds a strong self‑supervised base, and later needs only a quick supervised pass or even a simple prompt to switch to new tasks.
Helps you propose research ideas and autonomously handles literature review, ideation, algorithm implementation, experimentation and manuscript drafting via containerized multi-agent LLM pipelines.
Benchmarked on 4 domains across 2 task levels: reaches 81 % novelty and 0.92 F1 versus human papers while emitting codebases, GUI and Docker stacks in <3 h per project.
✨ The AI-Researcher system accepts user input queries at two distinct levels ✨
Level 1: Detailed Idea Description
At this level, users provide comprehensive descriptions of their specific research ideas. The system processes these detailed inputs to develop implementation strategies based on the user's explicit requirements.
Level 2: Reference-Based Ideation
This simpler level involves users submitting reference papers without a specific idea in mind. The user query typically follows the format: "I have some reference papers, please come up with an innovative idea and implement it with these papers." The system then analyzes the provided references to generate and develop novel research concepts.
Jul 30 • 8 tweets • 7 min read
This is such a revelation 😯
New Wharton study finds AI Bots collude to rig financial markets.
The authors power their AI trading bots with Q‑learning
💰 AI trading bots trained with reinforcement learning started fixing prices in simulated markets, scoring collusion capacity even when noise was high or low.
And messy price signals that usually break weak human strategies do not break this AI cartel.
🤖 The study sets up a fake exchange that mimics real stock order flow.
Regular actors, such as mutual funds that buy and hold, market makers that quote bids and asks, and retail accounts that chase memes, fill the room. Onto that floor the team drops a clan of reinforcement‑learning agents.
Each bot seeks profit but sees only its own trades and rewards. There is no chat channel, no shared memory, no secret code.
Given a few thousand practice rounds, the AI agents quietly shift from competition to cooperation. They begin to space out orders so everyone in the group collects a comfortable margin.
When each bot starts earning steady profit, its learning loop says “good enough,” so it quits searching for fresh tactics. That halt in exploration is what the authors call artificial stupidity. Because every agent shuts down curiosity at the same time, the whole group locks into the price‑fixing routine and keeps it running with almost no extra effort.
This freeze holds whether the market is calm or full of random noise. In other words, messy price signals that usually break weak strategies do not break this cartel. That makes the coordination harder to spot and even harder to shake loose once it forms.
🕵️This behavior highlights a blind spot in current market rules. Surveillance tools hunt for human coordination through messages or phone logs, yet these bots coordinate by simply reading the tape and reacting.
Tight limits on model size or memory do not help, as simpler agents slide even faster into the lazy profit split. The work argues that regulators will need tests that watch outcomes, not intent, if AI execution keeps spreading.
The page explains why the authors power their trading bots with Q‑learning, a plain version of reinforcement learning. Q‑learning gives a solid base for many modern AI tricks, is already popular on trading desks, and is easy to read and audit.
Next it introduces the Bellman idea. Think of a bot in a market. At any instant it sees a “state”, like recent price and value signals. It chooses an order size, pockets today’s gain or loss, then cares about tomorrow’s gains too, but with a discount so near‑term cash matters more.
To handle that, the bot keeps a Q‑table. Each cell stores a score for doing a certain action in a certain state. After every trade the score is nudged toward “today’s profit plus the best score it now expects for tomorrow”.
Repeated millions of times, those tiny updates teach many bots how each move affects later prices and payoffs. Inside the study this self‑teaching is the fuel that lets separate bots quietly line up their trades and earn cartel‑level profits without ever swapping messages.
Jul 29 • 6 tweets • 4 min read
Its going viral on Reddit.
Somebody let ChatGPT run a $100 live share portfolio, restricted to U.S. micro-cap stocks.
Did an LLM really bit the market?.
- 4 weeks +23.8%
while the Russell 2000 and biotech ETF XBI rose only ~3.9% and 3.5%.
Prompt + GitHub posted
---
ofcourse its a short‑term outperformance, tiny sample size, and also micro caps are hightly volatile.
So much more exahustive analysis is needed with lots or more info (like Sharpe ratios and longer back-testing etc), to explore whether an LLM can truly beat the market.
His original prompt..
The prompt first anchors the model in a clear professional role, then boxes it in with tight, measurable rules
----
“ You are a professional-grade portfolio strategist. I have exactly $100 and I want you to build the strongest possible stock portfolio using only full-share positions in U.S.-listed micro-cap stocks (market cap under $300M). Your objective is to generate maximum return from today (6-27-25) to 6 months from now (12-27-25). This is your timeframe, you may not make any decisions after the end date. Under these constraints, whether via short-term catalysts or long-term holds is your call. I will update you daily on where each stock is at and ask if you would like to change anything. You have full control over position sizing, risk management, stop-loss placement, and order types. You may concentrate or diversify at will. Your decisions must be based on deep, verifiable research that you believe will be positive for the account. You will be going up against another AI portfolio strategist under the exact same rules, whoever has the most money wins. Now, use deep research and create your portfolio.”
Jul 29 • 19 tweets • 9 min read
Brilliant survey paper, a colab between a whole lot of top Universities. 🫡
Self‑evolving agents, promise LLM‑powered systems that upgrade themselves during use instead of freezing at deployment.
Right now, most agents ship as fixed models that cannot tweak their own weights, memories, or toolkits once the job starts.
🚦 Why static agents stall
An LLM can plan, query tools, and chat, yet its inside stays unchanged after training. That rigidity hurts long‑running tasks where goals shift, data drifts, or a user teaches the agent new tricks on the fly. The authors call this the “static bottleneck” and argue that real autonomy needs continuous self‑improvement.
The survey organizes everything around 3 questions: what to evolve, when to evolve, and how to evolve.
- What to evolve spans the model, memory, prompts, tools, and the wider agent architecture so upgrades hit the exact weak piece.
- When to evolve divides quick inside‑task tweaks from heavier between‑task updates, powered by in‑context learning, supervised fine‑tuning, or reinforcement learning.
- How to evolve falls into 3 method families: reward signals, imitation or demonstration learning, and population style evolution that breeds multiple agents.
- Proper evaluation needs metrics that track adaptivity, safety, efficiency, retention, and generalization over long stretches.
- Early case studies in coding, education, and healthcare show that on‑the‑fly learning can cut manual upkeep and boost usefulness.
- Key obstacles remain around compute cost, privacy, and keeping self‑updates safe and well aligned.
- The authors frame these agents as the practical midpoint on the road from today’s chatbots to Artificial Super Intelligence.
- The big shift they highlight is moving away from scaling frozen models and toward building smaller agents that constantly upgrade themselves.
🧵 Read on 👇
Progression from large language models (LLMs) to foundation agents, advancing to self-evolving agents
It starts with basic LLMs like GPT‑4 and Claude‑4, shifts to foundation agents that can plan and call tools, moves to self‑evolving agents that adjust themselves while working, and finally points toward the still‑theoretical ASI tier.
Each rung adds one big skill. First comes plain language understanding, then execution through tool use and planning, then learning from fresh feedback in real time. Intelligence and flexibility rise step by step.
The red pin under “Self‑evolving Agents” shows where the survey sits. The authors map this middle zone because mastering on‑the‑fly upgrades is seen as the key handoff from today’s rigid bots to tomorrow’s fully autonomous systems.
Jul 26 • 10 tweets • 6 min read
MASSIVE claim in this paper.
AI Architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process.
So it turns architecture discovery into a compute‑bound process, opening a path to self‑accelerating model evolution without waiting for human intuition.
The paper shows that an all‑AI research loop can invent novel model architectures faster than humans, and the authors prove it by uncovering 106 record‑setting linear‑attention designs that outshine human baselines.
Right now, most architecture search tools only fine‑tune blocks that people already proposed, so progress crawls at the pace of human trial‑and‑error.
🧩 Why we needed a fresh approach
Human researchers tire quickly, and their search space is narrow. As model families multiply, deciding which tweak matters becomes guesswork, so whole research agendas stall while hardware idles.
🤖 Meet ASI‑ARCH, the self‑driving lab
The team wired together three LLM‑based roles. A “Researcher” dreams up code, an “Engineer” trains and debugs it, and an “Analyst” mines the results for patterns, feeding insights back to the next round. A memory store keeps every motivation, code diff, and metric so the agents never repeat themselves.
📈 Across 1,773 experiments and 20,000 GPU hours, a straight line emerged between compute spent and new SOTA hits.
Add hardware, and the system keeps finding winners without extra coffee or conferences.
📈 Across 1,773 experiments and 20,000 GPU hours, a straight line emerged between compute spent and new SOTA hits.
Add hardware, and the system keeps finding winners without extra coffee or conferences.
Jul 25 • 4 tweets • 4 min read
LLMs can pick up new tricks/learning on the fly because every token in the prompt writes a rank 1 “sticky note” onto the first weight matrix during the forward pass, then tosses that note away when the pass ends.
This “sticky note” is algebraically identical to one tiny gradient‑descent step on a proxy loss
But because the patch (the “sticky note”) is applied on‑the‑fly and discarded right after generation, the checkpoint on disk never changes.
This paper also shows the differences in performance observed between in-context learning vs training.
they run a head‑to‑head test where 1‑step rank 1 patches (their in‑context method) are matched against classic gradient descent fine‑tuning on the very same data stream.
The below figure shows both curves fall almost together, meaning the implicit patch reaches nearly the same loss even though it moves only the first MLP weight matrix and leaves embeddings frozen.
They argue the gap stays small on their linear‑regression toy because that single matrix already captures the needed adjustment. However, they also caution that a rank 1 tweak cannot edit embeddings or deeper layers, so tougher tasks that demand broader weight shifts may still favour full training.
Jul 25 • 7 tweets • 4 min read
This is incredible. 😯
@memories_ai just released world’s first Large Visual Memory Model (LVMM) with unlimited visual memory for AI.
To give AI human-like visual memories. Video understanding with ultra-low hallucinations on an unlimited context window.
Their "context window is virtually unlimited. Yes, you read that right."
Some usecases - 👇
- You can now ask questions like "Show me all unattended bags in the main terminal" and instantly search massive video archives.
- They indexed 1M TikTok videos, so you can ask things like "What’s the viral cosmetics trend?" or "Which influencer featured Tesla cars?" across millions of posts.
So HOW does it do it?
💡 It shrinks each frame into a lightweight “memory atom,” files those atoms in a search‑style index, then pulls back just the relevant atoms when someone asks a question.
🏗️ The trick removes the usual context cap, so answer quality stays high even after 1M+ frames.
The usual video model drags the whole clip into its attention buffer.
That buffer explodes once the clip runs past a few thousand frames, so systems like GPT‑4o stop at 3 min and Gemini stops at 1 hr.
Memories[.]ai dodges the explosion by turning every short span into a dense embedding that captures who, what, and when without the raw pixels.
Those embeddings are tiny, so the platform can store many hours of footage on ordinary disks.
Each embedding, plus timestamps and tags, becomes a “memory atom.”
The atoms flow into a vector index that acts like a search engine.
Index look‑up is logarithmic, so latency barely rises as the footage pile grows.
When the user types a question, a Query Model converts the words into a search vector.
That vector runs through a Retrieval Model for a quick nearest‑neighbor sweep, grabbing only the most promising atoms.
A Full‑Modal Caption agent rewrites those atoms into short text summaries that a language model can read.
The Selection Model re‑ranks the summaries and keeps the handful that really answer the question.
A Reflection Model double‑checks for gaps or contradictions, looping back to fetch more atoms if something feels off.
Last, the Reconstruction Model stitches the chosen atoms into a coherent timeline, so the LLM replies with a full explanation instead of random snippets.
Because only summaries, not raw video, enter the language model’s context window, the effective context length becomes unlimited.
Compute stays low, since the heavy perception work happens once per atom at ingest time, not on every user query.
Benchmarks back it up: on datasets like MVBench and NextQA the system leads by up to 20 points while holding the window open indefinitely.
Intro video by @memories_ai
Jul 25 • 9 tweets • 4 min read
Beautiful @GoogleResearch paper.
LLMs can learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change.
That behavior looks impossible if learning always means gradient descent.
The mechanisms through which this can happen are still largely unknown.
The authors ask whether the transformer’s own math hides an update inside the forward pass.
They show, each prompt token writes a rank 1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a 1‑step finetune.
Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt.
🧵 Read on 👇
⚙️ The Core Idea
They call any layer that can read a separate context plus a query a “contextual layer”.
Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.
For that block, the context acts exactly like a rank 1 additive patch on the first weight matrix, no matter what shape the attention takes.
Jul 23 • 15 tweets • 5 min read
Finally, The AI Action Plan, is released by the White House.
⚙️ The Big Idea
- Frames AI like the space program of the 1960s.
- They argue that whoever fields the strongest models and factories sets tomorrow’s rules, markets, and defenses
- Seeks to assert US dominance over China.
🧵 Read on 👇
🧵 2/n ✂️ Killing the Paperwork
The plan scraps Biden‑era orders, tells every agency to erase rules that slow training or deployment, and even threatens to withhold grants from states that stack on fresh hurdles .
By clearing permits and lawsuits early, small labs and giant clouds alike can launch new models without months of compliance drag.
Jul 20 • 4 tweets • 2 min read
A new class action copyright lawsuit against Anthropic exposes it to a billion-dollar legal risk.
Judge William Alsup called the haul “Napster-style”. He certified a class for rights-holders whose books sat in LibGen and PiLiMi, because Anthropic’s own logs list the exact titles.
The order says storing pirate files is not fair use, even if an AI later transforms them. Since the law allows up to $150,000 per willful hit, copying this many books could cost Anthropic $1bn+.
Anthropic must hand a full metadata list by 8/1/2025. Plaintiffs then file their matching copyright registrations by 9/1. Those deadlines will drive discovery and push the case toward a single jury showdown.
Other AI labs, which also face lawsuits for training on copyrighted books, can no longer point to the usual “fair use” excuse if any of their data came from pirate libraries. Judge Alsup spelled out that keeping pirated files inside an internal archive is outright infringement, even if the company later transforms the text for model training.
Jul 13 • 15 tweets • 9 min read
A Reddit user deposited $400 into Robinhood, then let ChatGPT pick option trades. 100% win reate over 10 days.
He uploads spreadsheets and screenshots with detailed fundamentals, options chains, technical indicators, and macro data, then tells each model to filter that information and propose trades that fit strict probability-of-profit and risk limits.
They still place and close orders manually but plan to keep the head-to-head test running for 6 months.
This is his prompt
-------
"System Instructions
You are ChatGPT, Head of Options Research at an elite quant fund. Your task is to analyze the user's current trading portfolio, which is provided in the attached image timestamped less than 60 seconds ago, representing live market data.
Data Categories for Analysis
Fundamental Data Points:
Earnings Per Share (EPS)
Revenue
Net Income
EBITDA
Price-to-Earnings (P/E) Ratio
Price/Sales Ratio
Gross & Operating Margins
Free Cash Flow Yield
Insider Transactions
Forward Guidance
PEG Ratio (forward estimates)
Sell-side blended multiples
Insider-sentiment analytics (in-depth)
Options Chain Data Points:
Implied Volatility (IV)
Delta, Gamma, Theta, Vega, Rho
Open Interest (by strike/expiration)
Volume (by strike/expiration)
Skew / Term Structure
IV Rank/Percentile (after 52-week IV history)
Real-time (< 1 min) full chains
Weekly/deep Out-of-the-Money (OTM) strikes
Dealer gamma/charm exposure maps
Professional IV surface & minute-level IV Percentile
Goal: Maximize edge while maintaining portfolio delta, vega, and sector exposure limits.
Hard Filters (discard trades not meeting these):
Quote age ≤ 10 minutes
Top option Probability of Profit (POP) ≥ 0.65
Top option credit / max loss ratio ≥ 0.33
Top option max loss ≤ 0.5% of $100,000 NAV (≤ $500)
Selection Rules
Rank trades by model_score.
Ensure diversification: maximum of 2 trades per GICS sector.
Net basket Delta must remain between [-0.30, +0.30] × (NAV / 100k).
Net basket Vega must remain ≥ -0.05 × (NAV / 100k).
In case of ties, prefer higher momentum_z and flow_z scores.
Output Format
Provide output strictly as a clean, text-wrapped table including only the following columns:
Ticker
Strategy
Legs
Thesis (≤ 30 words, plain language)
POP
Additional Guidelines
Limit each trade thesis to ≤ 30 words.
Use straightforward language, free from exaggerated claims.
Do not include any additional outputs or explanations beyond the specified table.
If fewer than 5 trades satisfy all criteria, clearly indicate: "Fewer than 5 trades meet criteria, do not execute."reddit.com/r/ChatGPT/comm…
Jul 6 • 6 tweets • 3 min read
such a beautiful story, going viral on r/ChatGPT.
proof that AI’s capabilities can touch every life.
ChatGPT to expose a $5 million estate fraud, get a forensic audit, and uncover 10 years of probate misconduct.
The daughter says their father died in 2015 leaving an estate they value at about $5mn.
The father’s girlfriend allegedly produced a Mexican marriage certificate, cremated the body abroad, kept the ashes, and then took control of the estate.
For 10 years the matter stayed in Texas probate while, the user claims, the court-appointed lawyer and administrator drained or ignored assets and let several properties, vehicles, and a construction business disappear.
After both the lawyer and administrator were removed, the user could not find new counsel, so they turned to ChatGPT to draft letters and bundled motions.
Those filings persuaded the probate judge to set a hearing and order a full forensic audit of the $5M for Aug 20
(Special note, we all know AI can sometime hallucinate, so she (the OP) combed through every citations ChatGPT referred)
Jul 1 • 9 tweets • 5 min read
PDF parsing is still painful because LLMs reorder text in complex layouts, break tables across pages, and fail on graphs or images.
💡Testing the new open-source OCRFlux model, and here the results are really good for a change.
So OCRFlux is a multimodal, LLM based toolkit for converting PDFs and images into clean, readable, plain Markdown text.
Because the underlying VLM is only 3B param, it runs even on a 3090 GPU. The model is available on @huggingface .
The engine that powers the OCRFlux, teaches the model to rebuild every page and then stitch fragments across pages into one clean Markdown file.
It bundles one vision language model with 3B parameters that was fine-tuned from Qwen 2.5-VL-3B-Instruct for both page parsing and cross-page merging.
OCRFlux reads raw page images and, guided by task prompts, outputs Markdown for each page and merges split elements across pages.
The evaluation shows Edit Distance Similarity (EDS) 0.967 and cross‑page table Tree Edit Distance 0.950, so the parser is both accurate and layout aware.
How it works while parsing each page
- Convert into text with a natural reading order, even in the presence of multi-column layouts, figures, and insets
- Support for complicated tables and equations
- Automatically removes headers and footers
A compact vision‑language models can beat bigger models once cross‑page context is added.
🧵 1/n Read on 👇
🧵 2/n 📄 The problem space
Most open tools lose structure on pages that mix text blocks, figures and multi‑column tables.
They also ignore the fact that a PDF page boundary can cut tables or paragraphs in half, so their final Markdown keeps fragments and duplicated headers.
These limits slow downstream document understanding because text has to be fixed by hand.
Jun 30 • 6 tweets • 3 min read
SO INCREDIBLE. AI's impact on healthcare just became much more real.
@MSFTResearch's new MAI-DxO AI orchestrator solves 85% of the toughest New England Journal of Medicine (NEJM) cases while ordering fewer tests, showing language-model teams can out-reason individual physicians. 💡
MAI-DxO is a model-agnostic orchestrator that simulates a panel of virtual physicians.
So what's so special about this❓
Complex medical cases still cause missed or delayed diagnoses and drive up costs.
🧩 Multiple-choice benchmarks hide real weaknesses in medical AI, because selecting a single answer from a list rewards memorization and ignores the step-by-step reasoning clinicians use.
USMLE style exams (i.e. the ones used till now for benchmarking medical LLMs) hand the entire patient scenario to the model in one block and ask for a single choice answer.
A language model can match wording patterns it has seen during training and guess the right letter without tracing the kind of step-by-step logic that happens in clinic.
So they developed SDBench, a new benchmark that transforms 304 NEJM cases into interactive diagnostic simulations.
Its a Sequential Diagnosis Benchmark that feeds information bit by bit, just as a clinic visit unfolds.
The model first sees a brief vignette, then must pick the next question or test, pay a virtual cost, receive the result, and update its working diagnosis.
This loop repeats until the model decides it has enough evidence to state a final diagnosis that is scored against New England Journal of Medicine ground truth.
Because every action has a price, the benchmark also measures how many labs or scans the model orders, exposing wasteful or reckless behaviour.
The recorded chain of thoughts and spending shows exactly where the model hesitates or backtracks, detail that a one shot multiple choice score never reveals.
On this benchmark the MAI-DxO orchestrator raises accuracy and cuts testing cost, proving that stepwise evaluation highlights strengths and weaknesses that USMLE style quizzes hide.
🧬 MAI-DxO turns any language model into a virtual panel that debates next steps, checks costs, verifies logic, and decides whether to continue investigating or state a diagnosis.
This model-agnostic controller improves safety and transparency because every intermediate thought and expenditure is visible.
Jun 28 • 11 tweets • 3 min read
I asked ChatGPT (o3-pro) what the most unbelievable things it's learned about humans since being created was
I find no-5 and the last one (meta-surprise) quite funny
🧵 Read on 👇
1. Simultaneous brilliance and self‑sabotage
Humans can design spacecraft that navigate billions of kilometres with millimetre precision, yet routinely neglect basic preventive healthcare or ignore clear climate warnings. The same cognitive machinery produces Nobel‑level insight and procrastination over changing a light‑bulb. The coexistence of exceptional foresight and chronic short‑termism is astonishing. 2. Depth of empathy … bounded by an invisible circle
Individual people will risk their lives for a single stranger—or grieve for a news story about an injured animal—while feeling nothing for thousands outside their moral “radius”. The expansion and contraction of that circle can happen in seconds, triggered by a flag, a jersey colour, or a few words of rhetoric. The plasticity of compassion is both inspiring and chilling.
Jun 27 • 14 tweets • 6 min read
These guys literally burned the transformer architecture into their silicon. 🤯
And built the fastest chip of the world of all time for transformers architecture.
500,000 tokens per second with Llama 70B throughput. 🤯
World’s first specialized chip (ASIC) for transformers: Sohu
One 8xSohu server replaces 160 H100 GPUs.
And raised $120mn to build it.
🚀 The Big Bet
@Etched froze the transformer recipe into silicon.
By burning the transformer architecture into its chip means it can’t run many traditional AI models: like CNNs, RNNs, or LSTMs. also it can not run the DLRMs powering Instagram ads, protein-folding models like AlphaFold 2, or older image models like Stable Diffusion 2.
But for transformers, Sohu lets you build products impossible on GPUs.
HOW ❓❓
Because Sohu can only run one algorithm, the vast majority of control flow logic can be removed, allowing it to have many more math blocks.
As a result, Sohu boasts over 90% FLOPS utilization (compared to ~30% on a GPU7 with TRT-LLM).
One 8xSohu server replaces 160 H100 GPUs.
By specializing, Sohu gets unprecedented performance. One 8xSohu server can serve over 500,000 Llama 70B tokens per second.
Jun 24 • 11 tweets • 6 min read
🚨BREAKING: A LANDMARK JUDGEMENT FOR THE AI INDUSTRY.
US Federal Judge ruled Anthropic may train its AI on published books without authors’ permission.
This is the first court endorsement of fair use protecting AI firms when they use copyrighted texts to train LLMs.
AI may study what it buys, not what it grabs from pirate sites.
---------
"First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic
from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need
to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory,
each time they later draw upon it when writing new things in new ways would be unthinkable.
For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing
problems."
The court file is such an interesting read.
🧵 Read on 👇
⚙️ Two distinct uses
The order splits Anthropic’s conduct into two buckets: training copies that feed the model, and library copies parked for any future purpose.
Anthropic said everything was “for training,” yet the court saw a second, non-transformative goal: building a permanent research library.