💼 Engineer.
Also love writing about AI.
🗞️ Subscribe to my free daily newsletter to get top 1% AI developments 👉 https://t.co/rtRTc3bqxV
4 subscribers
Jul 13 • 15 tweets • 9 min read
A Reddit user deposited $400 into Robinhood, then let ChatGPT pick option trades. 100% win rate over 10 days.
He uploads spreadsheets and screenshots with detailed fundamentals, options chains, technical indicators, and macro data, then tells each model to filter that information and propose trades that fit strict probability-of-profit and risk limits.
He still places and closes orders manually but plans to keep the head-to-head test running for 6 months.
This is his prompt
-------
"System Instructions
You are ChatGPT, Head of Options Research at an elite quant fund. Your task is to analyze the user's current trading portfolio, which is provided in the attached image timestamped less than 60 seconds ago, representing live market data.
Data Categories for Analysis
Fundamental Data Points:
Earnings Per Share (EPS)
Revenue
Net Income
EBITDA
Price-to-Earnings (P/E) Ratio
Price/Sales Ratio
Gross & Operating Margins
Free Cash Flow Yield
Insider Transactions
Forward Guidance
PEG Ratio (forward estimates)
Sell-side blended multiples
Insider-sentiment analytics (in-depth)
Options Chain Data Points:
Implied Volatility (IV)
Delta, Gamma, Theta, Vega, Rho
Open Interest (by strike/expiration)
Volume (by strike/expiration)
Skew / Term Structure
IV Rank/Percentile (after 52-week IV history)
Real-time (< 1 min) full chains
Weekly/deep Out-of-the-Money (OTM) strikes
Dealer gamma/charm exposure maps
Professional IV surface & minute-level IV Percentile
Goal: Maximize edge while maintaining portfolio delta, vega, and sector exposure limits.
Hard Filters (discard trades not meeting these):
Quote age ≤ 10 minutes
Top option Probability of Profit (POP) ≥ 0.65
Top option credit / max loss ratio ≥ 0.33
Top option max loss ≤ 0.5% of $100,000 NAV (≤ $500)
Selection Rules
Rank trades by model_score.
Ensure diversification: maximum of 2 trades per GICS sector.
Net basket Delta must remain between [-0.30, +0.30] × (NAV / 100k).
Net basket Vega must remain ≥ -0.05 × (NAV / 100k).
In case of ties, prefer higher momentum_z and flow_z scores.
Output Format
Provide output strictly as a clean, text-wrapped table including only the following columns:
Ticker
Strategy
Legs
Thesis (≤ 30 words, plain language)
POP
Additional Guidelines
Limit each trade thesis to ≤ 30 words.
Use straightforward language, free from exaggerated claims.
Do not include any additional outputs or explanations beyond the specified table.
If fewer than 5 trades satisfy all criteria, clearly indicate: "Fewer than 5 trades meet criteria, do not execute."
reddit.com/r/ChatGPT/comm…
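The hard filters and selection rules in the prompt above can be sketched as a screening function. This is a minimal illustration with hypothetical field names, not the trader's actual tooling:

```python
from dataclasses import dataclass

@dataclass
class Trade:
    ticker: str
    sector: str           # GICS sector
    quote_age_min: float  # minutes since the quote
    pop: float            # probability of profit
    credit: float         # premium received
    max_loss: float       # worst-case loss in dollars
    model_score: float

NAV = 100_000

def passes_hard_filters(t: Trade) -> bool:
    # Discard trades not meeting the prompt's hard filters
    return (
        t.quote_age_min <= 10
        and t.pop >= 0.65
        and t.credit / t.max_loss >= 0.33
        and t.max_loss <= 0.005 * NAV  # <= $500 on $100k NAV
    )

def select_trades(candidates: list[Trade], max_per_sector: int = 2) -> list[Trade]:
    picked, sector_count = [], {}
    # Rank by model_score, then enforce sector diversification
    for t in sorted(filter(passes_hard_filters, candidates),
                    key=lambda t: t.model_score, reverse=True):
        if sector_count.get(t.sector, 0) < max_per_sector:
            picked.append(t)
            sector_count[t.sector] = sector_count.get(t.sector, 0) + 1
    return picked
```

The basket-level Delta/Vega constraints would be a further pass over the selected set; they are omitted here to keep the sketch short.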
Jul 6 • 6 tweets • 3 min read
such a beautiful story, going viral on r/ChatGPT.
proof that AI’s capabilities can touch every life.
She used ChatGPT to expose a $5 million estate fraud, obtain a forensic audit, and uncover 10 years of probate misconduct.
The daughter says their father died in 2015 leaving an estate they value at about $5mn.
The father’s girlfriend allegedly produced a Mexican marriage certificate, cremated the body abroad, kept the ashes, and then took control of the estate.
For 10 years the matter stayed in Texas probate while, the user claims, the court-appointed lawyer and administrator drained or ignored assets and let several properties, vehicles, and a construction business disappear.
After both the lawyer and administrator were removed, the user could not find new counsel, so they turned to ChatGPT to draft letters and bundled motions.
Those filings persuaded the probate judge to set a hearing and order a full forensic audit of the $5M estate for Aug 20.
(Special note: we all know AI can sometimes hallucinate, so she (the OP) combed through every citation ChatGPT referenced.)
Jul 1 • 9 tweets • 5 min read
PDF parsing is still painful because LLMs reorder text in complex layouts, break tables across pages, and fail on graphs or images.
💡Testing the new open-source OCRFlux model, and here the results are really good for a change.
So OCRFlux is a multimodal, LLM-based toolkit for converting PDFs and images into clean, readable, plain Markdown text.
Because the underlying VLM is only 3B parameters, it runs even on a 3090 GPU. The model is available on @huggingface .
The engine that powers OCRFlux teaches the model to rebuild every page and then stitch fragments across pages into one clean Markdown file.
It bundles one vision language model with 3B parameters that was fine-tuned from Qwen 2.5-VL-3B-Instruct for both page parsing and cross-page merging.
OCRFlux reads raw page images and, guided by task prompts, outputs Markdown for each page and merges split elements across pages.
The evaluation shows Edit Distance Similarity (EDS) 0.967 and cross‑page table Tree Edit Distance 0.950, so the parser is both accurate and layout aware.
How it works while parsing each page
- Convert into text with a natural reading order, even in the presence of multi-column layouts, figures, and insets
- Support for complicated tables and equations
- Automatically removes headers and footers
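The cross-page stitching idea can be illustrated with a toy heuristic in Python. This is not OCRFlux's actual code, just a sketch of what merging a table split across a page break involves:

```python
def merge_pages(pages: list[str]) -> str:
    """Stitch per-page Markdown fragments into one document.

    Heuristic sketch: if one page ends mid-table (a '|' row) and the next
    page starts with more '|' rows, drop the page break and any repeated
    header + separator row so the table continues seamlessly.
    """
    merged = pages[0].rstrip("\n")
    for page in pages[1:]:
        page = page.strip("\n")
        prev_lines = merged.splitlines()
        next_lines = page.splitlines()
        if prev_lines and next_lines and \
           prev_lines[-1].startswith("|") and next_lines[0].startswith("|"):
            # Skip a duplicated header + separator row if present
            if len(next_lines) >= 2 and set(next_lines[1]) <= set("|-: "):
                next_lines = next_lines[2:]
            merged += "\n" + "\n".join(next_lines)
        else:
            merged += "\n\n" + page
    return merged
```

The real model does this learned, not rule-based, which is why it also handles paragraphs and equations cut in half.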
A compact vision-language model can beat bigger models once cross-page context is added.
🧵 1/n Read on 👇
🧵 2/n 📄 The problem space
Most open tools lose structure on pages that mix text blocks, figures and multi‑column tables.
They also ignore the fact that a PDF page boundary can cut tables or paragraphs in half, so their final Markdown keeps fragments and duplicated headers.
These limits slow downstream document understanding because text has to be fixed by hand.
Jun 30 • 6 tweets • 3 min read
SO INCREDIBLE. AI's impact on healthcare just became much more real.
@MSFTResearch's new MAI-DxO AI orchestrator solves 85% of the toughest New England Journal of Medicine (NEJM) cases while ordering fewer tests, showing language-model teams can out-reason individual physicians. 💡
MAI-DxO is a model-agnostic orchestrator that simulates a panel of virtual physicians.
So what's so special about this❓
Complex medical cases still cause missed or delayed diagnoses and drive up costs.
🧩 Multiple-choice benchmarks hide real weaknesses in medical AI, because selecting a single answer from a list rewards memorization and ignores the step-by-step reasoning clinicians use.
USMLE style exams (i.e. the ones used till now for benchmarking medical LLMs) hand the entire patient scenario to the model in one block and ask for a single choice answer.
A language model can match wording patterns it has seen during training and guess the right letter without tracing the kind of step-by-step logic that happens in clinic.
So they developed SDBench, a new benchmark that transforms 304 NEJM cases into interactive diagnostic simulations.
It's a Sequential Diagnosis Benchmark that feeds information bit by bit, just as a clinic visit unfolds.
The model first sees a brief vignette, then must pick the next question or test, pay a virtual cost, receive the result, and update its working diagnosis.
This loop repeats until the model decides it has enough evidence to state a final diagnosis that is scored against New England Journal of Medicine ground truth.
Because every action has a price, the benchmark also measures how many labs or scans the model orders, exposing wasteful or reckless behaviour.
The recorded chain of thoughts and spending shows exactly where the model hesitates or backtracks, detail that a one shot multiple choice score never reveals.
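The loop described above can be sketched in Python. Field names and costs here are hypothetical, not SDBench's actual schema:

```python
def run_sequential_diagnosis(model, case, budget=10):
    """Feed the case to the model step by step and track virtual spending,
    mirroring SDBench's loop (field names here are hypothetical)."""
    context = [case["vignette"]]          # start with a brief vignette
    spent = 0
    for _ in range(budget):
        action = model(context)           # pick the next test, or diagnose
        if action["type"] == "diagnose":
            return action["diagnosis"], spent
        spent += case["costs"][action["test"]]   # every action has a price
        context.append(case["results"][action["test"]])
    return None, spent                    # budget exhausted, no diagnosis

# Toy stand-ins: a scripted "model" that orders one lab, then commits.
def toy_model(context):
    if len(context) == 1:
        return {"type": "order", "test": "cbc"}
    return {"type": "diagnose", "diagnosis": "anemia"}

case = {"vignette": "Fatigue and pallor.",
        "costs": {"cbc": 50},
        "results": {"cbc": "Low hemoglobin."}}
```

Scoring the returned diagnosis against NEJM ground truth, and comparing the spend across models, is what separates this from a one-shot quiz.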
On this benchmark the MAI-DxO orchestrator raises accuracy and cuts testing cost, proving that stepwise evaluation highlights strengths and weaknesses that USMLE style quizzes hide.
🧬 MAI-DxO turns any language model into a virtual panel that debates next steps, checks costs, verifies logic, and decides whether to continue investigating or state a diagnosis.
This model-agnostic controller improves safety and transparency because every intermediate thought and expenditure is visible.
Jun 28 • 11 tweets • 3 min read
I asked ChatGPT (o3-pro) about the most unbelievable things it has learned about humans since being created.
I find no. 5 and the last one (the meta-surprise) quite funny
🧵 Read on 👇
1. Simultaneous brilliance and self‑sabotage
Humans can design spacecraft that navigate billions of kilometres with millimetre precision, yet routinely neglect basic preventive healthcare or ignore clear climate warnings. The same cognitive machinery produces Nobel‑level insight and procrastination over changing a light‑bulb. The coexistence of exceptional foresight and chronic short‑termism is astonishing. 2. Depth of empathy … bounded by an invisible circle
Individual people will risk their lives for a single stranger—or grieve for a news story about an injured animal—while feeling nothing for thousands outside their moral “radius”. The expansion and contraction of that circle can happen in seconds, triggered by a flag, a jersey colour, or a few words of rhetoric. The plasticity of compassion is both inspiring and chilling.
Jun 27 • 14 tweets • 6 min read
These guys literally burned the transformer architecture into their silicon. 🤯
And built the world's fastest chip ever for the transformer architecture.
500,000 tokens per second of Llama 70B throughput. 🤯
World’s first specialized chip (ASIC) for transformers: Sohu
One 8xSohu server replaces 160 H100 GPUs.
And raised $120mn to build it.
🚀 The Big Bet
@Etched froze the transformer recipe into silicon.
Burning the transformer architecture into the chip means it can't run many traditional AI models: CNNs, RNNs, or LSTMs. It also can't run the DLRMs powering Instagram ads, protein-folding models like AlphaFold 2, or older image models like Stable Diffusion 2.
But for transformers, Sohu lets you build products impossible on GPUs.
HOW ❓❓
Because Sohu can only run one algorithm, the vast majority of control flow logic can be removed, allowing it to have many more math blocks.
As a result, Sohu boasts over 90% FLOPS utilization (compared to ~30% on a GPU with TRT-LLM).
By specializing, Sohu gets unprecedented performance. One 8xSohu server can serve over 500,000 Llama 70B tokens per second.
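A quick back-of-envelope check of what those claims imply per chip, using only figures quoted in the thread:

```python
# Back-of-envelope arithmetic on Etched's claims (numbers from the thread).
sohu_server_tokens_per_s = 500_000   # Llama 70B tokens/s on one 8xSohu server
h100_replaced = 160                  # H100s one 8xSohu server claims to replace
chips_per_server = 8

# Implied throughput per H100 at parity with the claim
per_h100 = sohu_server_tokens_per_s / h100_replaced
assert per_h100 == 3125.0

# Implied throughput per Sohu chip: a 20x per-chip advantage
per_sohu_chip = sohu_server_tokens_per_s / chips_per_server
assert per_sohu_chip == 62_500.0
assert per_sohu_chip / per_h100 == 20.0
```

So the headline numbers amount to a claimed 20x per-chip speedup over an H100 on this workload.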
Jun 24 • 11 tweets • 6 min read
🚨BREAKING: A LANDMARK JUDGEMENT FOR THE AI INDUSTRY.
US Federal Judge ruled Anthropic may train its AI on published books without authors’ permission.
This is the first court endorsement of fair use protecting AI firms when they use copyrighted texts to train LLMs.
AI may study what it buys, not what it grabs from pirate sites.
---------
"First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic
from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need
to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory,
each time they later draw upon it when writing new things in new ways would be unthinkable.
For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing
problems."
The court file is such an interesting read.
🧵 Read on 👇
⚙️ Two distinct uses
The order splits Anthropic’s conduct into two buckets: training copies that feed the model, and library copies parked for any future purpose.
Anthropic said everything was “for training,” yet the court saw a second, non-transformative goal: building a permanent research library.
Jun 23 • 21 tweets • 11 min read
ChatGPT literally saved this guy’s life after he got lost in the woods.
The group got lost for 5 hours in unmapped woods on an ATV ride; then one guy sent phone GPS coords to ChatGPT every few minutes. ChatGPT replied with clear compass cues, road names, and terrain notes, guiding them back to town unharmed.
A lost exercise hormone, CLCF1, puts old muscles and bones back in business.
Replace missing CLCF1 and the elderly mouse sprints like it is young.
📌 The Core Concepts
Skeletal muscle and bone deteriorate together during aging, partly because old muscle sends out fewer supportive signaling proteins.
The study pinpoints CLCF1, a cytokine usually known for nerve health, as one such messenger whose blood concentration steadily drops from young to old animals and people.
Raising CLCF1, either by exercise or by direct supplementation, reverses muscle weakness and bone loss, showing that a single myokine can coordinate broad musculoskeletal repair.
What CLCF1 actually is
CLCF1 (cardiotrophin-like cytokine factor 1) belongs to the interleukin-6 family of signaling proteins.
It partners with CRLF1 and binds the ciliary neurotrophic factor receptor, triggering downstream STAT pathways in many cell types.
Jun 21 • 11 tweets • 5 min read
Models see the needle yet ignore the hole it left behind.
LLMs spot inserted facts but routinely miss obvious omissions.
Long context models have been getting increasingly good at passing "Needle in a Haystack" tests recently, but what about a problem in the opposite direction?
This paper explores what happens when you give a model some content and then a copy with a portion removed, then ask what changed.
AbsenceBench exposes this blind spot by giving models both the full text and an edited version, then shows that simple placeholder tokens help them notice the gaps.
⚙️ The Core Concepts
AbsenceBench flips the classic Needle-in-a-Haystack test: instead of asking a model to locate an odd insert, it asks for the bits that were deleted. Even top models plunge from near-perfect recall on insertion tests to roughly 40%–70% F1 when asked to list what is gone, with an average 56.9% drop across poetry and code-diff tasks.
The first panel compares two tasks. The classic needle test inserts an extra line and asks the model to point it out. AbsenceBench instead shows the untouched poem beside a version with a hidden gap and asks the model to name the missing line.
The question is identical in form, yet the answer differs: in the needle test the model repeats the inserted text, while in AbsenceBench it must recall what was cut even though no token now marks the spot.
The middle bar chart measures how five leading language models handle both tasks. Their scores stay high when looking for an inserted line but fall sharply when asked to list deletions, proving that omissions are much harder to detect.
The right panel shows that the benchmark still deals with large contexts; it simply shifts focus from spotting a stray piece of straw to noticing the straw that vanished.
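For a program (unlike an LLM working from memory), recovering deletions is a trivial diff. A short Python sketch shows the task is well-posed even though models struggle with it:

```python
import difflib

def find_deleted_lines(original: str, edited: str) -> list[str]:
    """Return lines present in the original but missing from the edited
    copy, which is the question AbsenceBench poses to the model."""
    a = original.splitlines()
    b = edited.splitlines()
    sm = difflib.SequenceMatcher(a=a, b=b)
    deleted = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("delete", "replace"):   # spans removed or overwritten
            deleted.extend(a[i1:i2])
    return deleted

poem = "line one\nline two\nline three\nline four"
gapped = "line one\nline three\nline four"
# A diff recovers the cut exactly; the paper shows LLMs often cannot,
# because no token in the edited text marks where the gap was.
```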
Jun 21 • 9 tweets • 3 min read
This github repo is a goldmine.
3.4K Stars ⭐️ in 4 days.
End-to-end, code-first tutorials covering every layer of production-grade GenAI agents, guiding you from spark to scale with proven patterns and reusable blueprints for real-world launches.
Jun 17 • 12 tweets • 4 min read
It’s a hefty 206-page research paper, and the findings are concerning.
"LLM users consistently underperformed at neural, linguistic, and behavioral levels"
This study finds LLM dependence weakens the writer’s own neural and linguistic fingerprints. 🤔🤔
Using EEG, text mining, and a cross-over session, the authors show that keeping some AI-free practice time protects memory circuits and encourages richer language even when the tool is later reintroduced.
⚙️ The Experimental Setup
Fifty-four Boston-area students wrote SAT-style essays under three conditions: ChatGPT only, Google only, or brain only.
Each person completed three timed sessions with the same condition, then an optional fourth session in the opposite condition.
A 32-channel Enobio headset recorded brain signals throughout, and every keystroke, prompt, and interview answer was archived for analysis.
Jun 16 • 11 tweets • 4 min read
This is really BAD news for LLMs' coding skills. ☹️
The best Frontier LLM models achieve 0% on hard real-life Programming Contest problems, domains where expert humans still excel.
LiveCodeBench Pro is a benchmark composed of problems from Codeforces, ICPC, and IOI ("International Olympiad in Informatics") that are continuously updated to reduce the likelihood of data contamination.
📌 The Gap Targeted
Earlier reports claimed frontier LLMs now top human grandmasters, but a cost-versus-rating plot proves otherwise.
Even the best model, o4-mini-high, sits near 2,100 Elo once tool calls are blocked, far from the 2,700 legend line that marks real grandmasters.
Jun 15 • 21 tweets • 8 min read
Large Language Model agents are vulnerable to prompt injection attacks that hijack tool use and leak data.
The paper proposes six design patterns that restrict where untrusted text can act, giving resistance without crippling usefulness.
⚙️ The Core Concepts
Prompt injection slips malicious text into an agent’s context and rewrites its plan.
Filters, adversarial training, and user approval are brittle because clever wording can still bypass them.
The authors instead isolate untrusted data with structured workflows that block it from gaining control.
🛡️ Action-Selector Pattern
The agent picks one permitted action from a fixed list and never processes tool output.
Because no feedback loop exists, injected text cannot trigger unexpected calls.
Use cases are simple routers such as customer-service macros or database shortcuts.
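A minimal sketch of the Action-Selector pattern, with hypothetical action names:

```python
# Fixed allow-list of permitted actions; the LLM may only name one of them.
ALLOWED_ACTIONS = {
    "reset_password": lambda user: f"Password reset link sent to {user}",
    "check_order":    lambda user: f"Order status fetched for {user}",
    "escalate":       lambda user: f"Ticket escalated for {user}",
}

def action_selector_agent(llm_choice: str, user: str) -> str:
    """Action-Selector pattern: the model's output is treated only as a
    key into the allow-list. Tool output goes straight to the user and is
    never fed back to the model, so injected text in a result cannot
    trigger further calls."""
    if llm_choice not in ALLOWED_ACTIONS:
        return "Action not permitted."
    return ALLOWED_ACTIONS[llm_choice](user)
```

Even if a prompt injection convinces the model to emit something malicious, the worst it can do is pick a different allow-listed action.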
Jun 13 • 11 tweets • 6 min read
Anthropic just dropped a beautiful explanation of how they built a multi-agent research system using multiple Claude AI agents.
A MUST read for anyone building multi-agent systems.
A lead agent plans research steps, spawns specialized subagents to search in parallel, and then gathers and cites results. It covers architecture, prompt design, tool selection, evaluation methods, and production challenges to make AI research reliable and efficient.
Single-agent research assistants stall when queries branch into many directions. Anthropic links one lead Claude with parallel subagents to chase each thread at once, then fuses their findings.
⚙️ The Core Concepts
Research questions rarely follow a straight path, so a fixed pipeline leaves gaps. One lead agent plans the investigation, spawns subagents that roam in parallel, and later condenses their notes into a coherent answer.
🧠 Why Multi-Agent Architecture Helps
Each subagent brings its own context window, so the system can pour in many more tokens than a single model would hold. Anthropic measured that token volume alone explained 80% of success on BrowseComp, and adding subagents pushed performance 90.2% past a lone Claude Opus 4 on internal tasks.
Running agents in parallel also cuts wall-clock time because searches, tool calls, and reasoning steps happen side by side rather than one after another.
@AnthropicAI
🛠️ Architecture Walkthrough
The orchestrator-worker pattern gives the lead agent control while letting specialists act independently. A user query lands with the lead Researcher, which thinks aloud, stores the plan in memory, and distributes focused jobs like list company directors or trace chip shortages.
Subagents call web search or workspace tools, judge results with interleaved thinking, and return concise digests. A citation agent then pins every claim to a source before the answer reaches the user.
Jun 13 • 10 tweets • 6 min read
AI Agents vs. Agentic AI
→ AI Agents react to prompts; Agentic AI initiates and coordinates tasks.
→ Agentic AI includes orchestrators and meta-agents to assign and oversee sub-agents.
🧵1/n
🧠 The Core Concepts
AI Agents and Agentic AI are often confused as interchangeable, but they represent different stages of autonomy and architectural complexity.
AI Agents are single-entity systems driven by large language models (LLMs). They are designed for task-specific execution: retrieving data, calling APIs, automating customer support, filtering emails, or summarizing documents. These agents use tools and perform reasoning through prompt chaining, but operate in isolation and react only when prompted.
Agentic AI refers to systems composed of multiple interacting agents, each responsible for a sub-task. These systems include orchestration, memory sharing, role assignments, and coordination.
Instead of one model handling everything, there are planners, retrievers, and evaluators communicating to achieve a shared goal. They exhibit persistent memory, adaptive planning, and multi-agent collaboration.
🏗️ Architectural Breakdown
AI Agents: Structured as a single model using LLMs. Equipped with external tools. Operates through a cycle of perception, reasoning, and action. Executes one task at a time with limited context continuity.
Agentic AI: Uses multiple LLM-driven agents. Supports task decomposition, role-based orchestration, and contextual memory sharing. Agents communicate via queues or buffers and learn from feedback across sessions.
🔧 How AI Agents Work
An AI Agent typically receives a user prompt, chooses the correct tool (e.g., search engine, database query), gets results, and then generates an output. It loops this with internal reasoning until the task is completed. Frameworks like LangChain and AutoGPT are built on this structure.
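That loop can be sketched in a few lines of Python, with a scripted toy LLM and a canned tool standing in for the real ones:

```python
def ai_agent(prompt, llm, tools, max_steps=5):
    """Single AI Agent loop: perceive -> reason -> act, one tool at a
    time, until the model signals completion."""
    context = [prompt]
    for _ in range(max_steps):
        step = llm(context)                  # model decides the next step
        if step["action"] == "finish":
            return step["answer"]
        result = tools[step["action"]](step["input"])  # call exactly one tool
        context.append(result)               # feed the result back
    return None

# Toy stand-ins: a scripted "LLM" and a canned search tool.
def toy_llm(context):
    if len(context) == 1:
        return {"action": "search", "input": "capital of France"}
    return {"action": "finish", "answer": context[-1]}

tools = {"search": lambda query: "Paris"}
```

Frameworks like LangChain wrap exactly this cycle in more machinery, but the control flow is the same.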
🤖 What Agentic AI Adds
Agentic AI introduces:
- Goal decomposition: breaking tasks into subtasks handled by specialized agents.
- Orchestration: a meta-agent (like a CEO) delegates and integrates.
- Memory systems: episodic, semantic, or vector-based for long-term context.
- Dynamic adaptation: agents can replan or reassign tasks based on outcomes.
Examples include CrewAI or AutoGen pipelines, where agents draft research papers or coordinate robots.
🧵2/n
🔄 Mechanisms of Autonomy
A single AI Agent begins work when a user or scheduler fires a prompt, selects one tool at a time, and stops when the task flag is cleared.
Agentic AI starts from a high-level objective, decomposes it through a planner agent, routes subtasks to specialist agents, and keeps cycling until success criteria are met.
Shared memory lets each agent read what others learned, while structured messages prevent conflicts and allow recovery when one path stalls.
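The Agentic AI cycle above can be sketched the same way, with toy stand-ins for the planner and specialist agents:

```python
def agentic_system(objective, planner, specialists, memory):
    """Agentic AI loop: a planner decomposes the objective, an
    orchestrator routes each subtask to a specialist agent, and shared
    memory lets later agents read what earlier ones learned."""
    for task in planner(objective):                 # goal decomposition
        agent = specialists[task["role"]]           # role-based routing
        memory[task["name"]] = agent(task, memory)  # write to shared memory
    return memory

# Toy stand-ins for a planner and two specialist agents.
planner = lambda objective: [{"role": "research", "name": "facts"},
                             {"role": "write", "name": "draft"}]
specialists = {
    "research": lambda task, mem: "3 key facts",
    "write":    lambda task, mem: f"summary of {mem['facts']}",
}
```

Real systems like CrewAI or AutoGen add message queues, retries, and replanning on top, but the decompose-route-share skeleton is the same.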
Jun 12 • 5 tweets • 4 min read
A follow-up study on Apple's "Illusion of Thinking" Paper is published now.
Shows the same models succeed once the format lets them give compressed answers, proving the earlier collapse was a measurement artifact.
Token limits, not logic, froze the models.
Collapse vanished once the puzzles fit the context window.
So models failed the rubric, not the reasoning.
⚙️ The Core Concepts
Large Reasoning Models add chain-of-thought tokens and self-checks on top of standard language models. The Illusion of Thinking paper pushed them through four controlled puzzles, steadily raising complexity to track how accuracy and token use scale. The authors saw accuracy plunge to zero and reasoned that thinking itself had hit a hard limit.
📊 Puzzle-Driven Evaluation
Tower of Hanoi forced models to print every move; River Crossing demanded safe boat trips under strict capacity. Because a solution for forty-plus moves already eats thousands of tokens, the move-by-move format made token budgets explode long before reasoning broke.
🔎 Why Collapse Appeared
The comment paper pinpoints three test artifacts: token budgets were exceeded, evaluation scripts flagged deliberate truncation as failure, and some River Crossing instances were mathematically unsolvable yet still graded. Together these artifacts masqueraded as cognitive limits.
✅ Fixing the Test
When researchers asked the same models to output a compact Lua function that generates the Hanoi solution, models solved fifteen-disk cases in under five thousand tokens with high accuracy, overturning the zero-score narrative.
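The researchers asked for a compact Lua function; the same idea in Python (used here for consistency) shows why a generator sidesteps the token budget:

```python
def hanoi_moves(n: int, src="A", aux="B", dst="C"):
    """Generate the optimal Tower of Hanoi move list for n disks.
    A few lines of code stand in for thousands of move-by-move tokens."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)   # move n-1 disks out of the way
            + [(src, dst)]                      # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst))  # stack the rest on top

# The optimal solution has 2^n - 1 moves, so a 15-disk instance needs
# 32,767 moves -- far beyond what a token-by-token transcript allows.
assert len(hanoi_moves(15)) == 2**15 - 1
```

Emitting this program is a compressed but complete answer; forcing the model to print every move is what blew past the token budget.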
Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
A 340 page huge report on AI trends - released by @bondcap
Some wild findings from this report.
🧵1/n
🧵2/n
Meta’s Llama Downloads Exploded 3.4× in Eight Months.
an unprecedented developer adoption curve for any open-source LLM.
bondcap.com/reports/tai
May 9 • 11 tweets • 6 min read
🚨 BREAKING: The first-ever agentic browser is here — and it's shockingly good.
Just tried @FellouAI, an AI browser that doesn't assist you with browsing; it does the browsing for you.
It's like Chrome but with a brain—AI agents handle deep research and workflows solo.
Handles several projects in parallel.
A top-tier AI intern. It takes care of all the dirty and tedious work so you don't have to, and it's 100% free.
1️⃣Fellou’s not just another browser—it's an Agentic assistant that acts for you.
2️⃣It handles real tasks autonomously: research, cross-platform flows, and full automation.
3️⃣ Past browsing. Into real action.
Fellou can automatically plan tasks, invoke tools, and execute actions to coordinate operations across multiple web interfaces, enabling various in-browser tasks. These include shopping, scheduling meetings, sending emails, and posting tweets based on webpage content.
It’s the first Agentic Browser — with deep research, tab-level collaboration, and seamless automation.
Deep Search acts like a smart intern: spins up five shadow browsers, digs across web and private platforms, and compiles richer insights fast. Highlights gaps and surfaces info you missed. Runs in parallel, won’t slow anything down.
Automated workflows: Replaces manual clicking with invisible ops across pages. Reduces drag, frees up hours.
Automation-aware browsing: Ask the page questions, reuse content in your drafts.
Act on private sites: Top security and stability with your own login, device, and no password leaks.
Virtual workspace for Agent: Executing tasks in a shadow window, without disrupting your workflow.
Generate the report you need: Easily create and edit reports through simple, intuitive interactions.
May 7 • 4 tweets • 2 min read
Wow.. Now you can transcribe 60 minutes of audio in just 1 second with a completely open-sourced model 🤯
@nvidia just open-sourced Parakeet TDT 0.6B V2, a 600M parameter automatic speech recognition (ASR) model that tops the @huggingface Open-ASR leaderboard with RTFx 3380
It's open-sourced under CC-BY-4.0, ready for commercial use.
⚙️ The Details
→ Built on FastConformer encoder + TDT decoder, the model handles up to 24-minute audio chunks with full attention and outputs with punctuation, capitalization, and accurate word/char/segment timestamps.
→ It achieves RTFx 3380 at batch size 128 on the Open ASR leaderboard, but performance varies with audio duration and batch size.
→ Trained using 150K steps on 128 A100 GPUs, then fine-tuned on 500 hours of high-quality human-transcribed English data.
→ Total training data spans 120K hours, combining human-labeled and pseudo-labeled sources, including LibriSpeech, Fisher, YTC, YODAS, and more.
→ Available via NVIDIA NeMo, optimized for GPU inference, and installable via pip install -U nemo_toolkit['asr'].
→ Compatible with Linux, runs on Ampere, Blackwell, Hopper, Volta GPU architectures, requiring minimum 2GB RAM.
→ Granary dataset used for training will be made public post Interspeech 2025.
How to Use this Model:
To train, fine-tune, or play with the model you will need to install NVIDIA NeMo. It's recommended that you install it after you've installed the latest PyTorch version.
Mar 10 • 13 tweets • 4 min read
Finally got access to @ManusAI_HQ and calling it a "Deepseek moment" is incorrect.
It's far more powerful. This is the world's top AI-driven computer.
Think Deep Research + Claude + OpenAI Operator… all on steroids.
Within the next 1 year
12 wild examples 🧵1/n
🧵2/n
Tesla FSD gets you there, Manus AI makes sure you have something to say.