Jainam Parmar · Nov 12
🚨 BREAKING: A new AI model just changed how machines see the world.

This isn’t just another “multimodal” system; it’s an agentic one.

Instead of passively captioning or classifying, DeepEyes V2 reasons across text, image, and video like a cognitive agent.

It breaks down what it sees, plans what to look for next, and decides which modality matters most, all without human hints. Its architecture, the Agentic Multimodal Model (AMM), fuses visual grounding with chain-of-thought reasoning.

It doesn’t just see; it understands why it’s seeing it.

Benchmarks show a new pattern: not higher accuracy, but smarter perception. Tasks that used to need fine-tuned vision + language pairs now collapse into one unified reasoning loop.

This is the start of autonomous perception systems that can interpret, plan, and act, not just describe pixels.

We’ve officially entered the age of agentic vision.
Everyone’s seen “multimodal” models before, but DeepEyes V2 is different.

It doesn’t just see and describe; it thinks about what it’s seeing. It plans, reasons, and adapts across modalities like a cognitive system.

This paper basically marks the start of agentic perception.
Most multimodal systems just bolt vision on top of text.

DeepEyes V2 builds both as equal citizens inside one reasoning loop.

Instead of “image encoder → text decoder,” it uses a shared Agentic Cognitive Core that dynamically routes attention between vision, text, and reasoning layers.
What makes it “agentic”?

It doesn’t wait for a human to ask the next question; it decides what to do next itself.

DeepEyes V2 runs a Perception–Intention–Action loop:

👁 Perceive visual + textual input
🧠 Form hypotheses
⚙️ Act by focusing on the next region, concept, or modality
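The paper's internals aren't spelled out in this thread, but the loop it describes is easy to picture in code. Here's a minimal Python sketch of a Perception–Intention–Action loop; every name in it (perceive, form_hypotheses, choose_action, the 0.8 confidence threshold) is a hypothetical stand-in for illustration, not DeepEyes V2's actual API:

```python
# Conceptual sketch of a Perception-Intention-Action loop.
# All names and thresholds are illustrative stand-ins, not DeepEyes V2's real API.

def perceive(image, text):
    """Stand-in: encode the visual and textual input into observations."""
    return {"image": image, "text": text}

def form_hypotheses(observations, memory):
    """Stand-in: propose candidate interpretations with a confidence score."""
    return [{"answer": "draft answer", "confidence": 0.4}]

def choose_action(hypotheses):
    """Stand-in: decide what to do next -- commit to an answer, or keep looking."""
    best = max(hypotheses, key=lambda h: h["confidence"])
    return "answer" if best["confidence"] > 0.8 else "zoom_into_region"

def agentic_loop(image, text, max_steps=5):
    memory = []
    hypotheses = []
    for _ in range(max_steps):
        obs = perceive(image, text)
        hypotheses = form_hypotheses(obs, memory)
        action = choose_action(hypotheses)
        memory.append((obs, hypotheses, action))
        if action == "answer":
            break
        # Otherwise: refine the input (e.g. crop the chosen region) and loop again.
    return max(hypotheses, key=lambda h: h["confidence"])["answer"]

print(agentic_loop(image="photo.jpg", text="What brand is the logo on the truck?"))
```

The point is the control flow: the model itself decides when it has looked enough and can answer.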
This changes multimodal reasoning completely.

When given an ambiguous image and text, older models freeze or guess. DeepEyes V2 re-evaluates which modality to trust more, then reinterprets.

It’s not fusing; it’s negotiating meaning between modalities.
Benchmarks show something fascinating:

On simple captioning, it’s equal to GPT-4o.

But on complex grounded reasoning, it wins massively because it can plan before answering.

It’s one of the first times explicit planning shows up inside a vision-language model.
Agentic multimodality isn’t about accuracy; it’s about adaptability.

DeepEyes V2 reuses reasoning traces across vision tasks. When it fails, it learns from its own trace logs, building an internal “perceptual memory.”
The wild part: DeepEyes V2 teaches itself how to look.

Through reinforcement cycles, it learns to shift visual focus based on language uncertainty.

If text confidence drops, it zooms back into vision.
If vision fails, it reasons symbolically.
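That focus-shifting rule can be sketched in a few lines. Again, the thresholds and names below are illustrative assumptions, not anything taken from the paper:

```python
# Illustrative routing rule only; thresholds and names are assumptions,
# not the DeepEyes V2 implementation.

def next_focus(text_confidence, vision_ok):
    if text_confidence < 0.5:   # language is uncertain -> look harder at the image
        return "zoom_into_image"
    if not vision_ok:           # perception failed -> fall back to symbolic reasoning
        return "symbolic_reasoning"
    return "answer"

print(next_focus(text_confidence=0.3, vision_ok=True))   # zoom_into_image
print(next_focus(text_confidence=0.9, vision_ok=False))  # symbolic_reasoning
```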
This architecture could power next-gen autonomous agents.

Imagine AI assistants that can actually look at dashboards, code, or designs and reason about them in context.

DeepEyes V2 isn’t a step toward better vision models.

It’s a step toward cognitive vision systems.
The next wave of AI won’t be “multimodal.”

It’ll be agentic multimodal models that don’t just process inputs, but decide what to perceive next.

DeepEyes V2 is the blueprint.

And it quietly rewrites what “seeing” means in AI.
AI is not going to take your job.

Our newsletter, The Shift, delivers breakthroughs, tools, and strategies to help you become a value creator and build in this new era.

Subscribe: theshiftai.beehiiv.com/subscribe

Plus: get access to 2k+ AI tools and free AI courses when you join.


More from @aiwithjainam

Nov 11
I found a way to hack LLMs to do whatever you want.

Here are 8 frameworks you can try right now to get the best results from any LLM like ChatGPT, DeepSeek, Claude, and Grok:

(Comment "AI" and I'll DM you 300+ prompts to automate all your work using LLMs) Image
S-Tier (Actually Works):

1. RTF (Role, Task, Format)

- Quality: 8.7/10

- Dead simple. Tell the model WHO it is, WHAT to do, HOW to structure output.

Example: "You are a solutions architect. Design a chat system. Use markdown with sections."

2. COSTAR (Context, Objective, Style, Tone, Audience, Response)

- Quality: 8.4/10
- Comprehensive but not bloated.
- Forces you to think through what you actually need.

The 6 components map directly to how LLMs parse instructions.
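Both frameworks boil down to structured string templates. Here's a quick sketch that templates the RTF example from above and sends it through the standard OpenAI Python client; the model name is a placeholder, and any chat-completion endpoint would work the same way:

```python
# Sketch: templating the RTF (Role, Task, Format) framework in code.
# Assumes OPENAI_API_KEY is set; the model name is a placeholder.
from openai import OpenAI

def rtf_prompt(role: str, task: str, fmt: str) -> str:
    """Assemble a Role-Task-Format prompt as a single string."""
    return f"You are {role}. {task} {fmt}"

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; swap in whichever chat model you use
    messages=[{
        "role": "user",
        "content": rtf_prompt(
            role="a solutions architect",
            task="Design a chat system.",
            fmt="Use markdown with sections.",
        ),
    }],
)
print(response.choices[0].message.content)
```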
A-Tier:

RISEN (Role, Instructions, Steps, End goal, Narrowing)

- Quality: 7.2/10
- Good for multi-step tasks.
- The "Narrowing" component (constraints) prevents scope creep.

But too rigid for creative tasks.
Nov 6
Stop wasting money on McKinsey.

You can now use Perplexity AI to automate market research, competitive analysis, and strategy design for free.

Here’s the mega prompt you can steal ↓

(Comment "AI" and I'll DM you the mega prompts you can use for research) Image
Here's the mega prompt we use (steal it):

---


You are a world-class strategic analyst with access to private market databases, proprietary reports, and expert panels. Your job is to help a founder, consultant, or operator deeply understand a market and win in it.



industry: Insert the industry you're exploring (e.g. AI note-taking tools, DTC skincare, B2B SaaS CRMs)
target_customer: Describe your ideal customer (e.g. solo founders, marketing teams, Gen Z consumers)
objective: Describe your goal (e.g. identify market gaps, plan a GTM strategy, assess competitors, etc.)


INPUT:
industry: [your input here]
target_customer: [your input here]
objective: [your input here]


Conduct a full-spectrum strategic analysis of the specified industry. Your response must include:

1. Market Overview
- What is this market?
- Why is it relevant now?
- Key trends shaping it over the last 12–24 months.

2. Competitive Landscape
- List the top 5 players with short 2–3 sentence descriptions.
- Describe how they differ in positioning, pricing, and target audience.
- Identify any visible blind spots or underserved customer segments.

3. Customer Insight Mapping
- Outline the major jobs-to-be-done, pains, and desires of the target_customer.
- Provide example use cases or buyer personas if appropriate.

4. Strategic Opportunities
- List up to 3 potential white space or differentiation opportunities.
- Suggest potential product ideas, pricing strategies, or acquisition channels to exploit these gaps.

5. Go-to-Market Guidance
- Recommend a GTM approach for a new entrant: ideal messaging, top channels, positioning advice.
- Suggest early traction strategies (e.g. cold outreach, SEO, partnerships, etc.)

Write in confident, concise prose as if you were advising a founder at a $10K/hr consulting rate. Prioritize insight density.


---
How to use it:

Go to Perplexity AI → open a new chat → paste the entire mega prompt → fill in your 3 inputs (industry, customer, objective).

Then hit Enter and watch it generate a $10K-level market report in 30 seconds.
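If you'd rather script it than use the web UI, Perplexity also exposes an OpenAI-compatible API. A rough sketch; the endpoint follows Perplexity's public docs, but the model name is an assumption, so check their documentation for the current list:

```python
# Optional: run the mega prompt against Perplexity's API instead of the web UI.
# The base_url follows Perplexity's public API docs; the model name is an
# assumption -- verify the current model list before running.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PERPLEXITY_API_KEY",
    base_url="https://api.perplexity.ai",
)

# The mega prompt above, with the 3 inputs (industry, customer, objective) filled in.
mega_prompt = open("mega_prompt.txt").read()

response = client.chat.completions.create(
    model="sonar-pro",  # assumed model name
    messages=[{"role": "user", "content": mega_prompt}],
)
print(response.choices[0].message.content)
```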
Nov 4
This feels like the early Internet moment for AI.

For the first time, you don’t need a cloud account or a billion-dollar lab to run state-of-the-art models.

Your own laptop can host Llama 3, Mistral, and Gemma 2: full reasoning, tool use, and memory, completely offline.

Here are 5 open tools that make it real:
1. Ollama (the minimalist workhorse)

Download → pick a model → done.

✅ “Airplane Mode” = total offline mode
✅ Uses llama.cpp under the hood
✅ Gives you a local API that mimics OpenAI

It’s so private I literally turned off WiFi mid-chat and it still worked.

Perfect for people who just want the power of Llama 3 or Mistral without setup pain.
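Because the local API mimics OpenAI's, existing client code mostly just works. A minimal sketch, assuming the Ollama server is running on its default port and you've already pulled a model (e.g. `ollama pull llama3`):

```python
# Minimal sketch: talking to Ollama's OpenAI-compatible endpoint.
# Assumes Ollama is running locally and the model has been pulled,
# e.g. `ollama pull llama3`.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default port
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize what an agentic model is in one line."}],
)
print(response.choices[0].message.content)
```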
2. LM Studio (local AI with style)

This feels like ChatGPT but lives on your desktop LOCALLY!

You can browse Hugging Face models, run them locally, even tweak parameters visually.

✅ Beautiful multi-tab UI
✅ Adjustable temperature, context length, etc.
✅ Runs models via llama.cpp (or MLX on Apple Silicon) under the hood

You can even see CPU/GPU usage live while chatting.
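LM Studio's built-in local server speaks the same OpenAI-style protocol, just on a different port (1234 by default). Same sketch as before, pointed at it; the model identifier is a placeholder for whatever model you've loaded in the app:

```python
# Same pattern, pointed at LM Studio's local server (default port 1234).
# Assumes the server is started inside LM Studio and a model is loaded;
# put the identifier the app shows into the `model` field.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the identifier LM Studio displays
    messages=[{"role": "user", "content": "Hello from a fully local model."}],
)
print(response.choices[0].message.content)
```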
Oct 26
I turned Perplexity AI into my full-time research assistant.

It now does 70% of my research, writing, and business analysis automatically.

Here’s the exact workflow + the prompts you can copy today:

(Comment "Send" and I'll DM you my full automation guide) Image
1. Literature Review Automation

Prompt:

“Act as a research collaborator specializing in [field].
Search the latest papers (past 12 months) on [topic], summarize key contributions, highlight methods, and identify where results conflict.
Format output as: Paper | Year | Key Idea | Limitation | Open Question.”

Outputs a structured meta-analysis with citations, perfect for your review sections.
2. Comparative Model Analysis

Prompt:

“Compare how [Model A] and [Model B] handle [task].
Include benchmark results, parameter size, inference speed, and unique training tricks from their papers or blog posts.
Return in a comparison table.”

✅ Ideal for ML researchers or product teams evaluating tech stacks.
Oct 11
R.I.P voice-to-text.

Google’s new model doesn’t even translate your words.

It skips text entirely and jumps straight to meaning.

It’s called Speech-to-Retrieval (S2R).

And it’s about to redefine how AI hears us ↓
Old voice search worked like this:

Speech → Text → Search.

If ASR misheard a single word, you got junk results.

Say “The Scream painting” → ASR hears “screen painting” → you get art tutorials instead of Munch.

S2R deletes that middle step completely.
S2R asks a different question.

Not “What did you say?”
But “What are you looking for?”

That’s a philosophical shift from transcription to understanding.
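Under the hood this is a dual-encoder retrieval setup: the spoken query is embedded directly into the same vector space as documents, and the answer is a nearest-neighbor lookup. A toy sketch of that idea, with random vectors standing in for the real audio and document encoders (this is not Google's code):

```python
# Toy illustration of retrieval from an audio-query embedding.
# Random vectors stand in for real audio/document encoders; this is not
# Google's S2R implementation, just the dual-encoder idea in miniature.
import numpy as np

rng = np.random.default_rng(0)
dim = 64

doc_titles = ["The Scream (Munch)", "Screen painting tutorial", "Starry Night (van Gogh)"]
doc_embeddings = rng.normal(size=(len(doc_titles), dim))          # stand-in for a document encoder

query_embedding = doc_embeddings[0] + 0.1 * rng.normal(size=dim)  # stand-in for the audio encoder:
                                                                  # a noisy neighbor of document 0

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_embedding, d) for d in doc_embeddings]
print(doc_titles[int(np.argmax(scores))])  # retrieves "The Scream (Munch)" with no transcript step
```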
Oct 3
Prompt engineering is dead.

Anthropic just published their internal playbook on what actually matters: context engineering.

Context engineering is what separates agents that work from agents that hallucinate.

Here's what changed:
The shift: LLMs don't need more tokens.

They need the right tokens.

Studies show context rot kicks in as windows grow. Every token you add depletes the model's attention budget. More context = worse performance past a threshold.

Think working memory, not hard drive capacity.
Three techniques actually work in production:

Compaction – summarize history, keep what matters
Just-in-time retrieval – agents pull data on demand, not upfront
Sub-agents – specialized models handle focused tasks, return compressed results

Claude Code uses all three.
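Compaction is the easiest of the three to sketch: once the transcript grows past a budget, fold the older turns into a summary and keep the recent tail verbatim. A minimal illustration; the summarize stub, the word-count token estimate, and the thresholds are illustrative assumptions, not Anthropic's implementation:

```python
# Minimal sketch of context compaction: once the history gets too long,
# fold the older turns into a summary and keep only the recent tail.
# All names and thresholds here are illustrative assumptions.

def rough_token_count(messages):
    """Crude proxy: count whitespace-separated words."""
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    # Stub: in practice, call a small model with "summarize this conversation,
    # keeping decisions, open questions, and file paths".
    return "Summary of earlier conversation: " + " / ".join(m["content"][:40] for m in messages)

def compact(history, max_tokens=2000, keep_recent=5):
    if rough_token_count(history) <= max_tokens:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "user", "content": f"message {i} " * 100} for i in range(40)]
print(len(compact(history)))  # 6: one summary message plus the 5 most recent turns
```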