China is on 🔥 ByteDance drops another banger AI paper!
OmniHuman-1 can generate realistic human videos at any aspect ratio and body proportion from just a single image and an audio clip. This is the best I have seen so far.
10 incredible examples and the research paper link 👇
10. Jensen Huang rapping is not something you see often
If you like this post, follow me for AI news @ai_for_success and join my newsletter "AI Compass" for free to get all the latest AI news in your inbox. aicompass.beehiiv.com
Most sites you actually want to build on don't have APIs.
Mino turns any website into structured data. Send a URL and a goal, get JSON back.
I've been testing this for weeks. It's genuinely powerful - handles logins, dynamic content, multi-step flows. Works on sites that'll never build developer tools.
Easy to use via API. Built-in stealth mode bypasses anti-bot protections.
Production infrastructure - processes millions of operations monthly for pricing intelligence and competitive research.
More use cases below 👇
1/6
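The "URL + goal in, JSON out" flow above can be sketched roughly like this. The endpoint, field names, and auth header are assumptions for illustration, not Mino's documented API; check Mino's own docs for the real interface.

```python
# Hypothetical sketch of the "send a URL and a goal, get JSON back" pattern.
# The endpoint, payload fields, and auth scheme below are assumptions.
import json
import urllib.request

MINO_ENDPOINT = "https://api.example.com/v1/extract"  # placeholder URL

def build_request(url: str, goal: str, api_key: str) -> urllib.request.Request:
    """Package a target URL and a natural-language goal as a JSON POST."""
    payload = json.dumps({"url": url, "goal": goal}).encode()
    return urllib.request.Request(
        MINO_ENDPOINT,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_request(
    "https://example.com/pricing",
    "Extract every plan name and its monthly price",
    "sk-demo",
)
print(json.loads(req.data)["goal"])  # prints the goal that was packaged
```

The response would come back as structured JSON matching the stated goal; parsing it is the same `json.loads` call in reverse.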
2/6 Most browser agents make a model call for every action. Look at screenshot → reason → click → look again. Expensive and slow.
Mino learns workflows once with AI, then executes deterministically.
First run: AI figures out the site
Next runs: Milliseconds, code-level precision
85-95% accuracy, 10-30 seconds per task, pennies per run.
The same infrastructure Google, DoorDash, and many others use in production.
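The learn-once, replay-deterministically idea can be sketched as a simple cache around an expensive planning step. `plan_with_ai` below is a stand-in for the model-driven first run, not Mino's actual internals.

```python
# Minimal sketch of "learn workflows once with AI, then execute
# deterministically": the first run pays for AI planning, later runs
# replay the cached steps with no model call. All names are illustrative.
workflow_cache: dict[str, list[str]] = {}

def plan_with_ai(site: str) -> list[str]:
    # Placeholder for the slow, model-in-the-loop exploration of a site.
    return [f"open {site}", "click #search", "read .results"]

def run_workflow(site: str) -> list[str]:
    if site not in workflow_cache:       # first run: AI figures out the site
        workflow_cache[site] = plan_with_ai(site)
    return workflow_cache[site]          # later runs: cached, millisecond replay

run_workflow("example.com")   # slow path, populates the cache
run_workflow("example.com")   # fast path, no model call
```

The cached plan is what gives the "code-level precision" on repeat runs: the same steps execute every time, with the model only consulted when a site is new.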
3/6
Built a tool that checks 10 urgent care clinics simultaneously in 30 seconds.
Your kid has a fever at 8 PM. Instead of 15 phone calls:
→ Pass clinic URLs + ZIP code
→ Navigates different booking systems in parallel
→ Returns first available slots with booking links
Every clinic has different flows. Mino handles all of it, returns clean JSON.
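The fan-out pattern described above (query every clinic at once, keep whoever has a slot) looks roughly like this. `check_clinic` is a stub standing in for a real Mino call, and the clinic URLs and times are invented.

```python
# Sketch of checking many clinics in parallel. The stub below fakes the
# "navigate this clinic's booking flow, return clean JSON" step; a real
# version would call the scraping API instead.
from concurrent.futures import ThreadPoolExecutor

DEMO_SLOTS = {
    "https://clinic-a.example": "8:40 PM",
    "https://clinic-b.example": None,        # no openings tonight
    "https://clinic-c.example": "9:15 PM",
}

def check_clinic(url: str) -> dict:
    # Stand-in for one clinic's booking flow; returns clean JSON-like data.
    return {"clinic": url, "first_slot": DEMO_SLOTS.get(url)}

# Fan out: each clinic is checked concurrently, results keep input order.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(check_clinic, DEMO_SLOTS))

available = [r for r in results if r["first_slot"]]
```

Because every clinic check runs in its own worker, total wall time is roughly the slowest single clinic, not the sum of all of them.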
Official: Google DeepMind has launched Nano Banana Pro, aka Gemini 3 Pro Image. I was one of the early testers. Thank you, Google DeepMind team.
Some major improvements:
- 1K, 2K, and 4K image generation
- More accurate, legible text in multiple languages
- Better consistency across concepts, characters, and styles
- Precise localized editing for any part of an image
I asked it to explain the Transformer architecture; this is incredible stuff.
Some prompts and image examples from my early testing 👇
2/7 World knowledge and reasoning
One of the biggest upgrades is world knowledge and reasoning.
Nano Banana Pro can turn
- text into context rich infographics and diagrams
- real-world data via Google Search into visual snapshots like recipes, weather, sports, and more
Great for educational content, dashboards, explainers, and product mockups.
I asked it to research online and create an image with all the details of the Gemini 3.0 Pro launch.
Prompt: Use live online search to gather the latest accurate information about the launch of Google DeepMind’s Gemini 3.0 Pro from official Google/DeepMind sources and major tech news sites, then synthesize the confirmed facts (launch date, key features, capabilities, improvements vs previous versions, availability, and main use cases) into a single clean, modern infographic image with a clear title like “Gemini 3.0 Pro – Launch Overview,” short readable text (no long paragraphs), simple icons, and 3–6 sections or panels that visually highlight the main points; keep the design professional and balanced, make all text sharp and legible, and avoid adding any details that are not supported by your search results.
3/7 Text and multilingual capabilities
Text inside images is much better now.
Nano Banana Pro supports
- sharper, more readable text in posters, UI mocks, and ads
- a wider range of fonts, textures, and calligraphy
- multilingual text generation and localization so you can scale visuals across languages and markets
Create a product advertisement showcasing the same wireless headphones in 4 different language versions displayed as a 2x2 grid layout; top-left in English with headline "Sound Without Limits," top-right in Spanish "Sonido Sin Límites," bottom-left in Japanese "限界のないサウンド," and bottom-right in Arabic "صوت بلا حدود"; each version should have identical layout and design (sleek black headphones on gradient background, product name "AuraSound Pro," price, and "Buy Now" button in respective language), use appropriate native fonts for each language (Latin, Japanese characters, Arabic script), ensure all text is crystal clear and professionally typeset with correct grammar and cultural adaptation, maintain consistent branding and visual hierarchy across all four versions, and deliver at 4K resolution to demonstrate true multilingual localization capabilities.
ElevenLabs just launched Scribe v2 Realtime - their next-gen Speech to Text model.
Current STT models force you to choose: fast but inaccurate, accurate but slow, or both but expensive at scale.
Scribe v2 Realtime breaks this tradeoff:
> Ultra-low latency – median latency of 150ms with partial transcriptions
> High accuracy – 93.5% across 30 EU & Asian languages with robust accent handling
> Low cost – optimized for production workloads
Built for live agents, meetings, and conversational AI that needs to work in the real world.
More details 👇
1/5
2/5
Scribe v2 Realtime delivers ultra-low latency STT with 150ms median latency and partial transcriptions in milliseconds.
- Streaming support - send audio chunks, get real-time transcripts
- Voice Activity Detection - automatic segmentation based on silence
- Manual commit control - you decide when to finalize transcript segments
- Multiple audio formats - PCM (8kHz–48kHz) and µ-law encoding
- Speaker diarization - available via manual commit
- Enterprise compliance - SOC 2, PCI, HIPAA, EU data residency ready
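The chunked-streaming pattern behind those features can be sketched generically as below. The transcriber here is a stub, not the ElevenLabs SDK; see their docs for the real client and message format.

```python
# Generic sketch of streaming STT: send fixed-size PCM chunks, collect a
# partial transcript per chunk, then "commit" the last hypothesis as final.
# Chunk size and sample rate are illustrative choices, not SDK defaults.

CHUNK_MS = 100          # send audio in 100 ms chunks
SAMPLE_RATE = 16_000    # 16 kHz mono, within the 8kHz-48kHz PCM range above
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000   # 16-bit samples

def stream_transcribe(audio: bytes, transcribe_chunk) -> list[str]:
    """Feed audio chunk by chunk, collecting partial transcripts."""
    partials = []
    for start in range(0, len(audio), BYTES_PER_CHUNK):
        chunk = audio[start:start + BYTES_PER_CHUNK]
        partials.append(transcribe_chunk(chunk))   # partial hypothesis
    return partials

# Stub transcriber: pretend each chunk refines the running hypothesis.
fake_words = iter(["hello", "hello world"])
partials = stream_transcribe(
    b"\x00" * (2 * BYTES_PER_CHUNK), lambda c: next(fake_words)
)
final = partials[-1]   # manual commit: finalize the last hypothesis
```

The "manual commit control" feature maps onto the last line: instead of the service deciding when a segment is done, the client chooses when to lock in a final transcript.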
3/5
Traditional STT models break conversational flow with delays, misinterpret accents/noisy audio, and cost too much for real-time use.
Scribe v2 Realtime dominates the competition:
- 93.5% accuracy across 30 EU & Asian languages
- Beats Gemini 2.5 Flash (91.4%) with much better latency
- Outperforms GPT-4o mini Transcribe (90.7%) with superior speed
- Crushes Deepgram Nova 3 (85%) - 73% fewer errors than Nova 2, 55% fewer than Nova 3
- Even beats Deepgram Enhanced - their most expensive model with fewer language options
Nova 2 produces 104% more errors, Nova 3 produces 129% more errors than Scribe v2 Realtime.
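"X% more errors" and "Y% fewer errors" are the same ratio viewed from opposite sides: if a competitor makes m% more errors, the other model makes 1 − 1/(1 + m/100) fewer. A quick check of the Nova 3 figures quoted above:

```python
# Convert "m% more errors" into the equivalent "fewer errors" fraction:
# if the competitor's error count is (1 + m/100) times ours, ours is
# 1 - 1/(1 + m/100) of theirs, i.e. that fraction fewer.
def fewer_from_more(more_pct: float) -> float:
    """Equivalent 'fewer errors' fraction for 'more_pct% more errors'."""
    return 1 - 1 / (1 + more_pct / 100)

nova3_fewer = fewer_from_more(129)   # Nova 3: "129% more errors"
print(f"{nova3_fewer:.0%}")          # prints 56%, matching "55% fewer"
```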
Google's new video AI model Veo 3 has native audio generation.
You can now generate videos with sound effects, background noise, and even dialogue with just one prompt
🚨 SkyReels just launched! The world’s first open-source video generation platform supporting unlimited duration 🔥
All-in-one creator toolkit:
- Consistent high-quality video (LoRA ready)
- Fast gen, amazing output.
- Amazing facial expressions.
Plus a text-to-film agent handles everything: script, characters, storyboards, full AV generation, auto-edit. It's wild!
Step by step tutorial 👇
1/7 Scripting
Just type your idea or upload a story; it converts everything into a professional screenplay, then automatically designs characters and storyboards based on your narrative.
2/7 Character Design & Character Editing
The AI analyzes your story's tone and emotion to create matching characters. You can tweak their designs or upload your own, plus it generates unique voices for each character.
Every aspect of a character is customizable, from physical appearance and clothing to hairstyles.