Jim Fan
Mar 21 · 5 tweets · 3 min read
I can finally discuss something extremely exciting publicly. Jensen just announced NVIDIA AI Foundations:

- Foundation Model as a Service is coming to enterprise, customized for your proprietary data.
- Multimodal from day 1: text LLM is just one part. Bring your images, videos,…
Prismer is an example of my team's work on building foundations for multimodal LLMs.

GPT-4's vision API is not publicly available yet, and it will take much longer to become customizable for your enterprise's proprietary data and unique use cases.

VIMA ("VIsual Motor Attention") is another example of my team's effort to build foundations for multimodal-prompted, robot LLMs.

Folks, multimodal is the future, both for AI research and enterprise-grade applications. Time to go way beyond strings!

To learn more, watch Jensen's GTC Keynote recording here: nvidia.com/gtc/keynote/
If your bandwidth allows, watch in 4K for the stunning graphics: nvidia.com/gtc/keynote/4k/

Attend GTC with us! I will be speaking too.
The NVIDIA AI Foundations initiative was built by our company's incredible product teams. I play a small part at NVIDIA Research, creating novel algorithms, crafting innovative models, and charting new courses. Very grateful and thrilled to be here at the right time!

More from @DrJimFan

Mar 22
10x engineer is a myth. 100x AI-powered engineer is more real than ever. As OpenAI winds down Codex, Microsoft announces GitHub Copilot X. I think it's almost as exciting as GPT-4 itself:

- Copilot Chat: any piece of text database will be "chattable", and codebase is no…
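One common way to make a codebase, or any text corpus, "chattable" is retrieval-augmented chat: embed the chunks once, retrieve the closest ones for each question, and hand them to the model as context. A minimal sketch of that general pattern, not GitHub's actual implementation; embed_fn and chat_fn are hypothetical stand-ins for whatever embedding and chat-completion APIs you use:

# Sketch of retrieval-augmented chat over a corpus of code/text chunks.
# Not Copilot's actual architecture; embed_fn and chat_fn are hypothetical
# stand-ins for whatever embedding and chat-completion APIs you use.
import numpy as np

def build_index(chunks, embed_fn):
    # Embed every chunk once and keep the vectors next to the raw text.
    return np.array([embed_fn(c) for c in chunks]), list(chunks)

def chat_over_codebase(question, index, embed_fn, chat_fn, k=3):
    vectors, chunks = index
    q = np.asarray(embed_fn(question))
    # Cosine similarity between the question and every chunk; keep the top k.
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-8)
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-k:])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return chat_fn(prompt)

The nice property of this pattern is that the index, not the model, carries the repo-specific knowledge, so the chat model itself never needs fine-tuning.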
Copilot for Pull Requests needs to be enrolled on a per-repo basis: copilot4prs.githubnext.com/login

3/
Mar 20
Let's talk about the elephant in the room - will LLM take your job?

OpenAI & UPenn conclude that ~80% of the U.S. workforce could have at least 10% of their work tasks affected, and ~19% of workers may see at least 50% of their tasks impacted. GPT-4 *itself* actively helps in this study.

What to make of it?🧵
Let's check out some conclusions first. Occupations most vulnerable to LLM impact: tax preparers, interpreters and translators, survey researchers, proofreaders and copy markers, and
BLOCKCHAIN ENGINEERS (wtf, so specific🤣)

2/
Occupations that are not affected at all: mostly manual-labor jobs. This is very much consistent with Moravec's paradox: robotics that can reliably automate most physical work is still years away.

3/
Mar 14
GPT-4 is HERE. Most important bits you need to know:

- Multimodal: API accepts images as inputs to generate captions & analyses.
- GPT-4 scores in the 90th percentile on the bar exam!!! And the 99th percentile (with vision) on the Biology Olympiad! Its reasoning capabilities are far more advanced…
Link to blog: openai.com/product/gpt-4
Research paper: cdn.openai.com/papers/gpt-4.p…
I don't think the API is open to the public yet?

2/
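For a concrete feel of what image input might look like once it does open up, here is a hypothetical sketch of a multimodal chat call. The client usage, model name, and message schema below are assumptions for illustration, not the confirmed public API:

# Hypothetical sketch of an image-input chat call. The model name, message
# schema, and availability are assumptions made for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4-vision",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chart and flag any anomalies."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)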
My GPT-4 prediction tweet 3 days ago aged like a fine wine: 🍷

3/
Mar 10
*If* GPT-4 is multimodal, we can predict with reasonable confidence what GPT-4 *might* be capable of, given Microsoft’s prior work Kosmos-1:

- Visual IQ test: yes, the ones that humans take!
- OCR-free reading comprehension: input a screenshot, scanned document, street sign, or…
Source: heise.de/news/GPT-4-is-….
Quote: “The fact that Microsoft is fine-tuning multimodality with OpenAI should no longer have been a secret since the release of Kosmos-1 at the beginning of March.”
It’s surprising that a high-ranking Microsoft official casually made such a…
On Feb. 27, 2023, Microsoft announced Kosmos-1 in this paper: "Language Is Not All You Need: Aligning Perception with Language Models."

arxiv.org/abs/2302.14045

3/
Mar 9
In the Transformers movies, 9 Decepticons merge to form “Devastator”, a much larger and stronger bot.

This turns out to be a powerful paradigm for multimodal LLM too. Instead of a monolithic Transformer, we can stack many pre-trained experts into one.

My team’s work, Prismer, is…
Here is a sample multimodal dialogue from Visual ChatGPT:

2/
Because there are no trainable parameters, the whole system relies on extensive prompt engineering, chain-of-thought prompting, and dialogue-history bookkeeping. Here's the overall system design figure:

3/
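A minimal sketch of that kind of training-free system: a frozen LLM decides, purely through prompting, which visual tool to call, and both the tool outputs and the running dialogue are threaded back into the next prompt. The llm callable and toy tools here are illustrative stand-ins, not the actual Visual ChatGPT code:

# Sketch of a prompt-driven, training-free tool-routing loop in the spirit of
# Visual ChatGPT. `llm` and the two toy tools are illustrative stand-ins; all
# state lives in a plain dialogue-history string that is re-fed every turn.

TOOLS = {
    "caption": lambda image: "a dog riding a skateboard",    # stand-in captioner
    "detect":  lambda image: "dog: box (52, 40, 310, 290)",  # stand-in detector
}

def run_dialogue(llm, user_turns, image):
    history = ""
    for user_msg in user_turns:
        history += f"\nUser: {user_msg}"
        # The frozen LLM decides, via the prompt alone, whether to call a tool.
        decision = llm(
            f"Tools available: {list(TOOLS)}. Dialogue so far:{history}\n"
            "Reply with one tool name to call, or 'none'."
        ).strip()
        if decision in TOOLS:
            history += f"\n[{decision} output: {TOOLS[decision](image)}]"
        # The final reply is generated from the full, bookkept history.
        reply = llm(f"Dialogue so far:{history}\nAssistant:")
        history += f"\nAssistant: {reply}"
    return history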
Mar 7
After ChatGPT, the future belongs to multimodal LLMs. What’s even better? Open-sourcing.

Announcing Prismer, my team’s latest vision-language AI, empowered by domain-expert models in depth, surface normal, segmentation, etc.

No paywall. No forms. shikun.io/projects/prism…
The typical multimodal LLM is trained on massive amounts of image-text data to produce one giant, monolithic model, which can be extremely data-inefficient and computationally expensive. Prismer takes a novel path: why not stand on the shoulders of pre-trained visual experts?

2/
There are lots of expert computer vision models that parse raw images into semantically meaningful outputs, such as depth, OCR, object bounding boxes, etc. Their weights capture a wealth of visual knowledge and reasoning capabilities. It'd be a big waste not to integrate them.

3/
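A toy sketch of that integration idea: several frozen, pre-trained experts each produce features for an image, and only a small fusion module on top is trained. This is a simplified illustration, not Prismer's actual architecture, which uses a more involved design (a resampler and lightweight adaptors inside a vision-language backbone); every module name below is a placeholder.

# Toy sketch of "stand on the shoulders of pre-trained experts": frozen expert
# models each map an image to features, and only a small fusion head is trained.
# The modules here are placeholders, not Prismer's real components.
import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    def __init__(self, experts, expert_dim=256, hidden_dim=512):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        for p in self.experts.parameters():
            p.requires_grad = False              # experts stay frozen
        self.fuse = nn.Sequential(               # only this part is trained
            nn.Linear(expert_dim * len(experts), hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, image):
        with torch.no_grad():                    # no gradients through experts
            feats = [expert(image) for expert in self.experts]
        return self.fuse(torch.cat(feats, dim=-1))

# Placeholder "experts"; in practice these would be pre-trained depth,
# segmentation, OCR, and detection models projected to a shared feature size.
experts = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256)) for _ in range(3)]
model = ExpertFusion(experts)
fused = model(torch.randn(2, 3, 64, 64))         # -> shape (2, 512)
print(fused.shape)

Freezing the experts is what buys the data efficiency: only the small fusion head ever needs gradient updates.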
