I can finally discuss something extremely exciting publicly. Jensen just announced NVIDIA AI Foundations:
- Foundation Model as a Service is coming to enterprise, customized for your proprietary data.
- Multimodal from day 1: text LLM is just one part. Bring your images, videos,… twitter.com/i/web/status/1…
Prismer is an example of my team's work on building foundations for multimodal LLMs.
GPT-4's vision API is not publicly available yet, and it will take much longer to become customizable for your enterprise's proprietary data and unique use cases.
The NVIDIA AI Foundations initiative was built by our company's incredible product teams. I play a small part at NVIDIA Research, creating novel algorithms, crafting innovative models, and charting new courses. Very grateful and thrilled to be here at the right time!
10x engineer is a myth. 100x AI-powered engineer is more real than ever. As OpenAI winds down Codex, Microsoft announces GitHub Copilot X. I think it's almost as exciting as GPT-4 itself:
- Copilot Chat: any piece of text database will be "chattable", and codebase is no… twitter.com/i/web/status/1…
Let's talk about the elephant in the room - will LLMs take your job?
OpenAI & UPenn conclude that ~80% of the U.S. workforce could have at least 10% of their work tasks affected, and 19% of workers may see at least 50% of their tasks impacted. GPT-4 *itself* actively helped with this study.
What to make of it?🧵
Let's check out some conclusions first. Occupations most vulnerable to LLM impact: tax preparers, interpreters and translators, survey researchers, proofreaders and copy markers, and
BLOCKCHAIN ENGINEERS (wtf, so specific🤣)
2/
Occupations that are not affected at all: mostly manual labor workers. This is very much consistent with Moravec's paradox: robots that can reliably automate most physical work are still years away.
GPT-4 is HERE. Most important bits you need to know:
- Multimodal: API accepts images as inputs to generate captions & analyses.
- GPT-4 scores in the 90th percentile on the bar exam!!! And in the 99th percentile on the Biology Olympiad with vision! Its reasoning capabilities are far more advanced… twitter.com/i/web/status/1…
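For the API bullet above: the vision endpoint was not public when this thread was written, so here is only a minimal sketch of what an image-plus-text request could look like through a chat-completions-style call. The model name, the `image_url` content type, and the example URL are assumptions, not the official interface.

```python
# Hypothetical sketch: sending an image + a question to a chat-completions-style
# endpoint. The model name and image_url field are placeholders, not confirmed API details.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is unusual about this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```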
*If* GPT-4 is multimodal, we can predict with reasonable confidence what GPT-4 *might* be capable of, given Microsoft’s prior work Kosmos-1:
- Visual IQ test: yes, the ones that humans take!
- OCR-free reading comprehension: input a screenshot, scanned document, street sign, or… twitter.com/i/web/status/1…
Source: heise.de/news/GPT-4-is-….
Quote: “The fact that Microsoft is fine-tuning multimodality with OpenAI should no longer have been a secret since the release of Kosmos-1 at the beginning of March.”
It’s surprising that a high-ranking Microsoft official casually made such a… twitter.com/i/web/status/1…
On Feb. 27, 2023, Microsoft announced Kosmos-1 in the paper "Language Is Not All You Need: Aligning Perception with Language Models."
Here is a sample multimodal dialogue from Visual ChatGPT:
2/
Because there are no trainable parameters, the whole system relies on extensive prompt engineering, chain-of-thought reasoning, and dialogue-history bookkeeping. Here's the overall system design figure:
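To make that concrete, here is a toy sketch of such a prompt-chained loop in Python. The `call_llm` stub and the tool functions are hypothetical stand-ins, not the actual Visual ChatGPT code; the point is that a frozen LLM routes requests to frozen vision tools while the dialogue history is tracked explicitly.

```python
# Toy sketch of a Visual-ChatGPT-style loop (not the actual code): a frozen LLM
# decides which frozen vision tool to run, and everything is held together by
# prompt engineering plus explicit dialogue-history bookkeeping.

TOOL_PROMPT = """You can use these tools on the current image:
caption  - describe the image
detect   - list objects with bounding boxes
edit     - modify the image from a text instruction
Reply with a tool name and its input, or answer directly if no tool is needed."""

def call_llm(system: str, history: list[str], user_msg: str) -> str:
    """Stand-in for a ChatGPT API call; nothing is trained anywhere."""
    return "caption"  # canned decision so the sketch runs end to end

TOOLS = {
    "caption": lambda img, arg: f"a caption of {img}",          # e.g. a captioning model
    "detect":  lambda img, arg: f"objects and boxes in {img}",  # e.g. an object detector
    "edit":    lambda img, arg: f"{img} edited per '{arg}'",    # e.g. a diffusion model
}

def chat_turn(history: list[str], image: str, user_msg: str) -> str:
    # 1) Prompt the LLM with the tool menu + full dialogue history.
    decision = call_llm(TOOL_PROMPT, history, user_msg)
    # 2) If it picked a tool, run the frozen vision expert and feed the result back.
    for name, tool in TOOLS.items():
        if decision.startswith(name):
            result = tool(image, decision[len(name):].strip())
            decision = call_llm(TOOL_PROMPT, history, f"Tool result: {result}")
            break
    # 3) Book-keep the dialogue so later turns can reference earlier results.
    history += [f"User: {user_msg}", f"Assistant: {decision}"]
    return decision

history: list[str] = []
print(chat_turn(history, "photo.png", "What is in this picture?"))
```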
The typical multimodal LLM is trained on massive amounts of image-text data to produce one giant, monolithic model, which can be extremely data-inefficient and computationally expensive. Prismer takes a novel path: why not stand on the shoulders of pre-trained visual experts?
2/
There are many expert computer vision models that parse raw images into semantically meaningful outputs, such as depth, OCR, object bounding boxes, etc. Their weights capture a wealth of visual knowledge and reasoning capabilities. It'd be a big waste not to integrate them.
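As a rough illustration of that idea (not Prismer's actual architecture or code; the module names and shapes below are assumptions), frozen experts emit auxiliary feature tokens that a small trainable fusion layer merges before handing them to a language model:

```python
# Illustrative sketch: outputs of several frozen vision experts (depth, OCR,
# detection, ...) are projected into a shared space and fused. Only these small
# projection/fusion layers would be trained; the experts themselves stay frozen.
import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    def __init__(self, num_experts: int, dim: int = 256):
        super().__init__()
        # One lightweight projection per expert output, plus a shared fusion layer.
        self.proj = nn.ModuleList([nn.LazyLinear(dim) for _ in range(num_experts)])
        self.fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, expert_feats: list[torch.Tensor]) -> torch.Tensor:
        # expert_feats[i]: (batch, tokens_i, feat_dim_i) from a frozen expert
        tokens = torch.cat(
            [p(f) for p, f in zip(self.proj, expert_feats)], dim=1
        )  # (batch, total_tokens, dim)
        return self.fuse(tokens)  # fused tokens for the downstream language model

# Usage: pretend outputs of 3 frozen experts (e.g. depth, OCR, detection embeddings)
experts_out = [torch.randn(2, 16, 64), torch.randn(2, 8, 128), torch.randn(2, 4, 32)]
fusion = ExpertFusion(num_experts=3)
fused = fusion(experts_out)
print(fused.shape)  # torch.Size([2, 28, 256])
```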