I can finally discuss something extremely exciting publicly. Jensen just announced NVIDIA AI Foundations:
- Foundation Model as a Service is coming to enterprise, customized for your proprietary data.
- Multimodal from day 1: text LLM is just one part. Bring your images, videos,… twitter.com/i/web/status/1…
Prismer is an example of my team's work on building foundations for multimodal LLMs.
GPT-4's vision API is not publicly available yet, and it will take much longer to become customizable for your enterprise's proprietary data and unique use cases.
The NVIDIA AI Foundations initiative was built by our company's incredible product teams. I play a small part at NVIDIA Research, creating novel algorithms, crafting innovative models, and charting new courses. Very grateful and thrilled to be here at the right time!
10x engineer is a myth. 100x AI-powered engineer is more real than ever. As OpenAI winds down Codex, Microsoft announces GitHub Copilot X. I think it's almost as exciting as GPT-4 itself:
- Copilot Chat: any piece of text database will be "chattable", and codebase is no… twitter.com/i/web/status/1…
Let's talk about the elephant in the room - will LLMs take your job?
OpenAI & UPenn conclude that ~80% of the U.S. workforce could have at least 10% of their work tasks affected, and 19% of workers may see at least 50% of their tasks impacted. GPT-4 *itself* actively helped with this study.
What to make of it?🧵
Let's check out some conclusions first. Occupations most vulnerable to LLM impact: tax preparers, interpreters and translators, survey researchers, proofreaders and copy markers, and
BLOCKCHAIN ENGINEERS (wtf, so specific🤣)
2/
Occupations that are not affected at all: mostly manual labor workers. This is very much consistent with Moravec's paradox: robots that can reliably automate most physical work are still years away.
GPT-4 is HERE. Most important bits you need to know:
- Multimodal: API accepts images as inputs to generate captions & analyses.
- GPT-4 scores in the 90th percentile on the bar exam!!! And in the 99th percentile on the Biology Olympiad with vision! Its reasoning capabilities are far more advanced… twitter.com/i/web/status/1…
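For the API bullet above: the vision endpoint was not public when this thread was written, so here is only a minimal sketch of what an image-plus-text request could look like through a chat-completions-style call. The model name, the `image_url` content type, and the example URL are assumptions, not the official interface.

```python
# Hypothetical sketch: sending an image + a question to a chat-completions-style
# endpoint. The model name and image_url field are placeholders, not confirmed API details.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is unusual about this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```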
*If* GPT-4 is multimodal, we can predict with reasonable confidence what GPT-4 *might* be capable of, given Microsoft’s prior work Kosmos-1:
- Visual IQ test: yes, the ones that humans take!
- OCR-free reading comprehension: input a screenshot, scanned document, street sign, or… twitter.com/i/web/status/1…
Source: heise.de/news/GPT-4-is-….
Quote: “The fact that Microsoft is fine-tuning multimodality with OpenAI should no longer have been a secret since the release of Kosmos-1 at the beginning of March.”
It’s surprising that a high-ranking Microsoft official casually made such a… twitter.com/i/web/status/1…
On Feb. 27, 2023, Microsoft announced Kosmos-1 in the paper "Language Is Not All You Need: Aligning Perception with Language Models."
Here is a sample multimodal dialogue from Visual ChatGPT:
2/
Because there are no trainable parameters, the whole system relies on extensive prompt engineering, chain-of-thought reasoning, and dialogue-history bookkeeping. Here's the overall system design figure:
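To make that concrete, here is a toy sketch of such a prompt-chained loop in Python. The `call_llm` stub and the tool functions are hypothetical stand-ins, not the actual Visual ChatGPT code; the point is that a frozen LLM routes requests to frozen vision tools while the dialogue history is tracked explicitly.

```python
# Toy sketch of a Visual-ChatGPT-style loop (not the actual code): a frozen LLM
# decides which frozen vision tool to run, and everything is held together by
# prompt engineering plus explicit dialogue-history bookkeeping.

TOOL_PROMPT = """You can use these tools on the current image:
caption  - describe the image
detect   - list objects with bounding boxes
edit     - modify the image from a text instruction
Reply with a tool name and its input, or answer directly if no tool is needed."""

def call_llm(system: str, history: list[str], user_msg: str) -> str:
    """Stand-in for a ChatGPT API call; nothing is trained anywhere."""
    return "caption"  # canned decision so the sketch runs end to end

TOOLS = {
    "caption": lambda img, arg: f"a caption of {img}",          # e.g. a captioning model
    "detect":  lambda img, arg: f"objects and boxes in {img}",  # e.g. an object detector
    "edit":    lambda img, arg: f"{img} edited per '{arg}'",    # e.g. a diffusion model
}

def chat_turn(history: list[str], image: str, user_msg: str) -> str:
    # 1) Prompt the LLM with the tool menu + full dialogue history.
    decision = call_llm(TOOL_PROMPT, history, user_msg)
    # 2) If it picked a tool, run the frozen vision expert and feed the result back.
    for name, tool in TOOLS.items():
        if decision.startswith(name):
            result = tool(image, decision[len(name):].strip())
            decision = call_llm(TOOL_PROMPT, history, f"Tool result: {result}")
            break
    # 3) Book-keep the dialogue so later turns can reference earlier results.
    history += [f"User: {user_msg}", f"Assistant: {decision}"]
    return decision

history: list[str] = []
print(chat_turn(history, "photo.png", "What is in this picture?"))
```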
The typical multimodal LLM is trained on massive amounts of image-text data to produce one giant, monolithic model, which can be extremely data-inefficient and computationally expensive. Prismer takes a novel path: why not stand on the shoulders of pre-trained visual experts?
2/
There are many expert computer vision models that parse raw images into semantically meaningful outputs, such as depth, OCR, object bounding boxes, etc. Their weights capture a wealth of visual knowledge and reasoning capabilities. It'd be a big waste not to integrate them.
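As a rough illustration of that idea (not Prismer's actual architecture or code; the module names and shapes below are assumptions), frozen experts emit auxiliary feature tokens that a small trainable fusion layer merges before handing them to a language model:

```python
# Illustrative sketch: outputs of several frozen vision experts (depth, OCR,
# detection, ...) are projected into a shared space and fused. Only these small
# projection/fusion layers would be trained; the experts themselves stay frozen.
import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    def __init__(self, num_experts: int, dim: int = 256):
        super().__init__()
        # One lightweight projection per expert output, plus a shared fusion layer.
        self.proj = nn.ModuleList([nn.LazyLinear(dim) for _ in range(num_experts)])
        self.fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, expert_feats: list[torch.Tensor]) -> torch.Tensor:
        # expert_feats[i]: (batch, tokens_i, feat_dim_i) from a frozen expert
        tokens = torch.cat(
            [p(f) for p, f in zip(self.proj, expert_feats)], dim=1
        )  # (batch, total_tokens, dim)
        return self.fuse(tokens)  # fused tokens for the downstream language model

# Usage: pretend outputs of 3 frozen experts (e.g. depth, OCR, detection embeddings)
experts_out = [torch.randn(2, 16, 64), torch.randn(2, 8, 128), torch.randn(2, 4, 32)]
fusion = ExpertFusion(num_experts=3)
fused = fusion(experts_out)
print(fused.shape)  # torch.Size([2, 28, 256])
```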