Jeff Dean (@🏡)
Feb 15, 2024 · 17 tweets
Gemini 1.5 Pro - A highly capable multimodal model with a 10M token context length

Today we are releasing the first demonstrations of the capabilities of the Gemini 1.5 series, with the Gemini 1.5 Pro model. One of the key differentiators of this model is its incredibly long context capabilities, supporting millions of tokens of multimodal input. The multimodal capabilities of the model mean you can interact in sophisticated ways with entire books, very long document collections, codebases of hundreds of thousands of lines across hundreds of files, full movies, entire podcast series, and more.

Gemini 1.5 was built by an amazing team of people from @GoogleDeepMind, @GoogleResearch, and elsewhere at @Google. @OriolVinyalsML (my co-technical lead for the project) and I are incredibly proud of the whole team, and we’re so excited to be sharing this work and what long context and in-context learning can mean for you today!

There’s lots of material about this, some of which is linked below.

Main blog post: blog.google/technology/ai/…

Technical report:
“Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context”
goo.gle/GeminiV1-5

Videos of interactions with the model that highlight its long context abilities:
Understanding the three.js codebase:
Analyzing a 45-minute Buster Keaton movie:
Apollo 11 transcript interaction:

Starting today, we’re offering a limited preview of 1.5 Pro to developers and enterprise customers via AI Studio and Vertex AI. Read more about this on these blogs:
Google for Developers blog: developers.googleblog.com/2024/02/gemini…

Google Cloud blog: cloud.google.com/blog/products/…


We’ll also introduce 1.5 Pro with a standard 128,000-token context window when the model is ready for a wider release. Coming soon, we plan to introduce pricing tiers that start at the standard 128,000-token context window and scale up to 1 million tokens, as we improve the model.

Early testers can try the 1 million token context window at no cost during the testing period. We’re excited to see what developers’ creativity unlocks with a very long context window.

Let me walk you through the capabilities of the model and what I’m excited about!
Needle in a Haystack Tests Out to 10M Tokens

First, let’s take a quick glance at a needle-in-a-haystack test across many different modalities to exercise Gemini 1.5 Pro’s ability to retrieve information from its very long context. In these tests, green is good and red is not, and the results are almost entirely green (>99.7% recall), even out to 10M tokens. Great! A bit more on needle-in-a-haystack tests later in the thread.
Analyzing and understanding complex code bases

The model’s ability to quickly ingest large code bases and answer complex questions is pretty remarkable. three.js is a 3D JavaScript library of about 100,000 lines of code, examples, documentation, etc. (threejs.org).

With this codebase in context, the system can help the user understand the code, and can make modifications to complex demonstrations based on high-level human specifications (“show me some code to add a slider to control the speed of the animation. use that kind of GUI the other demos have”), or pinpoint the exact part of the codebase to change the height of some generated terrain in a different demo.

Watch this video!
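To make the workflow concrete, here is a minimal sketch of packing a repository into a single long-context prompt. The file-walking helper and the `ask_model` placeholder are assumptions for illustration, not the actual Gemini API; the point is simply that the whole codebase travels in the prompt rather than through retrieval.

```python
# A hypothetical sketch: concatenate a repo into one prompt for a
# long-context model. `ask_model` stands in for whatever client you use.
from pathlib import Path

def pack_repo(root: str, exts=(".js", ".md", ".html")) -> str:
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            # Label each file so the model can cite exact locations.
            parts.append(f"=== {path} ===\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

prompt = (pack_repo("three.js") +
          "\n\nShow me some code to add a slider to control the speed "
          "of the animation, in the style of the other demos.")
# response = ask_model(prompt)   # hypothetical model call
```

At roughly 100,000 lines and a handful of tokens per line, a codebase of this size fits comfortably inside a 1M-token window.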
Navigating large and unfamiliar code bases

As another code-related example, imagine you’re unfamiliar with a large codebase and want the model to help you understand the code or find where particular functionality is implemented. In this example the model is able to ingest the entire 116-file JAX code base (746k tokens) and help the user identify the specific spot in the code that implements the backward pass for automatic differentiation.

It’s easy to see how the long context capabilities can be invaluable when diving into an unfamiliar code base or working with one that you use every day. Many Gemini team members have been finding it very useful to use Gemini 1.5 Pro’s long context capabilities on our Gemini code base.
Analyzing and reasoning about videos

Analyzing videos is another great capability brought by the fact that Gemini models are naturally multimodal, and this becomes even more compelling with long contexts. Consider Gemini 1.5 Pro’s ability to analyze movies, like Buster Keaton’s 45-minute silent film “Sherlock Jr.” Using one frame per second, we turn this into an input context of 684k tokens. The model can then answer fairly complex questions about the video content, such as:

“Tell me some key information from the piece of paper that is removed from the person’s pocket, and the timecode of that moment.”

Or, a very cursory line drawing of something that happened, combined with “What is the timecode when this happens?”

You can see this interaction here:
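A quick back-of-the-envelope on the numbers above: 45 minutes at one frame per second is about 2,700 frames, so the 684k-token context implies roughly 250 tokens per frame. Here is a minimal sketch of the one-frame-per-second sampling step using OpenCV; the actual preprocessing pipeline isn’t described in the thread, so treat this purely as an illustration.

```python
# A minimal sketch (not the team's pipeline) of sampling a movie at one
# frame per second before handing the frames to a multimodal model.
import cv2  # pip install opencv-python

def sample_one_fps(path: str):
    """Return roughly one frame per second of video as RGB arrays."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frame_idx, frames = 0, []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # Keep only one frame for every `fps` frames read (~1 frame/second).
        if frame_idx % round(fps) == 0:
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        frame_idx += 1
    cap.release()
    return frames

# For a 45-minute film this yields ~2,700 frames; at the ~250 tokens/frame
# implied by the thread's 684k-token figure, that fills most of a 1M window.
```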
Reasoning about long text documents

The model also shines at analyzing long, complex text documents, like Victor Hugo’s five-volume novel “Les Misérables” (1,382 pages, 732k tokens). The multimodal capabilities are demonstrated by coarsely sketching a scene and asking “Look at the event in this drawing. What page is this on?”
Reasoning about long text documents (cont)

Gemini 1.5 Pro can analyze and summarize the 402-page transcripts from Apollo 11’s mission to the moon.

“One small step for man, one giant leap for mankind.”
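The page and token figures in these examples (1,382 pages ≈ 732k tokens, so roughly 500 tokens per page) make it easy to estimate up front whether a document fits in the window. A tiny, hedged estimator using the common assumption of about 4 characters per token for English text (not the actual tokenizer):

```python
# Rough context-budget check using a ~4 characters/token heuristic for
# English text; a real tokenizer would give exact counts.
def rough_token_count(text: str) -> int:
    return len(text) // 4

def fits_in_window(text: str, window: int = 1_000_000) -> bool:
    return rough_token_count(text) <= window

# e.g. fits_in_window(open("les_miserables.txt").read())
```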

Kalamang Translation

One of the most exciting examples in the report involves translation of Kalamang. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea, in the east of Indonesian Papua (endangeredlanguages.com/lang/1891). Kalamang has almost no online presence. Machine Translation from One Book (MTOB: arxiv.org/abs/2309.16575) is a recently introduced benchmark evaluating the ability of a learning system to learn to translate Kalamang from just a single book.

Eline Visser (elinevisser23.github.io) wrote a 573-page book, “A Grammar of Kalamang” (github.com/langsci/344/bl…). Thank you, Eline! The text of this book is used in the MTOB benchmark. With this book and a simple bilingual wordlist (dictionary) provided in context, Gemini 1.5 Pro can learn to translate between English and Kalamang. Without the Kalamang materials in context, the 1.5 Pro model produces almost random translations. However, with the materials in the context window, Gemini 1.5 Pro is able to use in-context learning about Kalamang, and we find that the quality of its translations is comparable to that of a person who has learned from the same materials. With a ChrF of 58.3 for English->Kalamang translation, Gemini 1.5 Pro improves substantially over the best model score of 45.8 ChrF and also slightly exceeds the human baseline of 57.0 ChrF reported in the MTOB paper (arxiv.org/abs/2309.16575).

The possibilities for significantly improving translation for very low-resource languages are quite exciting!
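As a side note on the metric: ChrF is a character n-gram F-score, and it can be computed with the open-source sacrebleu package. A minimal sketch (the sentences below are placeholders, not real Kalamang data):

```python
# Computing ChrF with sacrebleu (pip install sacrebleu). The sentences are
# illustrative placeholders only.
from sacrebleu.metrics import CHRF

hypotheses = ["the dog runs toward the beach"]       # system outputs
references = [["the dog is running to the beach"]]   # one reference stream

chrf = CHRF()
print(chrf.corpus_score(hypotheses, references).score)  # 0-100, higher is better
```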
Needle in a Haystack tests

The tech report also details a number of microbenchmark “needle in a haystack” tests (modeled after @GregKamradt’s github.com/gkamradt/LLMTe…) that probe the model’s ability to retrieve specific information from its context.

For text, Gemini 1.5 Pro achieves 100% recall up to 530k tokens, 99.7% recall up to 1M tokens, and 99.2% recall up to 10M tokens.
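As a rough illustration of what such a test looks like, here is a sketch of building a text haystack with a “needle” planted at a chosen depth and checking whether the model’s answer recovers it. The filler text, needle string, and `ask_model` call are all placeholders; the real evaluation details are in the tech report and in Greg Kamradt’s repository.

```python
# A toy needle-in-a-haystack probe; `ask_model` stands in for a real client.
import random

NEEDLE = "The magic number for the experiment is 48151623."

def build_haystack(n_words: int, depth: float, filler_words) -> str:
    """Insert the needle at `depth` (0.0 = start, 1.0 = end) of the filler."""
    words = [random.choice(filler_words) for _ in range(n_words)]
    words.insert(int(depth * len(words)), NEEDLE)
    return " ".join(words)

filler = "the quick brown fox jumps over the lazy dog".split()
prompt = (build_haystack(100_000, depth=0.5, filler_words=filler) +
          "\n\nWhat is the magic number for the experiment?")
# answer = ask_model(prompt)               # hypothetical model call
# recalled = "48151623" in answer          # 1 = retrieved, 0 = missed
```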
Audio haystack

For audio, Gemini 1.5 Pro achieves 100% recall when looking for different audio needles hidden in ~11 hours of audio.
Video haystack

For video, Gemini 1.5 Pro achieves 100% recall when looking for different visual needles hidden in ~3 hours of video.
Multi-needle in a haystack test

We also created a generalized version of the needle-in-a-haystack test, where the model must retrieve 100 different needles hidden in the context window. For this, we see that Gemini 1.5 Pro’s performance is above that of GPT-4 Turbo at small context lengths and remains relatively steady across the entire 1M context window, while the GPT-4 Turbo model drops off more quickly (and cannot go past 128k tokens).
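Scoring this variant reduces to measuring what fraction of the planted needles the model reproduces. A toy scorer, with made-up needles and a simulated answer:

```python
# Toy scorer for a multi-needle test: recall = fraction of planted facts
# that show up in the model's answer. Needles and answer are illustrative.
needles = [f"Secret item {i} is code-{1000 + i}." for i in range(100)]

def multi_needle_recall(answer: str, needles) -> float:
    found = sum(1 for n in needles if n in answer)
    return found / len(needles)

answer = " ".join(needles[:97])               # pretend 97 of 100 were recovered
print(multi_needle_recall(answer, needles))   # -> 0.97
```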
Basic Capabilities

We’ve been able to add the long context ability without sacrificing other capabilities. Although it required a small fraction of the training time of the Gemini 1.0 Ultra model, the 1.5 Pro model exceeds the performance of the 1.0 Ultra model on 17 of 31 benchmarks. Compared to the 1.0 Pro model, the 1.5 Pro model exceeds its performance on 27 of 31 benchmarks.

Tables 8 and 10 in the tech report have lots more details.
Mixture of Experts

Gemini 1.5 Pro uses a mixture-of-experts (MoE) architecture, building on a long line of Google research efforts on sparse models:

2017: Shazeer et al., Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR 2017. arxiv.org/abs/1701.06538
2020: Lepikhin et al., GShard: Scaling giant models with conditional computation and automatic sharding. ICLR 2021. arxiv.org/abs/2006.16668
2021: Riquelme et al., Scaling vision with sparse mixture of experts. NeurIPS 2021. arxiv.org/abs/2106.05974
2021: Fedus et al., Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR 2022. arxiv.org/abs/2101.03961
2022: Clark et al., Unified scaling laws for routed language models. ICML 2022. arxiv.org/abs/2202.01169
2022: Zoph et al., Designing effective sparse expert models. arxiv.org/abs/2202.08906
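The thread doesn’t describe Gemini’s MoE internals, but the shared idea across these papers is that a learned router activates only a few expert subnetworks per token, so parameter count grows without a matching growth in per-token compute. A toy sketch of top-k gating in NumPy (emphatically not the Gemini implementation):

```python
# A toy top-k gated mixture-of-experts layer in NumPy, in the spirit of
# Shazeer et al. (2017). Gemini's actual MoE design is not public; this only
# shows the routing idea: each token runs through k of n experts.
import numpy as np

def moe_layer(x, gate_w, expert_ws, k=2):
    """x: (d,) activation for one token; gate_w: (d, n_experts);
    expert_ws: list of n_experts (d, d) weight matrices."""
    logits = x @ gate_w                      # router score for each expert
    top_k = np.argsort(logits)[-k:]          # indices of the k best experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates /= gates.sum()                     # softmax over the selected experts
    # Only the k chosen experts are evaluated, which is where the sparsity
    # (and the compute savings) comes from.
    return sum(g * (x @ expert_ws[i]) for g, i in zip(gates, top_k))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
y = moe_layer(rng.normal(size=d),
              rng.normal(size=(d, n_experts)),
              [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(y.shape)  # (16,)
```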
Gratitude

It’s really great to get to work with such an awesome set of colleagues on a project like Gemini.

@OriolVinyalsML and I just want to say a heartfelt thank you to everyone on the team for the hard work on Gemini so far, and we’re so proud of what we’ve built together! 🙏
Seeing farther

Sometimes when you're looking at something, you're a bit too close to it to see important aspects.

The long context work enables all of us to see a bit further.
We are very excited to be able to share our work on this next iteration of Gemini. We can’t wait to see how people creatively use the long context capabilities to do all kinds of things that previously were not possible. How would you use these capabilities?

And now, back to some other things we’re ultra excited about!
