Daniel Bourke · Dec 3
New video: Tracking every item in my house with video using Google Gemini 🎥 -> 🛋️

I call it "KeepTrack" 😎

Input: a 10-minute casual walk-around video.
Output: Structured database w/ 70+ items.
Cost: ~$0.07 w/ caching, ~$0.10 w/o caching.

Full details... 🧵
TL;DR

Gemini features which make this possible:

1. Video processing.
2. Long context windows (video ≈ 275 tokens per second, so a 10-minute video ≈ 165,000 tokens).
3. Context caching (process inputs once, inference for 4x cheaper).

Prices are with Gemini 1.5 Flash.

[Image: Bar chart comparing total costs of processing a large video file three times with cached vs. non-cached input tokens: ~$0.0742 cached vs. ~$0.0899 non-cached, a 17.46% saving.]
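For the arithmetic behind that chart, here's a minimal sketch in Python. The two dollar figures come straight from the chart; the closing comment about why the total saving is smaller than 4x is my reading of how caching is billed, not the author's.

```python
# Total-cost comparison for three passes over the same video,
# using the figures from the chart above.
cached = 0.0741937875      # total cost with context caching
non_cached = 0.08989155    # total cost without caching

savings = (non_cached - cached) / non_cached
print(f"Caching saves {savings:.2%} overall")  # -> Caching saves 17.46% overall

# Note (assumption): cached *input* tokens are ~4x cheaper, but outputs and the
# first (cache-creating) pass are billed normally, so the total saving is smaller.
```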
1. Video Processing/Prompting

Intuition: Show *and* tell.
Technical: Video = 4 modalities in one: vision, audio, time, text (Gemini can read text in video frames).

Instead of writing down every item in my house, I just walked through pointing at things and talking about them.

Gemini tracked everything I said/saw almost flawlessly (it missed a few things due to 1 FPS sampling but this will get better).

Doing this via text/photos alone would've taken much longer.

There are many more fields/problems where video input unlocks a whole new range of possibilities.

[Image: Screenshot of the Gemini API documentation on prompting with video, listing supported formats (MP4, MPEG, MOV, AVI and others) and noting that the File API is required for uploading large video files.]
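As a rough idea of what this looks like in code, here's a minimal sketch using the google-generativeai Python SDK's File API. The file name and prompt are illustrative, not the notebook's exact ones; check the current docs for signatures.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

# Upload the walkthrough video via the File API (required for large videos).
video_file = genai.upload_file(path="house_tour.mp4")

# Wait for the video to finish server-side processing before prompting.
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    [video_file, "List every household item shown or mentioned in this video."]
)
print(response.text)
```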
2. Long Context Windows

Generative models like Gemini generate outputs based on their inputs.

And Gemini's long context window means it can create outputs based on incredibly large amounts of input data.

Gemini Flash can handle 1 million input tokens (a token is a small piece of input information whether text, image or audio).

And Gemini Pro can handle 2 million input tokens.

This long context window is important for our use case because video data requires far more tokens than text does.

In KeepTrack, our 4,000-word text-based instructions equal ~11,000 tokens.

However, our 10-minute 30 FPS 720p house tour video equals ~165,000 tokens (sampled at 1 FPS, ~275 tokens per sampled frame).

This means that even with ~176,000 input tokens, Gemini still has plenty of headroom if we need to input a longer video or more instructions.
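As a sanity check on those numbers, a back-of-the-envelope sketch using the thread's own figures (not exact API token counts):

```python
# Token budget for KeepTrack, using the approximate figures above.
video_tokens = 165_000          # 10-minute video at 1 FPS sampling
prompt_tokens = 11_000          # ~4,000 words of instructions
total = video_tokens + prompt_tokens    # 176,000

context_window = 1_000_000      # Gemini 1.5 Flash
print(f"Using {total:,} of {context_window:,} tokens ({total / context_window:.0%})")
# -> Using 176,000 of 1,000,000 tokens (18%)

# Headroom: roughly how many more minutes of video would still fit?
tokens_per_minute = video_tokens / 10
print(f"~{(context_window - total) / tokens_per_minute:.0f} more minutes of video would fit")
# -> ~50 more minutes
```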

Without a long context window input, our use case wouldn’t be possible.

From a developer standpoint, the benefits of a long context window mean less data preprocessing and wrangling and more letting the model do the heavy lifting.

And from an end user standpoint, it means there’s no multi-step process on the frontend to upload separate pieces of data.
3. Context Caching

Context caching = process information once, store it in a cache, then perform several passes at 4x lower input-token cost.

In KeepTrack, we perform three passes of the same video to generate an accurate structured data output of household items.

Rather than processing the video from scratch for each pass, context caching means we can process it once and reuse it as input to each subsequent step at a far lower cost.

From a developer standpoint, context caching means 4x cheaper input token processing.

From an end user standpoint, context caching means more accurate results, because Gemini can affordably take multiple passes over the same data.

Thanks to context caching, processing a 10-minute video 3x and extracting high-quality itemised information cost less than $0.10.
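In SDK terms, context caching looks roughly like this (a sketch against the google-generativeai Python API; the display name is illustrative, and cache-compatible model versions/TTLs may differ in current docs):

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

# Process the uploaded video once and store it in a cache.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",   # caching requires a pinned model version
    display_name="house-tour-video",
    contents=[video_file],                 # the File API handle from earlier
    ttl=datetime.timedelta(minutes=30),
)

# Every subsequent pass reuses the cached video at the reduced input rate.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Extract every item as CSV: name,room,description")
```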
So how does it happen?

We use three major steps:

1. Video + initial prompt + examples -> check outputs, fix if necessary.
2. Video + secondary prompt + examples -> check outputs, fix if necessary.
3. Video + final prompt + examples -> check outputs, fix if necessary.

And two major model instances:

1. Gemini Model for doing the video inference (with a different input prompt each step).
2. Gemini Model for fixing the CSV if it fails validation checks (the check is the same for each step).
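The validation/fix loop might look something like this (a sketch; the actual checks and prompts live in the Kaggle notebook):

```python
import csv
import io

def validate_csv(text: str) -> bool:
    """Simple programmatic checks: parseable, non-empty, consistent column counts."""
    try:
        rows = list(csv.reader(io.StringIO(text)))
    except csv.Error:
        return False
    return bool(rows) and all(len(row) == len(rows[0]) for row in rows)

def fix_csv(fixer_model, bad_csv: str) -> str:
    """Ask the second Gemini instance to repair a CSV that failed the checks."""
    response = fixer_model.generate_content(
        "This CSV failed validation. Fix the formatting so every row matches "
        "the header and return ONLY the corrected CSV:\n\n" + bad_csv
    )
    return response.text
```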

Each step builds upon the outputs of the previous step.

Step 1 produces the initial information extraction and details.

Step 2 takes step 1's outputs and tries to expand on them if necessary.

Step 3 reviews the combined outputs of steps 1 and 2 and finalizes them.

All major steps use the same Gemini model instance with a context cached video input.

Each step has a verification stage to make sure its outputs are valid (e.g. if simple programmatic checks on the CSV fail, the fixer model repairs the formatting).

[Image: Flowchart of the workflow: initial extraction from the video with the first prompt, CSV validation, fixing invalid CSVs, then expanding and finalizing the extraction with the secondary and final prompts, leading to the final results.]
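Putting the three passes together, the loop is roughly this sketch (the prompt names are placeholders for the real prompts in the notebook; `model` is the cache-backed instance from earlier and `fixer_model` a plain one):

```python
# Three passes over the same cached video, each building on the last.
prompts = [INITIAL_PROMPT, SECONDARY_PROMPT, FINAL_PROMPT]  # placeholder prompt strings

results_csv = ""
for prompt in prompts:
    # The cached video is prepended automatically; we only send the new text.
    response = model.generate_content([prompt, "Previous results:\n" + results_csv])
    candidate = response.text

    # Same verification at every step: fix the CSV if the checks fail.
    if not validate_csv(candidate):
        candidate = fix_csv(fixer_model, candidate)

    results_csv = candidate

print(results_csv)  # final structured database of 70+ items
```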
Bonus: Gemini can do bounding boxes :D

Even works on:
- blurry frames (some frames are blurry due to 1 FPS sampling)
- kind of obscure items like "weight bag"
- multiple items (e.g. 3x outdoor benches)
- items in packaging (e.g. folding chairs in their bags)

All boxes were output by Gemini given only two inputs: the video frame at the timestamp where Gemini said the item was, plus the item name Gemini reported for that timestamp.

[Images: bounding box detections on a green armchair in a blurry frame, a weight bag in a gym, 3 outdoor benches in an outdoor setting, and 2 folding chairs in their bags.]
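A sketch of that bonus step. Gemini typically returns boxes as [ymin, xmin, ymax, xmax] normalised to 0-1000, though the exact format/prompting is worth checking against current docs; the file name here is illustrative.

```python
from PIL import Image
import google.generativeai as genai

# Grab the frame at the timestamp where Gemini said the item appears.
frame = Image.open("frame_at_timestamp.png")   # illustrative file name

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([
    frame,
    "Return the bounding box of the 'weight bag' as [ymin, xmin, ymax, xmax].",
])
print(response.text)  # e.g. [412, 130, 880, 560] -- scale by width/height / 1000 for pixels
```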
Bonus 2: I created a barebones web app to inspect/correct the results.

Most results were more than accurate enough for simple record keeping.
Tidbits & takeaways.

[Image: Text-based guide with tips on using Gemini for video data extraction: XML/HTML tag formatting, CSV vs JSON outputs, bounding box formatting, video sampling rates, caching strategies, experiment workflows and building task-specific models, ending with an emphasis on visualization and recent advancements.]
Future avenues/other things I tried:

• Creating a "story mode"/"memory palace" and then turning the story into structured data actually worked *really* well. Example: "watch this video and create a memory palace story of all the major items" (see the sketch below).

[Image: Notes titled "Future avenues": story-like outputs, schema automation, leveraging long context windows, evaluating outputs without audio, improving evaluations, and higher-FPS sampling for more frame coverage.]
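A sketch of that two-step idea, with illustrative prompts (assuming `model` is the cache-backed instance with the video already attached; the author's exact prompts are in the notebook):

```python
# Step 1: generate the "memory palace" story from the cached video.
story = model.generate_content(
    "Watch this video and create a memory palace story of all the major items."
).text

# Step 2: turn the story into structured data.
structured = model.generate_content(
    "Turn this story into a CSV with columns name,room,description:\n\n" + story
).text
```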
Why do this?

1. Fun :D

2. I actually had this problem. While trying to sell my place/get a small office/apply for different insurance, I was asked "how much should we insure you for?", so I decided to find an actual answer.

Things I've tried this workflow on: office, storage shed, home, record collection (works quite well).

3. This was my entry to a Kaggle competition showcasing the Gemini long context window.
All code + prompts + data is available.

You can replicate this workflow by swapping in your own Gemini API key and input video (or try it with my video).
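Concretely, the only two things to change (names illustrative):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")             # 1. your own API key
video_file = genai.upload_file(path="my_walkthrough.mp4")  # 2. your own video (or mine from the dataset)
```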

• Code + prompts: kaggle.com/code/mrdbourke…
• Data: kaggle.com/datasets/mrdbo…
• Original competition: kaggle.com/competitions/g…
