New video: Tracking every item in my house with video using Google Gemini 🎥 -> 🛋️
I call it "KeepTrack" 😎
Input: 10-minute casual walk-around video.
Output: Structured database w/ 70+ items.
Cost: ~$0.07 w/ caching, ~$0.10 w/o caching.
Full details... 🧵
TL;DR
Gemini features which make this possible:
1. Video processing.
2. Long context windows (video data = ~300 tokens per second; a 10-minute video = ~165,000 tokens).
3. Context caching (process inputs once, run inference 4x cheaper).
Prices are with Gemini 1.5 Flash.
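Here's the back-of-envelope math (a sketch in Python; the per-million-token rates are illustrative Gemini 1.5 Flash prices, not quotes, so check the current pricing page):

```python
VIDEO_TOKENS = 165_000       # ~10 min of video
PROMPT_TOKENS = 11_000       # ~4,000 words of instructions
PASSES = 3                   # KeepTrack reads the same video three times

INPUT_PER_M = 0.15              # $/1M input tokens (illustrative >128k-prompt rate)
CACHED_PER_M = INPUT_PER_M / 4  # cached input is ~4x cheaper

tokens_per_pass = VIDEO_TOKENS + PROMPT_TOKENS  # 176,000
no_cache = PASSES * tokens_per_pass / 1e6 * INPUT_PER_M               # ~$0.08
with_cache = (tokens_per_pass / 1e6 * INPUT_PER_M                     # pass 1, full price
              + (PASSES - 1) * tokens_per_pass / 1e6 * CACHED_PER_M)  # ~$0.04
# Output tokens and cache storage add a little on top of both figures.
```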
1. Video Processing/Prompting
Intuition: Show *and* tell.
Technical: Video = 4 modalities in one: vision, audio, time, text (Gemini can read text in video frames).
Instead of writing down every item in my house, I just walked through pointing at things and talking about them.
Gemini tracked everything I said/saw almost flawlessly (it missed a few things due to 1 FPS sampling but this will get better).
Doing this via text/photos alone would've taken much longer.
There are many more fields/problems where video input unlocks a whole new range of possibilities.
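A minimal sketch of this flow with the google-generativeai Python SDK (the file name and prompt wording are mine, not the exact KeepTrack code):

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the house tour; the File API processes video asynchronously.
video = genai.upload_file(path="house_tour.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([
    video,
    "List every item I point at or mention in this video, "
    "with the timestamp where each one appears.",
])
print(response.text)
```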
2. Long Context Windows
Generative models like Gemini generate outputs based on their inputs.
And Gemini’s long context windows mean they can create outputs based on incredibly large amounts of input data.
Gemini Flash can handle 1 million input tokens (a token is a small piece of input information whether text, image or audio).
And Gemini Pro can handle 2 million input tokens.
This long context window is important in our use case because video data requires far more tokens than words.
In KeepTrack, our ~4,000-word text-based instructions come to ~11,000 tokens.
However, our 10-minute 30 FPS 720p house tour video comes to ~165,000 tokens (~300 tokens per sampled frame, since Gemini samples video at 1 FPS).
This means that even with ~176,000 input tokens, Gemini still has plenty of headroom if we needed to input a longer video or more instructions.
Without a long context window input, our use case wouldn’t be possible.
From a developer standpoint, a long context window means less data preprocessing and wrangling, and more letting the model do the heavy lifting.
And from an end user standpoint, it means there’s no multi-step process on the frontend to upload separate pieces of data.
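Continuing the sketch above, you can sanity-check the token budget before a run with count_tokens (LONG_INSTRUCTIONS is a placeholder for the 4,000-word prompt):

```python
BUDGET = 1_000_000  # Gemini 1.5 Flash input context window

usage = model.count_tokens([video, LONG_INSTRUCTIONS])  # placeholder prompt
print(f"{usage.total_tokens:,} tokens used, {BUDGET - usage.total_tokens:,} to spare")
# For KeepTrack: ~165,000 (video) + ~11,000 (instructions) = ~176,000 tokens,
# leaving ~824,000 tokens of headroom.
```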
3. Context Caching
Context caching = process information once, store in cache, perform several passes for 4x less.
In KeepTrack, we perform three passes of the same video to generate an accurate structured data output of household items.
Rather than process the video from scratch for each pass, context caching means we can process it once and use it as inputs to a subsequent step at a far lower cost.
From a developer standpoint, context caching means 4x cheaper input token processing.
From an end user standpoint, context caching means more accurate results, because Gemini can take multiple passes over the data.
Thanks to context caching, processing a 10-minute video 3x and extracting high-quality itemised information cost less than 10c.
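A sketch of what the caching step could look like with the SDK's caching module (the pinned model version string and TTL are illustrative; `video` is the uploaded file from earlier):

```python
import datetime
from google.generativeai import caching

# Cache the processed video once; caching requires a pinned model version.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",  # illustrative pinned version
    contents=[video],                     # the uploaded house-tour file
    ttl=datetime.timedelta(minutes=30),   # long enough for all three passes
)

# Every call on this model reuses the cached video tokens at ~4x lower input cost.
cached_model = genai.GenerativeModel.from_cached_content(cached_content=cache)
```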
So how does it happen?
We use three major steps:
1. Video + initial prompt + examples -> check outputs, fix if necessary.
2. Video + secondary prompt + examples -> check outputs, fix if necessary.
3. Video + final prompt + examples -> check outputs, fix if necessary.
And two major model instances:
1. A Gemini model for the video inference (with a different input prompt each step).
2. A Gemini model for fixing the CSV if it fails validation checks (the check is the same for each step).
Each step builds upon the outputs of the previous step.
Step 1 produces the initial information extraction and details.
Step 2 takes step 1's outputs and tries to expand on them if necessary.
Step 3 reviews the combined outputs of steps 1 and 2 and finalizes them.
All major steps use the same Gemini model instance with a context-cached video input.
Each step ends with a verification pass to make sure its outputs are valid (e.g. if simple programmatic checks on the CSV fail, the fixer model reformats it).
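Putting it together, a simplified sketch of the three-pass loop (the prompts are placeholders and the CSV check is a stand-in for KeepTrack's actual validation):

```python
import csv
import io

EXPECTED_COLUMNS = 5  # hypothetical schema: name, room, timestamp, quantity, notes

def csv_is_valid(text: str) -> bool:
    """Cheap programmatic check: every row parses with the right width."""
    rows = list(csv.reader(io.StringIO(text)))
    return bool(rows) and all(len(row) == EXPECTED_COLUMNS for row in rows)

fixer = genai.GenerativeModel("gemini-1.5-flash")  # second instance, no video input

result = ""
for prompt in (INITIAL_PROMPT, SECONDARY_PROMPT, FINAL_PROMPT):  # placeholder prompts
    # Each pass sees the cached video plus the previous pass's output.
    contents = [prompt, result] if result else [prompt]
    result = cached_model.generate_content(contents).text
    if not csv_is_valid(result):
        result = fixer.generate_content(
            f"Fix this CSV so every row has exactly {EXPECTED_COLUMNS} columns. "
            f"Return only the CSV.\n\n{result}"
        ).text
```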
Bonus: Gemini can do bounding boxes :D
Even works on:
- blurry frames (some frames are blurry due to 1 FPS sampling)
- kind of obscure items like "weight bag"
- multiple items (e.g. 3x outdoor benches)
- items in packaging (e.g. folding chairs in their bags)
All boxes were output by Gemini given only two inputs: the video frame at the timestamp where Gemini said the item was, plus the item name Gemini gave for that timestamp.
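A sketch of that step (Gemini returns boxes as [ymin, xmin, ymax, xmax] normalised to 0-1000; the frame extraction and prompt wording here are my own):

```python
import json

# One frame extracted at the timestamp Gemini gave for the item.
frame = genai.upload_file(path="frame_at_timestamp.png")
response = model.generate_content([
    frame,
    'Return the bounding box of the "weight bag" as a JSON list '
    '[ymin, xmin, ymax, xmax], with values normalised to 0-1000.',
])
# In practice you may need to strip markdown fences from response.text first.
ymin, xmin, ymax, xmax = json.loads(response.text)

# Convert to pixel coordinates for a 1280x720 frame.
w, h = 1280, 720
box_px = (xmin / 1000 * w, ymin / 1000 * h, xmax / 1000 * w, ymax / 1000 * h)
```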
Bonus 2: I created a barebones web app to inspect/correct the results.
Most of the results were more than good enough for simple record keeping.
Tidbits & takeaways.
Future avenues/other things I tried:
• Creating a "story mode"/"memory palace" then turning the story into structured data actually worked *really* well. Example: "watch this video and create a memory palace story of all the major items"
Why do this?
1. Fun :D
2. I actually had this problem. I was trying to sell my place/get a small office/apply for different insurance, and they asked "how much should we insure you for?", so I decided to find out an actual answer.
Things I've tried this workflow on: office, storage shed, home, record collection (works quite well).
3. This was my entry to a Kaggle competition to showcase the Gemini long context window.
All code + prompts + data are available.
You could replicate this workflow by swapping in your own Gemini API key + input video (or try it with my video).
If machine learning projects were a relationship...
Data collection and processing is the dating phase: fun, chaotic, up and down, tormenting and carefree, seeing if you're a good fit.
Modelling is the wedding day, takes forever to plan, over before you know it.
People using your model is the honeymoon.
Then comes the data drift.
Your data changes like the person you thought you married, maybe they're getting fat (distribution changes) or they're finding it hard to love you (your data features are no longer ideal).
So you bring in data monitoring, model evaluation (marriage counselling) and pull all the tricks.
Your marriage counsellor tells you to go back to what got you started.
The fun dates (collecting data), talking for hours learning about each other (processing data).
1. 🤔 Problems - some of the main use cases for ML.
2. ♻️ Process - what does a solution look like?
3. 🛠 Tools - how can you build your solution?
4. 🧮 Math - ML is applied mathematics, what kind?
5. 📚 Resources - where to learn the above.
Although very colorful, the map can be very intimidating at first glance.
So there's a video walkthrough to go along with it:
We start with a high-level overview which answers questions like "what is machine learning good for?"