Santiago
Feb 9, 2021 · 27 tweets
Seriously though, how the heck can a computer recognize what's in an image?

Grab a coffee ☕️, and let's talk about one of the core ideas that makes this possible.

(I'll try to stay away from the math, I promise.)

👇
If you are a developer, spend a few minutes trying to think about a way to solve this problem:

→ Given an image, you want to build a function that determines whether it shows a person's face.

2/
It gets overwhelming fast, right?

What are you going to do with all of these pixels?

3/
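To make the scale concrete, here's a tiny NumPy sketch of what a function like this would actually receive as input (the 224×224 size is a made-up example; any size makes the same point):

```python
import numpy as np

# A hypothetical 224x224 RGB image: nothing but a grid of numbers.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Your function would receive over 150,000 raw pixel values.
print(image.size)  # 224 * 224 * 3 = 150528
```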
Alright, you get the idea: this is a hard problem to solve, and we can't just develop our way out of it.

So let's talk about machine learning.

More specifically, let's talk about Convolutional Neural Networks.

4/
Well, I'm skipping like 300 layers of complexity here.

We should start talking about neural networks and build from that idea, but that'll be boring, and I'm sure you've heard of them before.

If you want a refresher, here is an amazing video:

5/
Fully connected networks are cool, but convolutional layers transformed the field.

I want to focus on them, so next time somebody mentions "convolution," you know exactly what's going on.

6/
Before getting too technical, let's try to break down the problem in a way that makes the solution a little bit more intuitive.

Understanding an image's contents is not about individual pixels but about the patterns formed by nearby pixels.

7/
For instance, think about Lena's picture attached here.

You get a bunch of pixels that together form the left eye. Another bunch that makes up the right eye. You have the nose, mouth, eyebrows, etc.

Put them together, and you get her face.

8/
Wave your magic wand and imagine you could build a set of functions, each specializing in detecting one part of the face.

In the end, you run every function, and if every piece is found, you flag the image as a face.

Easy, right?

9/
But how do we find an eye in a picture?

Well, we could keep breaking the problem into smaller pieces.

There are lines, circles, colors, and patterns that together make up an eye. We could build more functions that detect each one of those separately.

10/
See where I'm going here?

We could build hundreds of functions, each one specializing in a specific portion of the face. Then have them look at the entire picture.

We can then put them together like a giant puzzle to determine whether we are looking at a face.

🙃

11/
I'm happy with that idea because I think it makes sense!

But building hundreds of little functions looking for individual patterns in an image is still a huge hurdle.

😬

Where do we start?

12/
Enter the idea of a "filter" (also called a kernel): a small square matrix that we move across the image from top left to bottom right.

Every time we do this, we compute a value using a "convolution" operation.

13/
Look at this picture.

A convolution operation multiplies the filter element-wise with the patch of the image underneath it, then sums the results into a single value. In other words, it's a dot product between the filter and the patch.

After doing this, we move the filter over one position and do it again.

14/
Here is the first convolution operation.

It produces a single value (0.2).

After doing this, we will convolve the filter with the next patch from the image and repeat this until we cover the whole picture.

Ok, this is as much math as I want you to endure.

15/
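In code, that sliding-window operation looks like this. A minimal NumPy sketch (stride 1, no padding), not the thread's actual implementation:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and, at each position, multiply it
    element-wise with the patch underneath and sum into one value."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 4x4 image convolved with a 3x3 filter yields a 2x2 output.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.full((3, 3), 1 / 9)  # an averaging filter, for illustration
print(convolve2d(image, kernel).shape)  # (2, 2)
```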
Here's what's cool about this: convolving an image with different filters will produce different outputs!

The attached code uses the filter2D() function from OpenCV to convolve an image with two different filters.

Code: gist.github.com/svpino/be7ba9b…

16/
Look at the results here.

Notice how one of the pictures shows all the horizontal edges, while the other only shows the vertical edges.

Pretty cool, huh?

17/
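You can reproduce the effect without OpenCV. Here's a hedged sketch using Sobel-style kernels (assumed values; the gist's filters may differ) and a toy image with a single horizontal edge:

```python
import numpy as np

# Sobel-style edge filters (assumed values; the gist may use others).
horizontal = np.array([[-1., -2., -1.],
                       [ 0.,  0.,  0.],
                       [ 1.,  2.,  1.]])
vertical = horizontal.T

# Toy image: dark top half, bright bottom half -> one horizontal edge.
image = np.zeros((6, 6))
image[3:, :] = 1.0

def convolve(img, k):
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

# The horizontal filter fires on the edge; the vertical one sees nothing.
print(np.abs(convolve(image, horizontal)).max())  # 4.0
print(np.abs(convolve(image, vertical)).max())    # 0.0
```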
Even better: since we are convolving each filter with the entire input image, we can detect features regardless of where they are located!

This is a crucial characteristic of Convolutional Neural Networks. Smart people call it "translation invariance."

18/
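To see that in action, here's a small sketch (assumed filter values, toy images): the same filter detects a horizontal edge no matter which row it sits on, and its strongest response simply moves with the edge.

```python
import numpy as np

kernel = np.array([[-1., -1., -1.],
                   [ 0.,  0.,  0.],
                   [ 1.,  1.,  1.]])  # a horizontal-edge filter

def convolve(img, k):
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def image_with_edge(row, size=10):
    """All-zeros image that switches to ones from `row` down."""
    img = np.zeros((size, size))
    img[row:, :] = 1.0
    return img

# Same filter, same peak response; only its location moves with the edge.
for row in (3, 6):
    response = convolve(image_with_edge(row), kernel)
    peak_row = np.unravel_index(response.argmax(), response.shape)[0]
    print(response.max(), peak_row)
```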
Quick summary so far:

▫️ We have a bunch of filters
▫️ Each one worries about a specific pattern
▫️ We convolve them with the input image
▫️ They can detect patterns wherever they are

Do you see where this is going?

19/
The functions that we talked about before are just different filters that highlight different patterns from our image!

We can then combine the filters' outputs to find larger patterns and, ultimately, decide whether we are looking at a face.

20/
One more thing: how do we come up with the values that we need for each filter?

Horizontal and vertical edges aren't a big deal, but we will need much more than that to solve our problem.

21/
Here is where the magic happens!

Our network will learn the value of the filters during training!

We'll show it many faces, and the network will come up with useful filters that will help detect faces.

🤯

22/
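Real networks learn their filters by backpropagation. As a toy illustration (pure NumPy, made-up setup, not the thread's code), we can recover a hidden edge filter by gradient descent, using only its outputs on random images:

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve(img, k):
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

# A "secret" filter that generates our training labels.
target = np.array([[-1., -1., -1.],
                   [ 0.,  0.,  0.],
                   [ 1.,  1.,  1.]])

images = [rng.standard_normal((8, 8)) for _ in range(20)]
labels = [convolve(img, target) for img in images]

# Start from random values and learn the filter from data.
w = rng.standard_normal((3, 3))
for _ in range(150):                       # epochs
    for img, y in zip(images, labels):
        err = convolve(img, w) - y         # prediction error
        grad = np.zeros_like(w)            # d(MSE)/dw, patch by patch
        for i in range(err.shape[0]):
            for j in range(err.shape[1]):
                grad += err[i, j] * img[i:i + 3, j:j + 3]
        w -= 0.01 * grad / err.size

print(np.round(w, 2))  # ≈ the hidden edge filter, learned from examples
```

The filter values were never written down anywhere: they emerged from examples, which is exactly what happens (at much larger scale) inside a CNN.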
None of this would be possible without everything you already know about neural networks.

I also didn't talk about other operations that make Convolutional Networks work.

But hopefully, this thread highlights the main idea: convolutions rock!

23/
If you enjoy my attempts to make machine learning a little more intuitive, stay tuned and check out @svpino for more of these threads.
There's no way to tell what specific features the filters will learn.

The expectation is that they'll focus on the face, but they may learn useless features as well.

Hence the importance of validating the results and properly curating the dataset.

Great question!

In this particular case, the resulting images have the same dimensions because filter2D() pads the image border (it uses cv2.BORDER_DEFAULT, which reflects the edge pixels).

But you are right: the result of a pure convolution operation will give us smaller dimensions.
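Here's a quick sketch of both behaviors (pure NumPy; `np.pad` with `"reflect"` mirrors what cv2.BORDER_DEFAULT / BORDER_REFLECT_101 does, as an approximation of the OpenCV call rather than the gist's code):

```python
import numpy as np

image = np.ones((5, 5))
k = 3  # filter side

# A pure ("valid") convolution shrinks the output:
# output side = image side - filter side + 1
valid_side = image.shape[0] - k + 1
print(valid_side)  # 3

# Padding the border first keeps the output the same size as the input.
padded = np.pad(image, (k - 1) // 2, mode="reflect")
same_side = padded.shape[0] - k + 1
print(same_side)  # 5
```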

Speaking about patterns and generalization, here is the natural continuation of this thread:
