Tivadar Danka

8 Apr, 10 tweets, 3 min read

One of my favorite convolutional network architectures is the U-Net.

It solves a hard problem in such an elegant way that it became one of the most performant and popular choices for semantic segmentation tasks.

How does it work?

🧵 👇🏽

Let's quickly recap what semantic segmentation is: a common computer vision task where we want to assign a class to each pixel of an image.

Because we want to provide a prediction on a pixel level, this task is much harder than classification.
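
To make "a prediction on a pixel level" concrete, here is a minimal sketch. The class scores are made up, and the three classes (background, cell, border) are only an assumption for illustration. A segmentation model outputs one score per class for every pixel, and the label map is the highest-scoring class at each position.

```python
# Hypothetical per-pixel class scores for a 2x3 image and 3 classes
# (0 = background, 1 = cell, 2 = border). The numbers are invented.
scores = [
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1]],
    [[0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.1, 0.2, 0.7]],
]

def segment(scores):
    """Return a label map: the index of the highest score at every pixel."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]

label_map = segment(scores)
print(label_map)  # [[0, 1, 1], [0, 2, 2]]
```

An image classifier would produce a single label for the whole image. Here there is one decision per pixel.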

Since the classic paper Fully Convolutional Networks for Semantic Segmentation by Jonathan Long, Evan Shelhamer, and Trevor Darrell, end-to-end encoder-decoder architectures have been the most common choice for this.

(Image source: paper above, arxiv.org/abs/1411.4038v2)

One of the huge advantages of the fully convolutional architecture is that it eliminates the need for hand-engineered post-processing.

Due to the end-to-end training, post-processing is learned!

However, this is not without new complications.

These networks first downsample the image, learning a feature representation. This feature representation is then upsampled to predict class labels per pixel.

There is a huge problem: information is lost during downsampling. Deeper architecture means more information loss.
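
A toy sketch of this loss, under simplifying assumptions: one row of a binary mask, max pooling as the downsampling step, and nearest-neighbor repetition as the upsampling step. Real networks learn their upsampling, so this only illustrates the principle.

```python
def max_pool(row, k=2):
    """Downsample by taking the maximum over non-overlapping windows of size k."""
    return [max(row[i:i + k]) for i in range(0, len(row), k)]

def upsample_nearest(row, k=2):
    """Upsample by repeating every value k times."""
    return [v for v in row for _ in range(k)]

# One row of a binary mask: two objects (1s) separated by a one-pixel gap (0).
row = [0, 1, 1, 0, 1, 1, 0, 0]

restored = upsample_nearest(max_pool(row))
print(row)       # [0, 1, 1, 0, 1, 1, 0, 0]
print(restored)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

After a single round of pooling, the one-pixel gap is gone and the two objects have merged into one.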

In certain fields, this is a big issue.

For instance, in cell microscopy, cells can grow really close to each other, even as close as 1-2 pixels. Downsampling destroys these small margins.

This is demonstrated in the U-Net paper, as you can see below.

In their paper, Olaf Ronneberger, Philipp Fischer, and Thomas Brox introduce U-Net to solve the problem (arxiv.org/abs/1505.04597).

The solution is elegant and simple: save the feature maps that enter each downsampling layer, then concatenate them with the upsampled features at the corresponding upsampling step.
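
Here is a minimal sketch of that idea on one row of a binary mask. The helper names (encoder_step, decoder_step) are hypothetical, and the final min rule is only a stand-in for the convolutions a real U-Net learns after the concatenation.

```python
def max_pool(row, k=2):
    """Downsample by taking the maximum over non-overlapping windows of size k."""
    return [max(row[i:i + k]) for i in range(0, len(row), k)]

def upsample_nearest(row, k=2):
    """Upsample by repeating every value k times."""
    return [v for v in row for _ in range(k)]

def encoder_step(features):
    """Return the pooled features plus the pre-pooling input, saved for later."""
    skip = list(features)
    return max_pool(features), skip

def decoder_step(coarse, skip):
    """Upsample, then pair each position with the saved high-resolution value."""
    up = upsample_nearest(coarse)
    return [(u, s) for u, s in zip(up, skip)]  # two "channels" per position

# Two objects separated by a one-pixel gap at index 3.
row = [0, 1, 1, 0, 1, 1, 0, 0]
coarse, skip = encoder_step(row)
merged = decoder_step(coarse, skip)

# Stand-in for the learned layers after concatenation (an assumption):
# keep a pixel only if both channels are active.
prediction = [min(u, s) for u, s in merged]
print(prediction)  # [0, 1, 1, 0, 1, 1, 0, 0]
```

The upsampled channel alone has lost the gap, but the saved channel still carries it, so the layers after the concatenation can recover the fine detail.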

U-Net not only solved the information loss problem, it also beat the competing segmentation methods on the benchmarks reported in the paper.

Even half a decade later, U-Net is often the go-to model for the task.

Personally, this is the first thing I try for a new dataset.

Its popularity is reflected by the 24842 citations to date, catapulting the paper into the machine learning hall of fame.

By the time this tweet is published, this number is probably going to increase.

Image sources.
1st image: Fully Convolutional Networks for Semantic Segmentation by Jonathan Long et al., arxiv.org/abs/1411.4038v2

The rest: U-Net: Convolutional Networks for Biomedical Image Segmentation by Olaf Ronneberger et al., arxiv.org/abs/1505.04597

More from @TivadarDanka

Tivadar Danka

@TivadarDanka

7 Apr

There is a common misconception that all probability distributions are like a Gaussian.

Often, the reasoning involves the Central Limit Theorem.

This is not exactly right: they resemble Gaussian only from a certain perspective.

🧵 👇🏽

Let's state the CLT first. If 𝑋₁, 𝑋₂, ..., 𝑋ₙ are independent and identically distributed random variables with finite variance, their centered and scaled sum converges in distribution to a Gaussian as 𝑛 grows.

The surprising thing here is that the limit does not depend on the variables' distribution.

Note that the random variables undergo a significant transformation: they are summed, centered by the mean, and scaled by the standard deviation and √𝑛.

(The scaling transformation is the "certain perspective" I mentioned in the first tweet.)
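
A small simulation of that transformation, as a sketch. The choice of Uniform(0, 1) variables, 𝑛 = 50, and 20 000 repetitions is arbitrary. A single uniform variable looks nothing like a Gaussian, but the centered and scaled sum should behave like one.

```python
import math
import random

def standardized_sum(n, rng):
    """Sum n Uniform(0, 1) draws, subtract the mean n/2, divide by sqrt(n/12)."""
    total = sum(rng.random() for _ in range(n))
    return (total - n * 0.5) / math.sqrt(n / 12.0)

rng = random.Random(0)
samples = [standardized_sum(50, rng) for _ in range(20000)]

# For a standard Gaussian, about 68% of the mass lies within one unit of zero.
within_one = sum(abs(z) <= 1.0 for z in samples) / len(samples)
print(within_one)  # expected to land near 0.68
```

The mean 1/2 and variance 1/12 used for centering and scaling are those of the Uniform(0, 1) distribution.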

Tivadar Danka

@TivadarDanka

1 Apr

Gradient descent sounds good on paper, but there is a big issue in practice.

For complex functions like training losses for neural networks, calculating the gradient is computationally very expensive.

What makes training feasible anyway? For one, stochastic gradient descent!

🧵 👇🏽

When you have a lot of data, calculating the gradient of the loss involves the computation of a large sum.

Think about it: if 𝑥ᵢ denotes the data and 𝑤 denotes the weights, the loss function takes the form below.

Not only do we have to add a bunch of numbers together, but we have to find the gradient of each loss term.

For example, if the model contains 10 000 parameters and 1 000 000 data points, we need to compute 10 000 x 1 000 000 = 10¹⁰ derivatives.

This can take a LOT of time.
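
A minimal sketch of the cost difference, assuming a one-parameter model and a loss that averages squared errors over the data. The data, learning rate, and batch size are invented for illustration. The full gradient needs one term per data point, while a mini-batch estimate touches only a small random subset.

```python
import random

def grad_term(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x

def full_gradient(w, data):
    """Average gradient over every data point: one term per sample."""
    return sum(grad_term(w, x, y) for x, y in data) / len(data)

def minibatch_gradient(w, data, batch_size, rng):
    """Average gradient over a random subset: a cheap, noisy estimate."""
    batch = rng.sample(data, batch_size)
    return sum(grad_term(w, x, y) for x, y in batch) / batch_size

rng = random.Random(0)
# Toy data following y = 3x exactly (made up for illustration).
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(10000))]

w = 0.0
for _ in range(200):
    # Each step uses 32 gradient terms instead of 10 000.
    w -= 0.5 * minibatch_gradient(w, data, 32, rng)
print(w)  # approaches the true slope 3.0
```

Each update is noisier than a full-gradient step, but it costs a small fraction of the computation.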

Tivadar Danka

@TivadarDanka

31 Mar

Gradient descent has a really simple and intuitive explanation.

The algorithm is easy to understand once you realize that it is basically hill climbing with a really simple strategy.

Let's see how it works!

🧵 👇🏽

For functions of one variable, the gradient is simply the derivative of the function.

The derivative expresses the slope of the function's tangent line, but it can also be viewed as a one-dimensional vector!

When the function is increasing, the derivative is positive. When decreasing, it is negative.

Translating this to the language of vectors, it means that the "gradient" points to the direction of the increase!

This is the key to understanding gradient descent.
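
Since the derivative points toward the increase, stepping against it moves downhill. A minimal sketch for one variable, where the function, starting point, and learning rate are arbitrary choices for illustration:

```python
def f(x):
    """A simple bowl-shaped function with its minimum at x = 3."""
    return (x - 3.0) ** 2

def df(x):
    """Derivative of f: positive where f increases, negative where it decreases."""
    return 2.0 * (x - 3.0)

x = 0.0
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * df(x)  # step against the derivative to go downhill
print(round(x, 4))  # 3.0
```

Flipping the sign of the update would turn this into hill climbing toward larger values of the function.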

Tivadar Danka

@TivadarDanka

27 Mar

Sigmoid is one of the most commonly used activation functions.

However, it has a serious weakness: sigmoids often make the gradient vanish.

This can leave the network stuck during training, so it effectively stops learning.

How can this happen?

🧵 👇🏽

Let's take a look at the Sigmoid function first.

Notice below that as 𝑥 tends to ∞ or -∞, the function flattens out.

This can happen for instance when the previous layer separates the classes very well, mapping them far away from each other in the feature space.

Why is this a problem?

Flatness means that the derivative is close to zero, as shown in the figure below.
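
The same flatness can be checked numerically. A short sketch using the standard identity that the sigmoid's derivative is sigmoid(𝑥) · (1 − sigmoid(𝑥)):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))
```

The derivative peaks at 0.25 for 𝑥 = 0 and is already below 0.0001 at 𝑥 = 10, so very little gradient passes through a saturated sigmoid.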

Tivadar Danka

@TivadarDanka

22 Mar

What if you want to optimize a function, but every evaluation costs you $100 and takes a day to execute?

Algorithms like gradient descent build on two key assumptions:

• the function is differentiable,
• and you can evaluate it on demand.

What if this is not the case?

🧵 👇🏽

For example, you want to tune the hyperparameters of a model that requires 24 hours of GPU time to train.

Can you find a good enough value under reasonable time and budget?

One method is the so-called Bayesian optimization.

Essentially, the method works as follows.

1️⃣ Model the expensive function with a Gaussian process.

Gaussian processes are easy to compute and offer a way to quantify uncertainty in the predictions.

Tivadar Danka

@TivadarDanka

15 Mar

Building a good training dataset is harder than you think.

For example, you can have millions of unlabelled data points, but only have the resources to label a thousand.

This is a story about a case that I used to encounter almost every day in my work.

🧵 👇🏽

Do you know how new drugs are developed?

Essentially, thousands of candidate molecules are tested to see if they have the targeted effect. First, the testing is done on cell cultures.

Sometimes, there is no better option than scanning through libraries of molecules.

After cells are treated with a given molecule (or molecules in some cases), the effects are studied by screening them with microscopy.

The treated cells can exhibit hundreds of different phenotypes (= classes), some of which might be very rare.
