What if you want to optimize a function, but every evaluation costs you $100 and takes a day to execute?
Algorithms like gradient descent build on two key assumptions:
• function is differentiable,
• and you can calculate it on demand.
What if this is not the case?
🧵 👇🏽
For example, you want to tune the hyperparameters of a model that requires 24 hours of GPU time to train.
Can you find a good enough value under reasonable time and budget?
One such method is Bayesian optimization.
Essentially, the method works as follows.
1️⃣ Model the expensive function with a Gaussian process.
Gaussian processes are easy to compute and offer a way to quantify uncertainty in the predictions.
2️⃣ Estimate the information gain of evaluating the function at each unknown value.
3️⃣ Evaluate the function where this gain is maximal. Go back to the first step until the budget is exhausted.
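To make the loop concrete, here is a minimal end-to-end sketch of the three steps. Everything in it is a toy assumption on my part: a cheap stand-in objective, a hand-rolled RBF-kernel Gaussian process, and the probability of improvement as the score. For a real problem you would reach for a library like scikit-optimize or BoTorch.

```python
import numpy as np
from math import erf, sqrt

def objective(x):
    # Cheap stand-in for the expensive function (true optimum at x = 2)
    return -(x - 2.0) ** 2

def rbf(a, b, length=1.0):
    # Squared-exponential kernel between two 1-D point sets
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Posterior mean and standard deviation of a zero-mean GP at points Xs
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v ** 2, axis=0)   # rbf(x, x) = 1 on the diagonal
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def prob_improvement(mu, sigma, f_best, xi=0.01):
    z = (mu - f_best - xi) / sigma
    return np.array([0.5 * (1.0 + erf(t / sqrt(2.0))) for t in z])

# 1️⃣ start with a few known evaluations and model f with a GP
X = np.array([0.0, 5.0])
y = objective(X)
candidates = np.linspace(-1.0, 6.0, 200)

for _ in range(10):                      # budget: 10 more evaluations
    mu, sigma = gp_posterior(X, y, candidates)
    # 2️⃣ score every candidate, 3️⃣ evaluate where the score is maximal
    pi = prob_improvement(mu, sigma, y.max())
    x_next = candidates[np.argmax(pi)]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(round(float(X[np.argmax(y)]), 2))  # best point found so far
```

On this toy 1-D function the loop homes in on the neighborhood of the true optimum within the first few queries; the interesting part is that it only ever touches the objective at the points the acquisition score selects.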
The process looks simple enough.
There are multiple ways to estimate the potential information gain, but we will take a look at the simplest: the probability of improvement.
The formula looks scary, so let me explain!

PI(𝑥) = ψ((μ(𝑥) − 𝑓(𝑥⁺) − ξ) / σ(𝑥))
First, μ(𝑥) and σ(𝑥) are the mean and standard deviation of the Gaussian process used to estimate our unknown function 𝑓.
Our currently known best optimum is at 𝑥⁺. We wish to improve this.
Finally, ψ is the cumulative distribution function of the standard Gaussian distribution.
How do they add up?
Let's start from the outside in. We want to maximize the probability of improvement.
For this, we estimate the improvement with what is on the inside, then put it into the CDF of a Gaussian distribution with mean 0 and variance 1.
(Note that from an optimization perspective, ψ can be omitted, since it is increasing. From a theoretical perspective, though, it is important.)
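A quick numeric check of that note, with made-up μ and σ values for three candidate points: since ψ is monotone increasing, the candidate that maximizes the inner score also maximizes ψ of it.

```python
import numpy as np
from math import erf, sqrt

mu = np.array([0.2, 0.8, 0.5])      # hypothetical GP means at three candidates
sigma = np.array([0.3, 0.4, 0.1])   # hypothetical GP standard deviations
f_best, xi = 0.4, 0.01              # current optimum and exploration bonus

z = (mu - f_best - xi) / sigma                            # the "inside" part
pi = np.array([0.5 * (1 + erf(t / sqrt(2))) for t in z])  # psi applied to it

# psi is monotone increasing, so the argmax is unchanged
print(np.argmax(z) == np.argmax(pi))  # → True
```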
On the inside, we calculate how much improvement we estimate at a given candidate point 𝑥, based on the Gaussian process modeling our knowledge.
μ(𝑥) is what we think the function looks like, 𝑓(𝑥⁺) is the current optimum.
So, μ(𝑥) - 𝑓(𝑥⁺) is the improvement.
Since our knowledge about 𝑓 is very incomplete, we have to take uncertainties into account.
This is where the σ(𝑥) in the denominator comes into play: it expresses the uncertainty.
The lower it is, the more we can trust that μ(𝑥) - 𝑓(𝑥⁺) estimates improvement well.
The last piece of the puzzle is the parameter ξ in the numerator.
It encourages the optimization to experiment and explore new areas where our Gaussian process doesn't necessarily predict an improvement.
ξ controls the tradeoff between exploration and exploitation.
Without ξ, the Bayesian optimization process would get stuck in the same area, exploiting a single local optimum.
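Here is a tiny demo of what ξ does, with made-up numbers: a "safe bet" candidate whose mean is just above the current best with low σ, versus a "long shot" in an uncertain region. With ξ = 0 the safe bet wins; with a larger ξ the long shot does.

```python
from math import erf, sqrt

def pi_score(mu, sigma, f_best, xi):
    # probability of improvement for a single candidate
    return 0.5 * (1 + erf((mu - f_best - xi) / sigma / sqrt(2)))

f_best = 1.0
safe = dict(mu=1.05, sigma=0.05)       # small, near-certain gain (exploit)
long_shot = dict(mu=0.90, sigma=0.60)  # uncertain region (explore)

for xi in (0.0, 0.3):
    # xi = 0.0 → the safe bet scores higher; xi = 0.3 → the long shot does
    print(xi, pi_score(**safe, f_best=f_best, xi=xi),
              pi_score(**long_shot, f_best=f_best, xi=xi))
```

Intuitively, a large ξ demands a big jump over the current best, and big jumps are only plausible where the uncertainty σ is high.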
This is suboptimal because our incomplete information based on a few known values of the function can introduce a severe bias to the process.
This is what a few queries look like.
On the top, you can see the function to be optimized and the Gaussian process. On the bottom, the probability of improvement is shown at each step.
Of course, these are just the very fundamentals of the topic. If you are interested in more, here is an awesome introductory article by Eric Brochu, Vlad M. Cora, and Nando de Freitas!
You ask me so often for free online resources about deep learning that I decided to collect my favorite courses!
These topics interest you the most:
🟩 practical deep learning,
🟩 deep learning theory,
🟩 math resources to understand the two above.
Let's see them!
🧵 👇🏽
1️⃣ Practical deep learning.
If you want to take a deep dive straight into the field and want to start training your models right away, hands down the best course for you out there is Practical Deep Learning for Coders by fast.ai. (course.fast.ai)
To move beyond training models and learn about tooling and infrastructure, IMO the best course for you is the Full Stack Deep Learning course by @full_stack_dl.