An exciting result came out of @GoogleAI recently, raising several questions about how deep neural network architectures should really be.

Here is their announcement, which includes a very interesting blog post. Let me unpack it a bit.

Suppose that you have a trained network and a set of samples 𝑋. You run this data through the network, storing all intermediate results.

The output of the 𝑖-th layer is denoted by 𝑋ᵢ. These matrices encode the network's internal representations of the data.
In general, the deeper you go, the higher-level these representations become.
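The setup above — running a batch of samples through a trained network and keeping every intermediate output — can be sketched in a few lines. This is a minimal NumPy toy: a randomly initialized ReLU MLP stands in for the trained network, and all the sizes here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 4-layer ReLU MLP with random weights, standing in for a trained
# network. (Hypothetical sizes; in practice you would use a real model.)
sizes = [16, 32, 32, 32, 8]
weights = [rng.standard_normal((m, n)) / np.sqrt(m)
           for m, n in zip(sizes, sizes[1:])]

def forward_with_activations(X, weights):
    """Run the batch X through the network, storing every X_i."""
    activations = []
    for W in weights:
        X = np.maximum(X @ W, 0.0)  # ReLU after each linear layer
        activations.append(X)       # X_i: representation after layer i
    return activations

X = rng.standard_normal((100, sizes[0]))  # a batch of 100 samples
reps = forward_with_activations(X, weights)
# reps[i] is the (100, width_i) matrix of i-th layer representations
```

In a real framework you would capture these with hooks (e.g. PyTorch's forward hooks) rather than writing the forward pass by hand.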

For a convolutional network, filters in the early layers detect edges, while activations in the later layers represent whole objects.

Check the fantastic article below for more details!

distill.pub/2017/feature-v…
Each layer should learn something new then, shouldn't it?

According to the latest results by Google AI (referenced in the first tweet), this is not the case!
There are quantitative measures that can tell whether the representations 𝑋ᵢ and 𝑋ⱼ are similar. In this particular paper, Centered Kernel Alignment (CKA) was used. (CKA paper: arxiv.org/pdf/1905.00414…)
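To make this concrete, here is a minimal sketch of the linear variant of CKA (the kernel-based definition in the CKA paper reduces to this for a linear kernel). It assumes the representations are stored as (samples × features) matrices, with rows aligned across the two layers:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X, Y: arrays of shape (n_samples, n_features), where row k of X
    and row k of Y are the representations of the same input sample.
    """
    X = X - X.mean(axis=0)  # center every feature
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is what makes it suitable for comparing layers of different widths.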
It turns out that the representations 𝑋ᵢ of many consecutive layers are often similar. Visualizing the pairwise CKA similarities as a heatmap, a block structure emerges.

This suggests that large sections of the network contribute very little: whole blocks of consecutive layers barely change the representation.

(Image source: arxiv.org/pdf/2010.15327…)
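Concretely, such a heatmap is just the matrix of pairwise CKA scores between all layers. Here is a self-contained toy sketch (random ReLU network with made-up sizes, and the linear variant of CKA), purely to illustrate the computation — a randomly initialized net won't reproduce the block structure seen in trained models:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_cka(X, Y):
    """Linear CKA between (samples x features) representation matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Toy network: collect the representation X_i after every layer.
sizes = [16, 32, 32, 32, 8]
weights = [rng.standard_normal((m, n)) / np.sqrt(m)
           for m, n in zip(sizes, sizes[1:])]
X = rng.standard_normal((200, sizes[0]))
reps = []
for W in weights:
    X = np.maximum(X @ W, 0.0)
    reps.append(X)

# The heatmap: CKA between every pair of layers.
n = len(reps)
heatmap = np.array([[linear_cka(reps[i], reps[j]) for j in range(n)]
                    for i in range(n)])
# Blocks of consecutive layers with values close to 1 are the "block
# structure" from the paper; plot with e.g. matplotlib's imshow.
```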
This observation raises several interesting questions, such as:

• Is the emergence of block structure dependent on the dataset?
• Can redundant layers be removed?

These and many more are answered in the aforementioned paper by Thao Nguyen, Maithra Raghu, and Simon Kornblith.
IMO this is an extremely important research area. Since the size of networks can be absolutely crazy (see GPT-3), reducing them can make a significant impact in the long run, both on the applicability of the technology and on its environmental footprint.

• • •

Thread by Tivadar Danka (@TivadarDanka)

