Santiago
Jan 16, 12 tweets, 2 min read
Using more features from your data never comes for free.

Let's talk about dimensionality.

2. Two days ago I asked this question.

Let's now analyze each option, starting with Option 3 (probably the easiest one to discard).
3. Option 3 states that when we cut down the number of features, we need to "make up the difference" by adding more data.

Removing features reduces the number of dimensions in our data.

It concentrates the samples we have in a lower-dimensional space.
4. We can't replace the information provided by a feature with more data.

Removing a feature might make it harder for an algorithm to learn from our data, but adding more samples won't necessarily solve that.

Option 3 is not a valid answer.
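To see why more data can't make up for a missing feature, here's a minimal sketch with hypothetical synthetic data (the setup is mine, not from the thread): the label depends only on a second feature `x2`, so a model that only sees `x1` stays at chance accuracy no matter how many samples we add.

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy_without_x2(n):
    # The label depends only on x2, the feature we removed.
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = (x2 > 0).astype(int)
    # Best simple rule using x1 alone: threshold at 0.
    # x1 carries no information about y, so this hovers near 50%.
    preds = (x1 > 0).astype(int)
    return (preds == y).mean()

for n in (100, 10_000, 1_000_000):
    print(n, round(accuracy_without_x2(n), 3))
```

Adding samples only tightens the estimate around 50%; it never recovers the lost signal.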
5. There are three choices left, and we can find the correct answer using the same insight.

Let's do a thought experiment: imagine graphing a set of numbers.

Since you have only one dimension, they will all lie somewhere on a line.
6. Don't add any new values, but increase the features by adding a second dimension.

Now your values become a set of 2D coordinates (x, y).

If you graph them, they will all be somewhere in a plane.
7. If you compare the 1D line with the 2D plane (or even a 3D space, assuming you add a third dimension), something quickly becomes apparent:

As we increase the dimensionality of the data, it becomes harder and harder to fill the space with the same number of points.
8. This increase in sparsity will make it much harder for the learning algorithm to find any interesting patterns.

With too many dimensions and too few samples, how could an algorithm ever separate the data?
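One quick way to see this sparsity numerically (a hypothetical sketch of mine, not from the thread): keep the number of points fixed, raise the dimensionality, and watch the average distance to the nearest neighbor grow.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_nn_distance(n_points, n_dims):
    # Points spread uniformly in the unit hypercube [0, 1]^d.
    points = rng.random((n_points, n_dims))
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    return dists.min(axis=1).mean()

for d in (1, 2, 10, 100):
    print(d, round(mean_nn_distance(200, d), 3))
```

The same 200 points sit practically on top of each other in 1D and end up far apart in 100D, with empty space everywhere in between.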
9. Based on this, we are ready to make two statements:

1. There's a relationship between features and samples.

2. The more features we add, the more samples we need.
10. Option 4 is not correct because it violates our first statement above. Option 2 states the opposite of the second statement, so it is also not correct.

Option 1 is the correct solution to this question.
11. For a more formal definition, look at the Curse of Dimensionality:

The amount of data needed to extract any relevant information increases exponentially with the number of features in your dataset.
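Here's a back-of-the-envelope sketch of that exponential growth (a hypothetical setup of mine, not from the thread): split each feature's range into 10 bins and measure what fraction of the resulting grid cells a fixed sample actually touches.

```python
import numpy as np

rng = np.random.default_rng(0)

def occupied_fraction(n_samples, n_dims, bins=10):
    # Uniform samples in [0, 1)^d, discretized into a grid of bins**d cells.
    points = rng.random((n_samples, n_dims))
    cells = {tuple(c) for c in (points * bins).astype(int)}
    return len(cells) / bins ** n_dims

for d in (1, 3, 5, 10):
    print(d, occupied_fraction(1000, d))
```

With 1,000 samples, every cell is covered in 1D, but in 10D the same sample touches a vanishingly small fraction of the 10^10 cells; keeping coverage constant requires exponentially more data.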
Are you into machine learning?

I write practical tips, break down complex concepts, and regularly publish short quizzes to keep you on your toes.

Follow me @svpino and let's do this together!
