Using more features from your data never comes for free.
Let's talk about dimensionality.
↓
2. Two days ago I asked this question.
Let's now analyze each option, starting with Option 3 (probably the easiest one to discard).
3. Option 3 states that when we cut down the number of features, we need to "make up the difference" by adding more data.
Removing features reduces the number of dimensions in our data, concentrating the samples we have in a lower-dimensional space.
4. We can't replace the information provided by a feature with more data.
Removing a feature might make it harder for an algorithm to learn from our data, but adding more samples won't necessarily solve that.
Option 3 is not a valid answer.
5. There are three choices left, and we can find the correct answer using the same insight.
Let's run a thought experiment: imagine graphing a set of numbers.
Since you have only one dimension, they will all lie somewhere on a line.
6. Don't add any new values, but increase the number of features by adding a second dimension.
Now your values become a set of 2D coordinates (x, y).
If you graph them, they will all be somewhere in a plane.
7. If you compare the 1D line with the 2D plane (or even a 3D space, assuming you add a third dimension), something will quickly become apparent:
As we increase the dimensionality of the data, it becomes harder and harder to fill the space with the same number of points.
8. This increase in sparsity will make it much harder for the learning algorithm to find any interesting patterns.
With too many dimensions and too few samples, how can we expect to separate the data?
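To see this sparsity in action, here's a minimal sketch (a NumPy simulation I'm adding for illustration; the sample counts and dimensions are arbitrary) that measures how far apart a fixed set of random points ends up as we add dimensions:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_nearest_neighbor_distance(n_samples, n_dims):
    """Average distance from each point to its nearest neighbor,
    for points drawn uniformly from the unit hypercube."""
    points = rng.uniform(size=(n_samples, n_dims))
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(distances, np.inf)  # ignore distance to self
    return distances.min(axis=1).mean()

# The same 100 samples, spread over more and more dimensions.
for d in (1, 2, 3, 10, 100):
    print(f"{d:>3} dimensions: {mean_nearest_neighbor_distance(100, d):.3f}")
```

With the same 100 samples, neighbors that were practically touching on the line end up far apart in high-dimensional space. That emptiness is what starves the learning algorithm of patterns.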
9. Based on this, we are ready to make two statements:
1. There's a relationship between features and samples.
2. The more features we add, the more samples we need.
10. Option 4 is not correct because it violates our first statement above. Option 2 states the opposite of the second statement, so it is also not correct.
Option 1 is the correct solution to this question.
11. For a more formal definition, look at the Curse of Dimensionality:
The amount of data needed to extract any relevant information increases exponentially with the number of features in your dataset.
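A quick back-of-the-envelope calculation shows what "exponentially" means here (the 10 bins per feature are an arbitrary choice for illustration):

```python
# Discretize each feature into 10 bins and require, on average,
# at least one sample per bin. The number of bins (and the samples
# needed to cover them) grows exponentially with the feature count.
bins_per_feature = 10

for n_features in (1, 2, 3, 6, 9):
    print(f"{n_features} features -> {bins_per_feature ** n_features:,} samples")
```

Nine features already demand a billion samples just to put one in every cell.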
Are you into machine learning?
I write practical tips, break down complex concepts, and regularly publish short quizzes to keep you on your toes.
The complexity of turning a Jupyter notebook into a production system is frequently underestimated.
Having a model that performs great on a test set is not the end of the road but just the beginning.
Fortunately, there's something for you here!
↓
2. The productionization of machine learning systems is one of the most critical topics in the industry today.
There's been a lot of progress, and it's getting better, but for the most part, we are just at the beginning of this road.
3. Not only is the space still immature, but it's also very fragmented.
Talk to three different teams, and it's very likely they all use different tools, follow different processes, and focus on different aspects of the lifecycle of their systems.
Many machine learning courses that target developers want you to start with algebra, calculus, probability, and ML theory, and only then, if you haven't quit already, you may get to see some code.
I want you to know there's another way.
↓
2. For me, there's no substitute for seeing things work: trying them out myself, hitting a wall, fixing them, and seeing the results.
A hands-on approach engages me in a way pages of theory never will.
And I know many of you reading this are wired just like me.
3. I feel that driving a car is a good analogy.
While understanding some basics is necessary to start driving, you don't need to read the entire manual before jumping behind the wheel.
As long as you practice in empty parking lots and on back roads, you'll be fine.