Topic -- Outlier Detection & Removal using Z-score Method
A Thread 🧵
The Z-score method is a statistical approach for detecting & removing outliers in a dataset. An outlier is an observation that lies far away from the other observations in the dataset. Such observations can significantly affect the statistical properties of the dataset & lead to erroneous conclusions
Approach for Outliers
- The very first step is to set the upper and lower limits; any value outside them is treated as an outlier
- The first technique for dealing with outliers is trimming, which simply removes the points outside the limits. Regardless of the kind of data distribution you are working with, trimming is an applicable and proven technique in most cases
- Capping is another technique for dealing with bad data points: instead of removing the outliers, it replaces them with the nearest limit. It is useful when we have many outliers and removing a large amount of data from the dataset is not desirable
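The three steps above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the threshold of 3 standard deviations (`k=3.0`) is a common convention assumed here rather than something the thread specifies, and the function names and sample data are hypothetical. A value's z-score is (value - mean) / standard deviation, so limits of mean ± k·std are the same as keeping |z| <= k.

```python
from statistics import mean, pstdev


def z_score_limits(values, k=3.0):
    """Return (lower, upper) limits as mean -/+ k standard deviations.

    k=3.0 is an assumed, conventional threshold; tune it for your data.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return mu - k * sigma, mu + k * sigma


def trim(values, lower, upper):
    """Trimming: drop every value that falls outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def cap(values, lower, upper):
    """Capping: clamp values outside the limits to the nearest limit."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical sample: twenty ordinary readings plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 11, 10, 13, 12, 11, 12, 10, 13, 11, 100]

lower, upper = z_score_limits(data)
trimmed = trim(data, lower, upper)   # the extreme reading is removed
capped = cap(data, lower, upper)     # the extreme reading is pulled to `upper`
```

Trimming shortens the dataset, while capping keeps its length and only changes the extreme values, which is why capping is the gentler choice when many points fall outside the limits.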
Limitations of Z-Score
The Z-score method only works with data that is normally distributed or close to it, which in turn implies that it is not suited to skewed data, whether left-skewed or right-skewed. For such data, we have something known as the interquartile range (IQR) method
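For comparison, here is a rough sketch of the IQR method mentioned above. It assumes the widely used 1.5 × IQR fences, which the thread itself does not spell out; the function name and the sample data are hypothetical. Because it relies on quartiles rather than the mean and standard deviation, it does not assume a normal distribution.

```python
from statistics import quantiles


def iqr_limits(values, k=1.5):
    """Return (lower, upper) fences as Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is an assumed, conventional multiplier.
    """
    q1, _median, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical right-skewed sample with one extreme value.
skewed = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 40]

lower, upper = iqr_limits(skewed)
outliers = [v for v in skewed if v < lower or v > upper]
```

The same trimming or capping step can then be applied with these fences in place of the Z-score limits.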
It is important to note that outlier removal can significantly affect the statistical properties of the dataset, and should be done with caution.
🎯Are NULL values the same as zero or a blank space❓
🔺A NULL value is not at all the same as zero or a blank space.
🔺A NULL value represents a value that is unavailable, unknown, unassigned or not applicable, whereas zero is a number and a blank space is a character.
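The difference can be checked with a quick query. The sketch below uses Python's built-in `sqlite3` module purely as a convenient stand-in; the choice of SQLite is an assumption, and details of NULL handling can differ between database systems.

```python
import sqlite3

# In-memory SQLite database, used here only to evaluate a few expressions.
conn = sqlite3.connect(":memory:")
row = conn.execute(
    "SELECT NULL IS NULL, 0 IS NULL, ' ' IS NULL, NULL = 0, NULL = ' '"
).fetchone()
conn.close()

null_is_null, zero_is_null, blank_is_null, null_eq_zero, null_eq_blank = row

# IS NULL is true only for NULL itself, not for zero or a blank space.
# Comparing NULL with anything using "=" yields NULL (None in Python),
# which is neither true nor false.
```

This is why NULL checks are written with `IS NULL` rather than `= NULL`.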
🎯What is the usage of the NVL() function❓
🔹Answer
🔺You may use the NVL() function to replace NULL values with a default value.
🔺The function returns the value of the second parameter if the first parameter is NULL.
🔺If the first parameter is anything other than NULL, it is returned unchanged.
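To make the behaviour concrete, here is a tiny Python emulation of the semantics described above. It is only an illustration with a hypothetical helper named `nvl`, using Python's `None` as a stand-in for NULL; it is not how any database implements the function.

```python
def nvl(value, default):
    """Return `default` when `value` is None (our stand-in for NULL),
    otherwise return `value` unchanged."""
    return default if value is None else value


# Hypothetical bonus column with one missing entry and one genuine zero.
bonuses = [250, None, 0, 75]
filled = [nvl(b, 50) for b in bonuses]
# The missing entry becomes 50; the zero stays zero because it is not NULL.
```

Note that the zero is left alone, which ties back to the earlier point that NULL and zero are different things.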
Topic - Handling Missing Values in Feature Engineering 👨💻
A Thread 🧵
Handling missing values is very important, as many machine learning algorithms do not support data with missing values. If you have missing values in the dataset, they can cause errors and poor performance with some machine learning algorithms.
Variable deletion involves dropping variables (columns) with missing values on a case-by-case basis. This method makes sense when a variable has a lot of missing values and is of relatively low importance.
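A minimal sketch of variable deletion, assuming a table stored as a dict of column lists with `None` marking a missing value. The 50% cutoff, the function name, and the sample columns are all hypothetical; in practice the decision should also weigh how important each variable is, which a threshold alone cannot capture.

```python
def drop_sparse_columns(table, max_missing=0.5):
    """Keep only columns whose fraction of missing (None) values
    is at most `max_missing` (an assumed cutoff)."""
    kept = {}
    for name, column in table.items():
        if not column:
            continue  # skip empty columns to avoid dividing by zero
        missing_fraction = sum(v is None for v in column) / len(column)
        if missing_fraction <= max_missing:
            kept[name] = column
    return kept


# Hypothetical dataset: one column is mostly missing.
table = {
    "age": [34, 29, None, 41, 38],
    "income": [52000, 48000, 61000, None, 57000],
    "fax_number": [None, None, None, "555-0100", None],
}

cleaned = drop_sparse_columns(table)  # "fax_number" is 80% missing and is dropped
```

Columns with only a few gaps are kept, so their remaining missing values still need handling by some other technique.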