I’m not saying you need to be an expert in advanced calculus to do machine learning…
BUT, there is a big difference between someone who has a good foundation in stats and someone who does NOT when it comes to getting and explaining business results.
My thought process back in the day was to obtain a great foundation in stats and machine learning at the same time.
So here’s what helped me. I read a ton of books.
Here are the 3 books that helped me learn data science the most...
1. R for Data Science (Wickham & Grolemund) r4ds.had.co.nz
Correlation is the skill that has single-handedly benefited me the most in my career.
In 3 minutes I'll demolish your confusion (and share strengths and weaknesses you might be missing).
Let's go:
1. Correlation:
Correlation is a statistical measure that describes the extent to which two variables change together. It can indicate whether and how strongly pairs of variables are related.
2. Types of correlation:
Several types of correlation are used in statistics to measure the strength and direction of the relationship between variables. The three most common types are Pearson, Spearman Rank, and Kendall's Tau. We'll focus on Pearson since that is what I use 95% of the time.
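To make the Pearson definition concrete, here is a minimal sketch in plain Python. The function name `pearson_r` and the ad-spend and sales numbers are made up for illustration; in practice you would likely reach for a library routine instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: weekly ad spend vs. weekly sales.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson_r(ad_spend, sales)
print(f"Pearson r = {r:.3f}")
```

The result always falls between -1 and 1: the sign gives the direction of the linear relationship, and the magnitude gives its strength.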
Understanding probability is essential in data science.
In 4 minutes, I'll demolish your confusion.
Let's go!
1. Statistical Distributions:
There are hundreds of distributions to choose from when modeling data, and the choices can seem endless. Use this as a guide to simplify the choice.
2. Discrete Distributions:
Discrete distributions are used when the data can take on only specific, distinct values. These values are often integers, like the number of sales calls made or the number of customers who converted.
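As a rough illustration of count data like this, the sketch below evaluates two common discrete distributions, the binomial and the Poisson, using only the standard library. Picking these two is my assumption for the example, and the scenarios (10 calls with a 20% conversion rate, 3 leads per hour) are hypothetical numbers.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each succeeding with probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(exactly k events in an interval that averages `rate` events)."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# Hypothetical scenario: 10 sales calls, each converting with probability 0.2.
p_two_conversions = binomial_pmf(2, 10, 0.2)

# Hypothetical scenario: an average of 3 inbound leads per hour.
p_no_leads = poisson_pmf(0, 3)

print(f"P(2 conversions in 10 calls) = {p_two_conversions:.3f}")
print(f"P(0 leads in an hour)        = {p_no_leads:.3f}")
```

The binomial suits a fixed number of yes/no trials (conversions out of N calls), while the Poisson suits open-ended counts per unit of time.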
R-squared is one of the most commonly used metrics to measure performance.
But it took me 2 years to figure out mistakes that were killing my regression models.
In 2 minutes, I'll share how I fixed 2 years of mistakes (and made 50% more accurate models than my peers). Let's go:
1. R-squared (R2):
R2 is a statistical measure used in regression models that describes how well the model reproduces the observed outcomes. It is the proportion of the total variation in the outcomes that the model explains.
2. Range (0 to 1):
R2 typically ranges from 0 to 1 (it can fall below 0 when a model fits worse than simply predicting the mean, for example on held-out data). A higher R2 value indicates a better fit between the predictions and the actual data. For example, an R2 value of 0.70 suggests that 70% of the variance in the dependent variable is explained by the independent variable(s).
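Here is a minimal sketch of that calculation, using the standard definition R2 = 1 - SS_res / SS_tot. The helper name `r_squared` and the actual and predicted values are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Variation the model fails to explain.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total variation of the outcomes around their mean.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Hypothetical actual values and predictions from some regression model.
actual = [3, 5, 7, 9]
predicted = [2.5, 5.5, 7.5, 8.5]

r2 = r_squared(actual, predicted)
print(f"R2 = {r2:.2f}")
```

With this formula, predictions that are worse than always guessing the mean give a value below 0, which is a useful warning sign when scoring a model on data it was not trained on.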