Day 2 of #30DaysOfKaggle:
Intermediate Machine Learning from Kaggle Learn.
Dealing with missing values.
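A minimal sketch of the two basic approaches that lesson covers, using a toy pandas DataFrame (the column names are just illustrative, not from any real dataset):

```python
import pandas as pd

# Toy data with a missing value (column names are hypothetical)
df = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                   "GarageCars": [2, 1, None]})

# Option 1: drop any column that contains missing values
dropped = df.dropna(axis=1)

# Option 2: fill missing values with each column's mean
filled = df.fillna(df.mean())
```

Dropping is simplest but throws away a whole feature; imputation keeps the column at the cost of injecting an estimated value.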
Day 3 of #30DaysOfKaggle:
Completed the remaining lessons in Intermediate Machine Learning. Advanced to Rank #764 in the competition (from Rank #7250 on Day 1). Now I feel so good about this.
With a few hours remaining, I'll take SQL next.
@BecomingDataSci
Day 4 of #30DaysOfKaggle:
Completed Intro to SQL from Kaggle Learn. I underestimated SQL; I'll have to give it more time later to learn about queries.
@kaggle @rctatman @BecomingDataSci
Day 5 of #30DaysOfKaggle:
Brushed up my Pandas by completing Pandas Micro-Course from @kaggle Learn (thanks, @ResidentMario!).
Day 6 of #30DaysOfKaggle:
Data Visualization Micro-Course from @kaggle Learn (thanks @alexis_b_cook !).
Well, I guess brushing up is done. Tomorrow I will learn about Kaggle competitions and choose one dataset for practice.
@BecomingDataSci
Day 7 of #30DaysOfKaggle:
Learned about Kaggle competitions and started practicing based on this very good article by @koehrsen_will :
blog.kaggle.com/2018/08/22/mac…
@BecomingDataSci
Day 8 of #30DaysOfKaggle:
Continued working on the Home Credit Default Risk dataset (it's tough, but I learned a lot about EDA from @koehrsen_will's work).
Explored the Discussion forum; found that the Data Science community is very supportive of newbies. Love it!
@BecomingDataSci @kaggle
Day 9 of #30DaysOfKaggle:
Completed practice w/ Home Credit Default Risk dataset.
Explored the Discussion forum, especially the Getting Started section. Lots of valuable advice there, and it directed me to this interesting article: linkedin.com/pulse/12-thing…
@BecomingDataSci @kaggle
#30DaysOfKaggle
Preparation stage is done. The remaining days will be for the real competition, that'll be tough. I may not get an excellent score by the end of 30 days, but I am very EXCITED because I believe I will learn a lot during the process.
@BecomingDataSci
Day 10 of #30DaysOfKaggle:
Browsed through the list of active competitions, reviewed each one, and decided to join the House Prices competition, considering my learning objectives and timeframe.
Built the framework for this task.
Learned about each feature in the dataset.
Day 11 of #30DaysOfKaggle:
EDA. EDA. EDA. This will take days, I believe.
Following the framework shared by @pmpmarcelino, I completed the first 2 steps:
Assessed each feature's importance.
Performed univariable study.
@BecomingDataSci @kaggle
kaggle.com/pmarcelino/com…
Day 12 of #30DaysOfKaggle:
Still in EDA/cleaning phase.
Completed multivariate analysis.
Almost completed dealing with missing values (the server broke down just before I wrote the imputation code block, which is the final step). Maybe it's time for me to get some early sleep.
Day 13 of #30DaysOfKaggle:
As predicted, still in EDA phase. I wanted to move fast, build a model and tune the parameters, but I felt a need to really understand the data first.
Another thing: I have to learn more about data visualization coding.
Day 14 of #30DaysOfKaggle:
EDA is done, at least for now.
I will take a course specialized in EDA soon. There are many techniques I don't know, along with their visualization code. Can anyone point me to good resources to learn from?
@BecomingDataSci
Day 15 of #30DaysOfKaggle:
Started building a model.
Could not get the shape right.
Stuck.
Calling it a day; will try again over the weekend before looking up a tutorial (hopefully I won't have to).
@StackOverflow will help.
@BecomingDataSci
Day 16 of #30DaysOfKaggle:
Reworked the dataset and got the shape right.
Built the first model and submitted the prediction as my first entry, to see where I would be with simple modeling.
Ranked #2960/4281. I still have 14 days to improve.
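A sketch of what a simple first entry like this might look like: fit a basic model, predict on the test set, and write the CSV that Kaggle expects. Here I use tiny synthetic data and a random forest purely for illustration; the feature names and model choice are assumptions, not the actual notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the competition data (a real run would load train.csv/test.csv)
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"GrLivArea": rng.integers(800, 3000, 50),
                        "OverallQual": rng.integers(1, 10, 50)})
y_train = X_train["GrLivArea"] * 100 + X_train["OverallQual"] * 5000

# Fit a simple baseline model
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict on a held-out slice standing in for the test set
X_test = X_train.iloc[:10]
preds = model.predict(X_test)

# Kaggle submissions are a CSV with an Id column and the predicted target
submission = pd.DataFrame({"Id": X_test.index, "SalePrice": preds})
submission.to_csv("submission.csv", index=False)
```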
Day 17 of #30DaysOfKaggle:
Performed feature engineering to improve accuracy and performance.
Failed.
Found the error, finally.
Stepped back and looked at this work from afar, aware that I have to do this more systematically. Next, I will plan ahead before jumping into writing code.
Day 18 of #30DaysOfKaggle:
While trying to correct the error I found before, I discovered that my data cleaning has a significant mistake that must be corrected. So, more rework. It's still progress, though: I get to know the data better, and it improves data quality.
Day 19 of #30DaysOfKaggle:
Data cleaning rework done.
Experiment: included only 11 features (out of 77) which I deemed very important based on EDA.
Result: commit time improved from 293s to 42s, but the score worsened from 0.156 to 0.185.
Current rank: #3065/4455.
Day 20 of #30DaysOfKaggle:
Reworked the imputation step, ensuring each feature is treated appropriately.
Dropped only 3 features.
Result: score 0.157, only a bit worse than my best (0.156), but much better performance (from 293s to 53s).
Current rank: #3046/4418.
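Treating each feature appropriately during imputation might look something like this sketch. The column names are illustrative House Prices-style features, and the "NA means no pool" reading follows the competition's data description; the specific fill choices here are assumptions, not the actual notebook:

```python
import pandas as pd

# Toy slice of House Prices-style features (names are illustrative)
df = pd.DataFrame({"PoolQC": [None, "Gd", None],
                   "LotFrontage": [65.0, None, 80.0],
                   "Electrical": ["SBrkr", None, "SBrkr"]})

# NA in PoolQC means "no pool" per the data description, not a missing record
df["PoolQC"] = df["PoolQC"].fillna("None")

# Numeric feature: fill with the median, which is robust to outliers
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

# Genuinely missing categorical value: fill with the most frequent category
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])
```

The point is that a single blanket strategy (drop everything, or mean-fill everything) ignores what each NA actually means.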
Day 21 of #30DaysOfKaggle:
Read the original paper on the dataset. I should have read this before. It gave me more perspective, and also made me more appreciative of how people contribute to the learning community by sharing their datasets.
@kaggle
Day 22 of #30DaysOfKaggle:
Added Feature Engineering steps.
Log transformed skewed features to improve normality.
Thanks to @apapiu for the guidance.
Result: RMSE score improved from 0.156 to 0.126.
Rank: advanced 1486 places to #1559/4400!!
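A minimal sketch of log-transforming skewed features, in the spirit of the approach popularized in @apapiu's kernel. The threshold of 0.75 and the toy column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy numeric frame; LotArea has one extreme value to create right skew
df = pd.DataFrame({"LotArea": [8450, 9600, 11250, 215245],
                   "OverallQual": [7, 6, 7, 8]})

# Find features whose skewness exceeds a threshold (0.75 is a common choice)
skewness = df.skew()
skewed = skewness[skewness > 0.75].index

# log1p handles zeros safely and pulls in the long right tail
df[skewed] = np.log1p(df[skewed])
```

Linear models assume roughly normal residuals, so compressing heavy right tails this way often improves RMSE on log-priced targets.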
Day 23 of #30DaysOfKaggle:
Improved feature engineering.
Played around with different regression models.
Result: score slightly improved to 0.123
Rank: advanced to #1375/4391.
It could be better. I should learn about Linear Regression again.
Day 24 of #30DaysOfKaggle:
Relearned Linear Regression and explored the @scikit_learn documentation.
Found and watched awesome YouTube videos on that topic (and others) by @joshuastarmer. The concepts became easy to grasp.
Check out this link:
statquest.org
Day 25 of #30DaysOfKaggle:
Learned about Gradient Boosting, again started from @joshuastarmer's video.
Revisited my notebook and played with Ridge and Lasso parameters.
Resubmitted, failed to get better accuracy.
Tried XGBoost, also failed.
Call it a day.
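Playing with Ridge/Lasso parameters usually means tuning the regularization strength alpha. A sketch with cross-validated grid search on synthetic data (the alpha grid and data are assumptions, not the notebook's actual values):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the prepared feature matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Cross-validated search over regularization strengths, scored by RMSE
search = GridSearchCV(Ridge(),
                      {"alpha": [0.01, 0.1, 1, 10, 100]},
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

The same pattern works for Lasso by swapping the estimator; higher alpha shrinks coefficients harder, with Lasso zeroing some out entirely.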
Day 26 of #30DaysOfKaggle:
Learned about ensemble methods and used one.
Result: score improved from 0.1233 to 0.1161.
Rank: advanced 733 places to #677/4441! Hooray!
I'll stick to this model for now and will try to optimize it.
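The simplest form of ensembling is blending: average the predictions of two different models so their errors partially cancel. A sketch on synthetic data (the Lasso + gradient boosting pairing and the 50/50 weights are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

# Synthetic data with both a linear and a nonlinear component
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 3 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# Fit two models with different biases
lasso = Lasso(alpha=0.001).fit(X_train, y_train)
gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Simple blend: average the two models' predictions
blend = 0.5 * lasso.predict(X_test) + 0.5 * gbr.predict(X_test)
```

A linear model captures the additive structure while the boosted trees pick up nonlinearity, which is why the blend can beat either model alone.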
Day 27 of #30DaysOfKaggle:
Tweaked parameters and delivered multiple commits; none could improve the score.
Is it time to clean up the notebook and wrap up this whole exercise?
Day 28 of #30DaysOfKaggle:
Started cleaning the notebook: removing unused code, reorganizing the flow, adding narrative and comments.
Not completed.
It took longer than I thought. I will continue tomorrow.