Given a passenger's information, how can we predict whether he/she survived the Titanic disaster?
2. Analytical Approach:
Our target variable is categorical [survived / not survived], and hence we need classification models for this task.
3, 4. Data Requirements & Data Collection:
[Combined these two steps together as the datasets are given on Kaggle]
We are given two datasets: one for training our model, and one for testing whether the model can predict survival from passenger observations that do not include the survival label.
5. Data Understanding
This step is part of Exploratory Data Analysis
The shape of the datasets
- Training set (891,12)
- Test set (418,11)
In total there are 12 columns in the training set (11 features plus the Survived target) and 11 features in the test set.
Features with missing values
- Cabin
- Age
- Embarked
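A minimal sketch of how the shape and missing-value checks above could be done with pandas. The tiny DataFrame is a made-up stand-in for the Kaggle training file (its values are illustrative, not the real data), so the numbers it prints will not match (891,12).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the Kaggle training set.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Shape is (number of rows, number of columns).
print(train.shape)

# Count missing values per column and keep only columns that have any.
missing_counts = train.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

With the real data, the DataFrame would typically come from pd.read_csv on the downloaded training file instead of being built by hand.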
Statistical Information of the training dataset
Finding out the relationship of the predictor variables with the target variable:
- Pclass = 1: more likely to survive
- Sex = female: more likely to survive
- Age 15-25: most did not survive
- Higher fare: better survival rate
- Port of embarkation correlates with survival rate
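One common way to check relationships like these is to compute the survival rate per group. The sketch below uses a small made-up sample (not the real Titanic rows), and survival_rate_by is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

# Made-up sample rows; the real training set has 891 passengers.
train = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
    "Sex": ["female", "female", "male", "male", "female", "male", "male", "male"],
    "Pclass": [1, 2, 3, 3, 1, 2, 3, 1],
})

def survival_rate_by(frame, column):
    """Mean of the 0/1 Survived column per group, i.e. the survival rate."""
    return frame.groupby(column)["Survived"].mean()

rate_by_sex = survival_rate_by(train, "Sex")
rate_by_class = survival_rate_by(train, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

Because Survived is coded as 0/1, the group mean is the fraction of survivors in that group, which makes the comparison across Sex or Pclass direct.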
6. Data Preparation
Cleaning steps based on analysis:
- Impute the missing Age values
- Turn age into an ordinal feature
- Impute missing Embarked values
- drop Cabin [too many missing values]
- drop Ticket [many duplicates]
- drop PassengerId, Name, SibSp, Parch [not helpful]
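The cleaning steps above could be sketched roughly as follows. The thread does not say which imputation strategy or age bins were used, so the median for Age, the most frequent value for Embarked, and the bin edges are all assumptions, and the sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the training set's columns.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 2],
    "Name": ["A", "B", "C", "D", "E"],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 54.0, 4.0],
    "SibSp": [1, 1, 0, 0, 1],
    "Parch": [0, 0, 0, 0, 1],
    "Ticket": ["T1", "T2", "T3", "T4", "T5"],
    "Fare": [7.25, 71.28, 7.92, 51.86, 16.70],
    "Cabin": [np.nan, "C85", np.nan, "E46", np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

def prepare(frame):
    """Apply the cleaning steps to a copy of the input frame."""
    out = frame.copy()
    # Assumption: fill missing ages with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Assumption: illustrative bin edges; labels=False yields ordinal codes 0, 1, 2, ...
    age_edges = [0, 16, 32, 48, 64, np.inf]
    out["Age"] = pd.cut(out["Age"], bins=age_edges, labels=False)
    # Assumption: fill missing ports with the most frequent port.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Drop the columns listed in the cleaning steps.
    return out.drop(columns=["Cabin", "Ticket", "PassengerId", "Name", "SibSp", "Parch"])

clean = prepare(train)
print(clean)
```

The same prepare function would be applied to the test set so that both frames end up with matching columns.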
Feature Engineering Steps
Created Dummy Variables for
- Sex
- Embarked
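Dummy variables for Sex and Embarked can be created with pandas' get_dummies. This is a small sketch on made-up rows; whether the thread's author used get_dummies or another encoder is not stated.

```python
import pandas as pd

# Made-up feature rows with the two categorical columns to encode.
features = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Each category becomes its own indicator column, e.g. Sex_female, Embarked_S.
encoded = pd.get_dummies(features, columns=["Sex", "Embarked"])
print(encoded.columns.tolist())
```

Passing drop_first=True would drop one indicator per feature, which avoids perfectly redundant columns for linear models such as Logistic Regression.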
7. Modeling
We are ready to train our model and predict the output.
Models trained
- Logistic Regression
- k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
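A rough sketch of training the six model types above with scikit-learn. It uses a synthetic dataset from make_classification rather than the prepared Titanic features, and the hyperparameters (mostly defaults) and the 75/25 split are assumptions, so the scores say nothing about the thread's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the prepared features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machines": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
validation_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    validation_accuracy[name] = model.score(X_valid, y_valid)

for name, accuracy in validation_accuracy.items():
    print(f"{name}: {accuracy:.3f}")
```

Keeping the models in a dictionary makes it easy to add or remove candidates without touching the training loop.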
8. Evaluation
Decision Tree and Random Forest achieved the highest accuracy, 93.03%. We can choose either one as the final model.
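The thread does not say whether the accuracy was measured on the training data or on held-out data. As a sketch of why that matters, the example below compares training accuracy with 5-fold cross-validated accuracy for the two tree-based models, using synthetic data (an assumption, not the Titanic set).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared Titanic features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=0)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    # Cross-validation fits fresh copies of the model on each fold.
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    # Training accuracy: fit on all rows, then score on the same rows.
    train_accuracy = model.fit(X, y).score(X, y)
    results[name] = {"train": train_accuracy, "cv_mean": cv_scores.mean()}

for name, scores in results.items():
    print(f"{name}: train={scores['train']:.3f}, cv_mean={scores['cv_mean']:.3f}")
```

An unconstrained decision tree can fit its training rows almost perfectly, so the cross-validated figure is usually the more trustworthy basis for picking a final model.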
That's it for this thread 👋
Please do point out any mistakes you feel I have made!
A retweet for the first one would really mean a lot 🙏
If you liked my content and want to get more threads on Data Science, Machine Learning & Python, do follow me @PiyalBanik
This book deals with manipulating, processing, cleaning, and crunching data in Python. It is about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems.
1) Supervised learning (SL) has a feedback mechanism.
Unsupervised learning (UL) has no feedback mechanism.
2) Supervised learning involves building a model for predicting or estimating an output.
In unsupervised learning, we learn relationships and structures from the data.
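A small sketch of the difference, using scikit-learn on made-up 2D points. The choice of Logistic Regression for the supervised side and k-means for the unsupervised side is just an illustration; the key point is that only the supervised model is given the labels y.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated groups of made-up 2D points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Supervised: the labels y act as feedback while fitting.
classifier = LogisticRegression().fit(X, y)
supervised_accuracy = classifier.score(X, y)

# Unsupervised: only X is used; the model finds structure on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

print(supervised_accuracy)
print(sorted(set(cluster_ids.tolist())))
```

The cluster ids from k-means are arbitrary (cluster 0 need not correspond to label 0), which is one practical consequence of learning without labels.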
- regularization
- simpler model architecture
- more training data
- reduce noise in the data
- reduce the number of input attributes
- shorter training cycles
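As an illustration of the first item, here is a sketch of L2 regularization in scikit-learn's LogisticRegression, where a smaller C means a stronger penalty. The synthetic data and the two C values are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data; any binary classification dataset would do.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# In scikit-learn, C is the inverse of the regularization strength.
strong_penalty = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak_penalty = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)

# Stronger regularization shrinks the learned coefficients toward zero.
strong_norm = np.linalg.norm(strong_penalty.coef_)
weak_norm = np.linalg.norm(weak_penalty.coef_)
print(strong_norm, weak_norm)
```

Smaller coefficients make the model less sensitive to any single input attribute, which is the sense in which regularization yields a simpler model.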
A few things to keep in mind before starting
- Learn By Doing, Practicing & Not Just Reading
- Code By Hand [very effective]
- Share, Teach, Discuss and Ask For Help
- Use Online Resources
- Be consistent
- Learn to Use a Debugger
I have covered all the concepts mentioned below as part of the #100DaysOfCode challenge, and the code can be found on my @github profile.
[Projects & exercises not done. Let me know if you want the solutions.]
Since we're currently in July, start from this month.
July -
Understanding Data Science and getting started with Python
- what is data science?
- what does a data scientist do?
- find out various resources
- Set up the system
- Learn Python basics
- Introduction to Pandas & Numpy
August -
Mathematics, Statistics & SQL
- Linear Algebra
- Introduction to Probability
- Statistics - inferential & descriptive
- Exploratory Data Analysis
- SQL for Data science
- Projects on EDA and SQL
Start engaging in the Data Science & Machine Learning community