Piyal Banik
👨‍🎓 MSc Student in Data Science 🤖 Machine Learning Lead @gdsc_cdtu 🎯 My goal is to make your Data Science & Machine Learning Journey Easy

Jul 25, 2021, 15 tweets

#DataScience Project 1

Titanic – Machine Learning from Disaster

Use Machine Learning to create a model that predicts which passengers survived the Titanic shipwreck.

Libraries Used
- Numpy
- Pandas
- Seaborn
- Scikit-Learn

Final Model Chosen
- Decision Tree: 93.03% training accuracy 🔥 (76.55% on the Kaggle test set)

The data science methodology followed has been outlined by John Rollins, IBM

- Business Understanding
- Analytical Approach
- Data requirements
- Data collection
- Data Understanding
- Data Preparation
- Modeling
- Evaluation

Project Code 👇
github.com/Piyal-Banik/Ti…

1. Business Understanding

Given a passenger's information, how can we predict whether that passenger survived the Titanic disaster?

2. Analytical Approach:

Our target variable is categorical [survived / not survived], and hence we need classification models for this task.

3 & 4. Data Requirements & Data Collection:

[Combined these two steps, as the datasets are provided on Kaggle]

We are given 2 datasets: one for training our model, and one for testing whether the model can predict survival from observations that lack the survival label.
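A minimal loading sketch in pandas; the local file names train.csv and test.csv are assumptions:

```python
import pandas as pd

# Assumed local file names for the Kaggle Titanic CSVs
train = pd.read_csv("train.csv")  # includes the Survived label
test = pd.read_csv("test.csv")    # no Survived column
```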

5. Data Understanding

This step is part of Exploratory Data Analysis

The shape of the datasets
- Training set (891,12)
- Test set (418,11)

In total there are 12 columns in the training set (including the Survived target) and 11 in the test set 👇
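A quick shape check, assuming the DataFrames loaded above:

```python
print(train.shape)  # (891, 12), includes the Survived target
print(test.shape)   # (418, 11), same columns minus Survived
```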

Feature types
- Continuous: Age, Fare
- Discrete: SibSp, Parch
- Categorical: Survived, Sex, and Embarked
- Ordinal: Pclass
- Mixed: Ticket
- Alphanumeric: Cabin
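You can inspect the underlying column dtypes and non-null counts with a one-liner:

```python
# Column dtypes and non-null counts for the training set
train.info()
```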

Features with missing values
- Cabin
- Age
- Embarked
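One way to surface these gaps on the train DataFrame:

```python
# Missing-value counts per column, largest first
print(train.isnull().sum().sort_values(ascending=False))
```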

Statistical Information of the training dataset
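A summary like this typically comes from pandas' describe():

```python
# Count, mean, std, min, quartiles, and max for the numeric columns
print(train.describe())
```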

Relationships between the predictor variables and the target variable:
- Pclass = 1: more likely to survive
- Sex = female: more likely to survive
- Most passengers aged 15-25 did not survive
- Higher fares correlate with better survival
- Port of embarkation correlates with survival rate
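These patterns can be checked with simple group-by survival rates (a sketch; column names as in the Kaggle data):

```python
# Mean of Survived per group = survival rate for that group
for col in ["Pclass", "Sex", "Embarked"]:
    print(train.groupby(col)["Survived"].mean(), "\n")
```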

6. Data Preparation

Cleaning steps based on analysis (see the sketch after this list):
- Impute the missing Age values
- Turn Age into an ordinal feature
- Impute missing Embarked values
- Drop Cabin [too many missing values]
- Drop Ticket [many duplicates]
- Drop PassengerId, Name, SibSp, Parch [not helpful]
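A possible implementation of these steps; the median/mode imputation choices and the five age bins are my assumptions, not necessarily what the notebook does:

```python
# Impute Age and Embarked, bin Age into ordinal groups, drop unused columns
for df in (train, test):
    df["Age"] = df["Age"].fillna(train["Age"].median())
    df["Age"] = pd.cut(df["Age"], bins=[0, 16, 32, 48, 64, 81],
                       labels=[0, 1, 2, 3, 4]).astype(int)
    df["Embarked"] = df["Embarked"].fillna(train["Embarked"].mode()[0])

cols_to_drop = ["Cabin", "Ticket", "PassengerId", "Name", "SibSp", "Parch"]
train = train.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop)
```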

Feature Engineering Steps

Created Dummy Variables for
- Sex
- Embarked
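A sketch with pandas' get_dummies; drop_first=True is an assumption to avoid a redundant column per feature:

```python
# One-hot encode the two categorical features
train = pd.get_dummies(train, columns=["Sex", "Embarked"], drop_first=True)
test = pd.get_dummies(test, columns=["Sex", "Embarked"], drop_first=True)
```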

7. Modeling

We are now ready to train our models and predict survival on the test set.

Models trained
- Logistic Regression
- k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
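A sketch of the training loop with default scikit-learn hyperparameters (the notebook's exact settings may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X_train = train.drop(columns=["Survived"])
y_train = train["Survived"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machines": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # Accuracy on the data the model was trained on
    print(f"{name}: {model.score(X_train, y_train):.4f}")
```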

8. Evaluation

Decision Tree and Random Forest achieved the highest training accuracy, 93.03%. We can choose either one as the final model.
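Training accuracy is optimistic (see the correction at the end of the thread: the Kaggle test score was 76.55%). Cross-validation gives a less misleading estimate; a sketch:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training data
scores = cross_val_score(DecisionTreeClassifier(), X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```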

That's it for this thread 👋

Please do point out if you feel I have made any mistakes!

A retweet of the first tweet would really mean a lot 🙏

If you liked my content and want to get more threads on Data Science, Machine Learning & Python, do follow me @PiyalBanik


Two mistakes in this thread:
- the scikit-learn spelling
- I should not have quoted only the training accuracy; it's misleading. The Kaggle test accuracy was 76.55%
