#DataScience Project 1

Titanic – Machine Learning from Disaster

Use Machine Learning to create a model that predicts which passengers survived the Titanic shipwreck.

Libraries Used
- NumPy
- pandas
- seaborn
- scikit-learn

Final Model Chosen
- Decision Tree: 93.03% accuracy on the training set 🔥

The data science methodology followed is the one outlined by John Rollins, IBM:

- Business Understanding
- Analytical Approach
- Data Requirements
- Data Collection
- Data Understanding
- Data Preparation
- Modeling
- Evaluation

Project Code 👇
github.com/Piyal-Banik/Ti…

1. Business Understanding

Given a passenger's information, how can we predict whether he/she survived the Titanic disaster?

2. Analytical Approach

Our target variable is categorical [survived / not survived], so we need classification models for this task.

3, 4. Data Requirements & Data Collection

[These two steps are combined because the datasets are provided on Kaggle]

We are given two datasets: one for training our model, and one for testing whether the model can predict survival from passenger observations that do not include the survival information.

5. Data Understanding

This step is part of Exploratory Data Analysis.

The shape of the datasets
- Training set (891, 12)
- Test set (418, 11)

In total there are 12 columns in the training set and 11 in the test set, which lacks the Survived column 👇

Feature types
- Continuous: Age, Fare
- Discrete: SibSp, Parch
- Categorical: Survived, Sex, and Embarked
- Ordinal: Pclass
- Mixed: Ticket
- Alphanumeric: Cabin

Features with missing values
- Cabin
- Age
- Embarked
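
A quick way to confirm which columns have gaps is to count the missing values per column. Below is a minimal pandas sketch. The helper name `missing_value_report` and the tiny `toy` frame are made up for illustration; they are not taken from the project code or from the Kaggle files.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, only for columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one gap, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows for illustration only (NOT the Kaggle data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

report = missing_value_report(toy)
print(report)
```

On the real training set, the same function would be called on the frame returned by `pd.read_csv` for the Kaggle training file.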

Statistical information of the training dataset

Relationship of the predictor variables with the target variable:
- Passengers with Pclass = 1 were more likely to survive
- Passengers with Sex = female were more likely to survive
- Most passengers aged 15-25 did not survive
- Passengers who paid a higher fare had better survival rates
- Port of embarkation correlates with survival rates
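
Observations like these can be checked by comparing survival rates across groups. Since Survived is coded as 0/1 in the Kaggle training set, the mean of that column within a group is the group's survival rate. This is only a sketch: `survival_rate_by` is a hypothetical helper and the `toy` rows are invented, so the numbers it prints say nothing about the real passengers.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Survival rate per category of `column` (mean of the 0/1 Survived flag)."""
    return df.groupby(column)["Survived"].mean().sort_values(ascending=False)


# Invented rows for illustration only (NOT the Kaggle data).
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 3, 2],
    "Survived": [1, 1, 0, 1, 0, 0],
})

by_sex = survival_rate_by(toy, "Sex")
by_class = survival_rate_by(toy, "Pclass")
print(by_sex)
print(by_class)
```

The same call works for any categorical or ordinal column, for example Embarked.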

6. Data Preparation

Cleaning steps based on the analysis:
- Impute the missing Age values
- Turn Age into an ordinal feature
- Impute the missing Embarked values
- Drop Cabin [too many missing values]
- Drop Ticket [many duplicates]
- Drop PassengerId, Name, SibSp, Parch [not helpful]
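
The cleaning steps above could look roughly like this in pandas. The thread does not say how the values were imputed or where the age bands were cut, so the median for Age, the most frequent value for Embarked, and the bin edges below are all assumptions. The function name `prepare` and the `toy` rows are likewise made up for illustration.

```python
import numpy as np
import pandas as pd

DROP_COLUMNS = ["Cabin", "Ticket", "PassengerId", "Name", "SibSp", "Parch"]


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a copy of the input frame."""
    out = df.copy()
    # Assumption: fill missing ages with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Assumption: five age bands with these edges; labels=False gives codes 0..4.
    out["Age"] = pd.cut(
        out["Age"], bins=[0, 16, 32, 48, 64, np.inf],
        labels=False, include_lowest=True,
    )
    # Assumption: fill missing ports with the most frequent port.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Drop the columns listed in the cleaning steps (if present).
    return out.drop(columns=[c for c in DROP_COLUMNS if c in out.columns])


# Invented rows for illustration only (NOT the Kaggle data).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 4.0, 58.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 1, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Fare": [7.25, 71.28, 16.70, 13.00],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", None, "C", "S"],
})

clean = prepare(toy)
print(clean)
```

If the same function is reused for the test set, the fill values would ideally be computed on the training set only and then applied to both.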

Feature Engineering Steps

Created dummy variables for:
- Sex
- Embarked
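
Dummy variables can be created with `pandas.get_dummies`. This is a small sketch rather than the project's exact code: `drop_first=True` is an assumption (it drops one level per feature to avoid redundant columns), and the `toy` rows are invented.

```python
import pandas as pd


def add_dummies(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode Sex and Embarked, dropping the first level of each."""
    return pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)


# Invented rows for illustration only (NOT the Kaggle data).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

encoded = add_dummies(toy)
print(encoded)
```

With three ports (C, Q, S), this leaves two columns, Embarked_Q and Embarked_S; a row with both set to 0 means port C.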

7. Modeling

We are now ready to train our models and predict the output.

Models trained
- Logistic Regression
- k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
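
The six models listed above all exist in scikit-learn and share the same `fit` / `score` interface, so they can be trained in a loop. The sketch below uses synthetic data from `make_classification` as a stand-in for the prepared Titanic features, and the hyperparameters (for example `n_neighbors=3`, `max_iter=1000`) are assumptions, not the settings used in the project. `build_models` and `training_accuracy` are hypothetical helper names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed: int = 0) -> dict:
    """The six classifiers from the thread, with assumed default-like settings."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
        "Support Vector Machine": SVC(),
        "Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }


def training_accuracy(models: dict, X, y) -> dict:
    """Fit each model and score it on the SAME data it was trained on."""
    scores = {}
    for name, model in models.items():
        model.fit(X, y)
        scores[name] = model.score(X, y)
    return scores


# Synthetic stand-in data (NOT the Titanic features).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
scores = training_accuracy(build_models(), X, y)
for name, acc in scores.items():
    print(f"{name}: {acc:.4f}")
```

Note that this scores each model on its own training data, so flexible models such as an unpruned decision tree will look better here than they do on unseen passengers.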

8. Evaluation

Decision Tree and Random Forest achieved the highest training accuracy, 93.03%. We can choose either one as the final model.

That's it for this thread 👋

Please do point out if you feel I have made any mistakes!

A retweet of the first tweet would really mean a lot 🙏

If you liked my content and want to get more threads on Data Science, Machine Learning & Python, do follow me @PiyalBanik
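
Two corrections to this thread
- The library name is spelled scikit-learn.
- I should not have reported the training accuracy; it is misleading as a measure of model quality. The 93.03% figure is training accuracy. Test accuracy was 76.55%.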
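
The gap between training and test accuracy can be estimated before submitting by holding out part of the labeled data. Here is a minimal sketch with scikit-learn's `train_test_split` and `cross_val_score`. The data is synthetic and noisy on purpose, the 25% hold-out size and 5 folds are assumptions, and `train_vs_validation_accuracy` is a hypothetical helper, not the project's code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_vs_validation_accuracy(model, X, y, seed: int = 0):
    """Fit on 75% of the rows; report accuracy on those rows and on the held-out 25%."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)


# Synthetic, deliberately noisy stand-in data (NOT the Titanic features).
X, y = make_classification(n_samples=400, n_features=8, flip_y=0.2, random_state=0)

train_acc, val_acc = train_vs_validation_accuracy(
    DecisionTreeClassifier(random_state=0), X, y
)
# 5-fold cross-validation: every row is held out exactly once.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(f"training accuracy:   {train_acc:.4f}")
print(f"validation accuracy: {val_acc:.4f}")
print(f"5-fold CV accuracy:  {cv_scores.mean():.4f}")
```

Comparing models on the validation or cross-validation score, rather than the training score, gives a fairer basis for picking the final model.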
