Thread by @PiyalBanik on Thread Reader App

#DataScience Project 3

Best Suburb to Open a Cafeteria in Melbourne 🇦🇺

- Create a Machine Learning model which suggests a location to open a Cafe.

Libraries Used
- Numpy
- Pandas
- Matplotlib
- Scikit Learn
- BeautifulSoup
- Geocoder
- Folium

Model Used:
- K Means Clustering

Please note: the main focus of this project was data collection, visualization, and training a model. It did not involve data cleaning.

Code for this project 👇
github.com/Piyal-Banik/Me…

1. Business Understanding:

The main goal of this project is to collect and analyze data in order to select a location in Melbourne to open a cafeteria. We want to help a business owner who plans to open a cafe by exploring the facilities around each suburb.

2. Analytical Approach:

This is an unsupervised machine learning problem where we need to group together suburbs having similar facilities. We will use K Means Clustering to solve this problem.

3. Data Requirements:

We would need a list of suburbs, the location of each suburb, and how many cafes are present in the suburb.

4. Data Collection:

- List of Suburbs in Melbourne, Australia which I have extracted from: en.wikipedia.org/wiki/Category:…

- Latitude & Longitude of all the suburbs using Geocoder

- Venues in each suburb from the Foursquare API: foursquare.com
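To make the first item concrete, here is a minimal sketch of pulling suburb names out of a category page with BeautifulSoup. It is not the code from the repo. The `mw-pages` container id and the `li a` selector are assumptions about the page markup, `extract_suburbs` is a hypothetical helper, and `SAMPLE_HTML` is a hand-written stand-in for the downloaded page so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the downloaded category page (assumption:
# the real page lists its members as links inside a div with id "mw-pages").
SAMPLE_HTML = """
<div id="mw-pages">
  <div class="mw-category">
    <ul>
      <li><a href="/wiki/Carlton,_Victoria">Carlton, Victoria</a></li>
      <li><a href="/wiki/Fitzroy,_Victoria">Fitzroy, Victoria</a></li>
    </ul>
  </div>
</div>
"""


def extract_suburbs(html):
    """Return the suburb names linked from a category page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="mw-pages")
    if container is None:
        return []
    names = []
    for link in container.select("li a"):
        # Page titles look like "Carlton, Victoria"; keep the part before the comma.
        names.append(link.get_text(strip=True).split(",")[0])
    return names


suburbs = extract_suburbs(SAMPLE_HTML)
print(suburbs)
```

In a real run, the HTML string would come from fetching the category page (for example with the Requests package) instead of `SAMPLE_HTML`.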

5. Data Understanding

- The Wikipedia page contains a list of suburbs in Melbourne. I extracted 212 suburbs from it by web scraping with Python's BeautifulSoup and Requests packages.

- The geographical coordinates (latitude and longitude) of each suburb were collected using Python's Geocoder package.

- Then, Foursquare API was used to extract details about the various venues present in each suburb.

- Once the location data was extracted using Geocoder, I used the Folium package to visualize it on a map. This let me check that the coordinates we retrieved were correct.

- Foursquare API was used to obtain the top 100 venues within a radius of 2000 meters.
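As a rough illustration of the last item, the request for one suburb could be built as below. The thread does not say which endpoint or API version was used, so the legacy v2 `venues/explore` URL, the version date, and the `build_explore_url` helper are all assumptions, and that legacy endpoint may no longer be available. Only the 2000 meter radius and the limit of 100 come from the thread.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 "explore" endpoint. The thread does not
# name the endpoint, and newer API versions use different URLs and auth.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius=2000, limit=100, version="20180605"):
    """Build a venue-search URL around one suburb (hypothetical helper)."""
    params = {
        "client_id": client_id,          # placeholder credentials
        "client_secret": client_secret,
        "v": version,                    # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",            # latitude,longitude of the suburb
        "radius": radius,                # search radius in meters
        "limit": limit,                  # maximum number of venues returned
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Approximate coordinates of central Melbourne, used only as an example.
url = build_explore_url(-37.8136, 144.9631, "YOUR_ID", "YOUR_SECRET")
print(url)
```

Building the URL separately from sending it keeps the sketch runnable without credentials or network access.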

6. Feature Engineering

- Converted the venue categories into dummy variables using the get_dummies method of the Pandas package, since the clustering algorithm needs numeric input.

- Grouped the data by suburb and took the mean of the frequency of occurrence of each category.

- Kept only the data for the 'Cafe' category.

- Our final data frame had two variables: suburb name and the mean of the frequency of occurrence of cafes
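The steps above can be sketched with Pandas as follows. This is a toy example, not the project's code: the column names `Suburb` and `Category`, the suburb labels, and the venue rows are all made up for illustration.

```python
import pandas as pd

# Made-up venue data: one row per venue returned for a suburb.
venues = pd.DataFrame({
    "Suburb":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Category": ["Cafe", "Park", "Cafe", "Bar", "Cafe", "Gym", "Park", "Bar"],
})

# One-hot encode the venue category (one 0/1 column per category).
onehot = pd.get_dummies(venues["Category"], dtype=float)
onehot.insert(0, "Suburb", venues["Suburb"])

# The mean of each 0/1 column per suburb is that category's share of venues.
freq = onehot.groupby("Suburb").mean().reset_index()

# Keep only the suburb name and the cafe frequency.
cafe_freq = freq[["Suburb", "Cafe"]]
print(cafe_freq)
```

Here suburb A has 2 cafes out of 4 venues and suburb B has 1 out of 4, so their cafe frequencies are 0.5 and 0.25.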

7. Modeling

- Performed clustering on the data using K-means clustering.

- Found 3 clusters based on the frequency of occurrence of cafes in each suburb.

- Identified the suburbs with the highest and the lowest concentration of cafes.
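A minimal sketch of this modeling step with Scikit Learn, using made-up cafe frequencies. K-means assigns cluster numbers arbitrarily, so the sketch re-maps them so that 0 is the lowest-frequency cluster and 2 the highest. That re-mapping is an assumption for readability. The thread does not say how its cluster numbers were assigned.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data: mean frequency of cafes among each suburb's venues.
cafe_freq = pd.DataFrame({
    "Suburb": ["A", "B", "C", "D", "E", "F"],
    "Cafe":   [0.01, 0.02, 0.10, 0.11, 0.30, 0.32],
})

# Cluster on the single cafe-frequency feature.
X = cafe_freq[["Cafe"]].to_numpy()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Re-map labels by ascending centroid so 0 = low, 1 = moderate, 2 = high.
order = np.argsort(kmeans.cluster_centers_.ravel())
relabel = {int(old): new for new, old in enumerate(order)}
cafe_freq["Cluster"] = [relabel[int(label)] for label in kmeans.labels_]

# Suburbs with the highest and lowest cafe concentration.
highest = cafe_freq.loc[cafe_freq["Cafe"].idxmax(), "Suburb"]
lowest = cafe_freq.loc[cafe_freq["Cafe"].idxmin(), "Suburb"]
print(cafe_freq)
print("highest:", highest, "lowest:", lowest)
```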

Results

Categorized the data into 3 categories using K-means clustering based on the frequency of occurrence for ‘Cafe’.
- Cluster 0: Suburbs with a low concentration of cafes.
- Cluster 1: Suburbs with a moderate concentration of cafes.
- Cluster 2: Suburbs with a high concentration of cafes.

Evaluation

- Cluster 0, displayed in red, represents a greater opportunity and high potential, but it also carries the risk of having fewer customers, as those areas are not busy.

- As a new business owner, it wouldn't be wise to choose cluster 2, since it has the highest concentration of cafes and therefore the most competition.

Therefore, I would recommend choosing cluster 1, represented in blue, where there is medium competition but greater opportunity.

That's it for this project 👋

Please do let me know if you feel I have made any mistakes.

I am posting one Data Science Project each week

If you liked my content and want to get more threads on Data Science, Machine Learning & Python, do follow me @PiyalBanik

A like & retweet of the first tweet would mean a lot. Thank you
