A few days back, 2 Data Analyst newbies showed me mistakes they encountered during their analysis process. It dawned on me that lots of times, we don't diligently follow through the data analysis process, especially the cleaning phase.
In this short🧵, I break down this process.
PHASE 1: ASK QUESTIONS
The first step in the analysis process is to formulate a problem to solve or a question to answer, such as: are customers aged 21 to 30 more likely to churn than customers aged 41 to 50? As analysts, we must be very curious, we must ASK QUESTIONS!
PHASE 2: Collect & Store Data
Next, we need to collect and store the necessary data, which could require the use of a database or a spreadsheet. For large datasets, we store them in a database. For datasets shy of a million records, spreadsheet software like Excel can be used.
PHASE 3: Explore Data
Once collected, we need to explore our data to familiarize ourselves with its contents. At this stage, we may find biases in the collected data or discover missing or erroneous entries. This leads us to one of the most important steps
PHASE 4: DATA CLEANING
Once we understand the data, it must be cleaned and prepared for analysis. If our question requires the ages of customers, but we only collected dates of birth, we must convert them at this stage.
A well-cleaned and prepared dataset will make analyzing the data much easier. As data analyst, we have heard several times that this is your most important step. I have seen @malcom_okonkwo , @Rita_tyna, @phaibooboo , and a few other analysts tell us of how it took them hours,
days and even weeks to clean their datasets. The simple reason for this thorough process is to ensure your analysis is built on a very solid foundation. Trust me, you really do not want to cost your clients millions just because you didn't thoroughly clean your data.
PHASE 5: ANALYZE DATA
Now onto the actual analysis! Here's where we solve the problem we formulated. The analysis usually involves performing calculations and often a statistical analysis. This is where we employ all the tools required for our analysis.
As a BI analyst, you will work mostly with BI tools like Power BI, Tableau, Looker, Domo, Datapine, and Microstrategy amongst a few. As a Spreadsheet analyst, you will work mostly with Microsoft Excel, Google Sheets. As a programming analyst, you will mostly work with Python, R
or any other programming tool. As a Database analyst, you will work mostly with Relational Database Management Systems like PostgreSQL, MSQL, MySQl, Oracle.
PHASE 6: PRESENT FINDINGS
For the final phase , we present our results to key stakeholders, which often includes high-level findings from the analysis and supporting visualizations.
Let's break them down: 1. Intuition
The query aims to retrieve information about managers in the Employees table, including their IDs, names, the number of employees reporting to them, and the average age of their reports.
The solution involves using a self join. Since we were told that a manager is also an employee, we will join the table to itself, where the employee_id field is equal reports_to field. On this premise, we aggregate the number of employees who report directly to the manager,
Let's break down each question 1. Logic: Show the classes, where the number of student is more than 5. The word "At least" tells us something, we have to include 5 as part of our filtering condition, hence >=5 2. This looks pretty straight forward. For each user, we want to....
..return the number followers, hence the need to use the Group BY clause. This clause allows us to aggregate data against the unique categorical/qualitative values in a field. 3. The given SQL query finds the maximum value (num) from the "mynumbers" table that appears only once.
Let's break down each code 1. The ending part to the question is where the major conditional logic lies. We were told to return the needed columns, and price for the first year of every product sold. Since we do not have distinct column showing min years for each product,
We have to create that with a subquery in the Where clause. We are concerned about the first year of every product sold, so we aggregate the min years for each product. After doing that, we tell SQL to return the prices of products, the year, and qty when the product was first..