Hi, Happy holidays. Its been months since my last post, because work.. and life.. still, over one hundred new visitors monthly on this sparse blog.. :)Thank you all..
This post is different. It’s on a case study for the Google Data Analytics Certificate I just completed. It’ll be a long write up detailing the analysis process for the case study.
The scenario is: I’m a junior data analyst for Cyclistic, a fictional Bike share company in Chicago. I’ll share the slides presentation, then walk through the data analysis process.
The tools used in this study are:
– Microsoft Excel
– Rstudio
– GitHub
– Tableau
– Adobe Reader
– Google Slides
The analysis process is categorized into 6 phases.
1. Ask
2. Prepare
3. Process
4. Analyze
5. Share
6. Act
- Ask
In the Ask phase, we do two things.
a). We define the problem to be solved (or the business case)
The problem to be solved is usually given by your team leader or manager. For this project, it is provided by the case study assignment.
Show how Annual Members and Casual Riders use Cylistic bikes differently.
b). We make sure we fully understand the stakeholders expectations.
Our stakeholders here are:
– Lily Moreno (Team Leader/Manager)
– The marketing team
– Cyclistic Executives.
They want to create marketing strategies to convert casual riders to annual members for the company to earn more money. To do this, the stakeholders need to understand how casual riders use the company product differently from annual members. To help them understand, our task is to show the differences. - Prepare
In Prepare phase, we’ll verifiy our data source integrity, credibility and reliability. Then collect and store the data for use in our analysis.
Our data source here is provided by the case study assignment. Link with access to 12 csv files hosted on AWS for each month from Nov 2020 to Oct 2021. I downloaded the files for study in Excel. I think the best way to get to know your data is in Excel. All 12 files containing data on all rides for each month have the same structure of 13 columns.
1. ride i_d
2. bike_type
3. date/time_of _ride_start
4. date/time_of_ride_end
5. start_station_name
6. start_station_id
7. end_station_name
8. end_station_id
9. start_station_latitude
10. start_station_logitude
11. end_station_latittude
12. end_station_logitude
13. customer_type
After getting to know the data, I did some quick data cleaning on all 12 files, by
1. Editing a few column names.
2. Adding one more column on all 12 files. “day_of_the_week” to show the day of the week the customer rented a bike. The Excel formula for this is : TEXT((reference cell), “dddd”)
3. Save editied files as an excel workbook (.xlsx). - Process
Here, we do more cleaning. Combining datasets and transforming it to make our information more useful for analysis. This includes removing inconsistencies, outliers, typos etc that could skew our analysis.
For this phase, I chose R. SQL may probably be more practical here, but I like R a lot more, I think its very elegant. Its my preferred programming language.
The reason we’re moving to R from Excel mainly is because we need to combine all 12 files into one and they won’t fit in Excel. Excel has a maximum capacity of about 1 million rows. These 12 files combined is around 5.4 million rows.
In R-studio, we
1. Combined all 12 files into one.
2. Calculated the average length of each ride. “avg_ride_length” column added.
3. Further cleaning by deleting inconsistencies such as negative ride lengths and “TEST” rides. Around 1,500 rows was removed, approximately 0.03% of our data.
4. Export the clean data as a .csv file for further analysis and visualization in Tableau.
5. Save the R-script for future reference.
For libraries and functions I used in R, you can view my R-script on GitHub. - Analyze
In the Analyze phase, basically we try to make sense of our processed data. We transform and organize it into useful information. So that we can draw conclusions, make predictions and drive informed decision making.
Here, I chose Tableau. Although we can do analysis and create charts in R with ggplot2, it’s amazing what you can do in R.
Still, I don’t think writing lines of code is the best way to explore data for insights. In R, you have to already know what your output will look like, then you write the code to generate it.
With Tableau, you don’t really know. You explore, play with, filter, mix, match and generate visuals from the data to make complex concepts and numbers easier to see and understand.
Using Tableau Desktop, I connected to the csv file downloaded from Rstudio, then started exploring to see what we can come up with. This is the most fun part really. From our data, we were able to extract nine unique insights shown in charts. Once satisfied with our exploration and insights gotten, we then edit our visuals to make it clear, simple and to best show what the data is saying. - Share
In the share phase, we’ll interpret the results of our analysis and share our conclusion and recommendations to help stakeholders make data-driven decisions.
We’re done editing our charts in Tableau Desktop.
Next, we save our work to Tableau Public online. Tableau doesn’t let you save your work locally without first uploading it to your account online.
This done, we then download it in PDF. PNG files from Tableau public is not good. Best option is PDF. Save the pdf files and open it up in Adobe reader and then export as a jpeg file. Then upload the jpegs in creating our Google Slides in Chrome. - Act
This is the final phase in Data Analysis. Here, we take all of the insights we’ve got from our data and put them to work to solve the original business case: To show how Annual Members and Casual Riders use Cyclistic bikes differently so that our stakeholders can create marketing strategies to convert casual riders to annual members for the company to earn more money. Here’s a good time to go through the Google slides again and you can see that we have successfully accomplished our task.
Act is what we’re doing right now. Me writing this and you reading.. if it was for real, there will be a presentation of the slides and stakeholders nodding heads in comprehension like in the movies saying stuff like “excellent” then going back to the “drawing board” and creating marketing strategies/allocating budget based on the insights and recommendations from the data analysis. harharhar..
Alright! This brings us to the end of this post. Not really sure why I’m smiling r/n but this again was fun. Thank you for being here. Your curiosity and our shared interest brought you. Its the end of the year. Happy Holidays!!