In this step I have data storage and organization, to manipulate and analyze them, we have 12 previous months to find answers and larger amounts of data. The best practice to work with larger data is SQL and R, but I will use spreadsheets as well to show some examples to handle that data.
Thinking about the questions we have to answer, I choose to create a temporary table, to manipulate specific parts of our data to find answers and to become the larger data light to analyze in the DbeaderAdmin.
Now we have many temporary tables for analyzing our data and finding answers. First I will start analyzing the month of January, the focus in the first question, and the objective of the Director of marketing and your manager
We need to avoid wrong results in our finds, in this dataset we have more than 96 thousands rows of registration.
Also I find 37 rows that are not necessary for our analyze, then we can use clause where to filter wrong data.
Now we filter our data using parameter duration <= ’24:00:00’.
Also we can use another FUNCTION together to improve our result for each month like;
GROUP BY, ORDER BY AND WHERE.
We can find different points and another way to see our data, to have a better analyze and understand what want to Now we can take average for each group and how many casual and member have in each month:
This month January casual usually spend more time than member even though the number of casuals is less on relation that member, for this analyze I use FUNCTION AVG() WITH COUNT()
Starting analyze in the RStudio.
This analysis is based on the Divvy case study "'Sophisticated, Clear, and Polished’: Divvy and Data Visualization" written by Kevin Hartman (found here: https://artscience.blog/home/divvy-dataviz-case-study). The purpose of this script is to consolidate downloaded Divvy data into a single dataframe and then conduct simple analysis to help answer the key question: “In what ways do members and casual riders use Divvy bikes differently:
!-Install required packages.
!-tidyverse for data import and wrangling.
!-lubridate for date functions.
!-ggplot for visualization.
library(tidyverse) #helps wrangle datalibrary(lubridate) #helps wrangle date attributeslibrary(ggplot2) #helps visualize datagetwd() #displays your working directory
# Also after we had manipulate dataset of 2021 in spreadsheet and PostgreSQL let's to import that data reference the year 2021
annual_trips <- read.csv('../input/bikeshare2021/2021_Annual_summary_bike_share - Sheet1.csv')
#Convert "duration" from Factor to numeric so we can run calculations on the data
is.factor(annual_trips$duration)
annual_trips$duration <- as.numeric(as.numeric(annual_trips$duration))
is.numeric(annual_trips$duration)
head(annual_trips)
A data.frame: 24 × 3 annual_trips$member_casual annual_trips$month annual_trips$ride_numbers <chr> <ord> <dbl> casual january 2584.286 member january 11243.857 casual february 1439.000 member february 5633.143 casual march 11983.000 member march 20636.143 casual april 19481.429 member april 28657.857 casual may 36644.000 member may 39240.429 casual june 52868.571 member june 51264.286 casual july 63070.143 member july 54329.714 casual august 58888.143 member august 55945.571 casual september 51933.143 member september 56028.571 casual october 36693.286 member october 53416.857 casual november 15254.286 member november 36144.714 casual december 9943.286 member december 25397.143
# Now we have aggregate our data for month and numbers of ride we can create out ggplot.
annual_trips %>%
group_by(month,member_casual) %>% # group months and member casual users.
summarise(ride_numbers = mean(ride_numbers)) %>% # take average of ride of ride by group and month
ggplot(aes(month,ride_numbers, fill = member_casual)) + # x= month and y =average of ride for group and month
geom_col(position = "dodge")+ # time of ggplot on this case we use column
facet_grid(~member_casual)+ #separete for grid
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # align description`summarise()` has grouped output by 'month'. You can override using the`.groups` argument.
-----
# Now, let's run the average ride time by each month for members vs casual users aggregate(annual_trips$duration ~ annual_trips$member_casual + annual_trips$month, FUN = mean)
A data.frame: 24 × 3 annual_trips$member_casual annual_trips$month annual_trips$duration <chr> <ord> <dbl> casual january 38.42857 member january 99.00000 casual february 30.28571 member february 61.71429 casual march 261.42857 member march 198.57143 casual april 420.71429 member april 288.85714 casual may 804.85714 member may 394.00000 casual june 1078.14286 member june 513.00000 casual july 1205.28571 member july 530.57143 casual august 1076.71429 member august 538.85714 casual september 910.42857 member september 526.00000 casual october 582.28571 member october 454.14286 casual november 198.14286 member november 278.42857 casual december 125.71429 member december 190.85714
# Now we have aggregate our data for month and time duration we can create out ggplot.
annual_trips %>%
group_by(month,member_casual) %>% # group months and member casual users.
summarise(average_duration = mean(duration)) %>% # take average of duration of ride by group and month
ggplot(aes(month,average_duration, fill = member_casual)) + # x= month and y =average duration of ride for group and month
geom_col(position = "dodge")+ # time of ggplot on this case we use column
facet_grid(~member_casual)+ #separete for grid
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # align description
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.
No comments:
Post a Comment