
Analyze

In this step the data has already been stored and organized; now we manipulate and analyze it. We have 12 months of data to search for answers, which is a large amount. The best tools for working with larger datasets are SQL and R, but I will also use spreadsheets to show a few examples of handling this data.

Thinking about the questions we have to answer, I chose to create temporary tables: they let us manipulate specific parts of the data to find answers, and they make the large dataset lighter to analyze in DBeaver.
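As a minimal sketch of that step (the source table name `trips_2021` and its columns are assumptions, since the original schema is not shown here), a temporary table for one month could be created like this:

```sql
-- Sketch only: trips_2021, ride_id, member_casual, started_at and ended_at
-- are assumed names; adjust them to the real schema.
CREATE TEMP TABLE january_trips AS
SELECT
    ride_id,
    member_casual,
    started_at,
    ended_at,
    ended_at - started_at AS duration  -- ride length as an interval
FROM trips_2021
WHERE started_at >= '2021-01-01'
  AND started_at <  '2021-02-01';
```

A temporary table lives only for the current session, so the full dataset stays untouched while we experiment.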

 

Now we have several temporary tables for analyzing our data and finding answers. I will start with the month of January, focusing on the first question and on the objective set by the director of marketing and your manager.


18.png


We need to avoid wrong results in our findings: this dataset has more than 96 thousand rows.

I also found 37 rows that are not needed for our analysis, so we can use a WHERE clause to filter out the bad data.

19.png

Now we filter our data using the condition duration <= '24:00:00'.

20.png
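A sketch of that filter, assuming the hypothetical temporary table `january_trips` with an interval column `duration`:

```sql
-- Keep only plausible rides; rows longer than 24 hours are dropped.
SELECT *
FROM january_trips
WHERE duration <= '24:00:00';
```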

We can also combine other clauses to improve the result for each month, such as GROUP BY, ORDER BY, and WHERE.

These clauses give us different views of the data and a better understanding of it. Now we can take the average for each group and count how many casual and member riders there are in each month:

21.png

In January, casual riders usually spend more time per ride than members, even though there are fewer casual riders than members. For this analysis I used the AVG() function together with COUNT().
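The step above can be sketched as one query (again assuming the hypothetical `january_trips` table); it combines AVG(), COUNT(*), WHERE, GROUP BY, and ORDER BY:

```sql
-- Average ride duration and number of rides per rider type, bad rows excluded.
SELECT
    member_casual,
    AVG(duration) AS avg_duration,
    COUNT(*)      AS total_rides
FROM january_trips
WHERE duration <= '24:00:00'
GROUP BY member_casual
ORDER BY member_casual;
```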
Next, the analysis moves to RStudio.

 

This analysis is based on the Divvy case study "'Sophisticated, Clear, and Polished': Divvy and Data Visualization" written by Kevin Hartman (found here: https://artscience.blog/home/divvy-dataviz-case-study). The purpose of this script is to consolidate downloaded Divvy data into a single dataframe and then conduct simple analysis to help answer the key question: "In what ways do members and casual riders use Divvy bikes differently?"

 

# Install required packages:
#   tidyverse for data import and wrangling
#   lubridate for date functions
#   ggplot2 for visualization


library(tidyverse)  #helps wrangle data
library(lubridate)  #helps wrangle date attributes
library(ggplot2)  #helps visualize data
getwd() #displays your working directory
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
 ggplot2 3.3.6      purrr   0.3.4
 tibble  3.1.7      dplyr   1.0.9
 tidyr   1.2.0      stringr 1.4.0
 readr   2.1.2      forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

'/kaggle/working'
# After manipulating the 2021 dataset in spreadsheets and PostgreSQL, let's import the 2021 data
annual_trips <- read.csv('../input/bikeshare2021/2021_Annual_summary_bike_share - Sheet1.csv')

#Convert "duration" from Factor to numeric so we can run calculations on the data
is.factor(annual_trips$duration)
annual_trips$duration <- as.numeric(annual_trips$duration)
is.numeric(annual_trips$duration)
head(annual_trips)
FALSE
TRUE

A data.frame: 6 × 4
  month   member_casual ride_numbers duration
  <chr>   <chr>                <dbl>    <dbl>
1 january casual                2088       26
2 january member               11135       92
3 january casual                2098       31
4 january member               11979      105
5 january member               10583       97
6 january member                8887       83
# Notice that the months of the year are out of order. Let's fix that.
annual_trips$month <- ordered(annual_trips$month, levels=c("january", "february", "march", "april", "may", "june", "july","august", "september","october", "november", "december"))
# We can also run the average number of rides per month for member vs casual users.
aggregate(annual_trips$ride_numbers ~ annual_trips$member_casual + annual_trips$month, FUN = mean)


A data.frame: 24 × 3
   annual_trips$member_casual annual_trips$month annual_trips$ride_numbers
   <chr>                      <ord>                                  <dbl>
   casual                     january                             2584.286
   member                     january                            11243.857
   casual                     february                            1439.000
   member                     february                            5633.143
   casual                     march                              11983.000
   member                     march                              20636.143
   casual                     april                              19481.429
   member                     april                              28657.857
   casual                     may                                36644.000
   member                     may                                39240.429
   casual                     june                               52868.571
   member                     june                               51264.286
   casual                     july                               63070.143
   member                     july                               54329.714
   casual                     august                             58888.143
   member                     august                             55945.571
   casual                     september                          51933.143
   member                     september                          56028.571
   casual                     october                            36693.286
   member                     october                            53416.857
   casual                     november                           15254.286
   member                     november                           36144.714
   casual                     december                            9943.286
   member                     december                           25397.143

 


# Now that we have aggregated our data by month and number of rides, we can create our ggplot.

annual_trips %>% 
  group_by(month, member_casual) %>% # group by month and rider type
  summarise(ride_numbers = mean(ride_numbers)) %>% # average number of rides per group and month
  ggplot(aes(month, ride_numbers, fill = member_casual)) + # x = month, y = average rides
  geom_col(position = "dodge") + # column chart with groups side by side
  facet_grid(~member_casual) + # separate panel per rider type
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) # rotate x-axis labels

`summarise()` has grouped output by 'month'. You can override using the `.groups` argument.


-----



# Now, let's run the average ride time by each month for members vs casual users
aggregate(annual_trips$duration ~ annual_trips$member_casual + annual_trips$month, FUN = mean)

A data.frame: 24 × 3
   annual_trips$member_casual annual_trips$month annual_trips$duration
   <chr>                      <ord>                              <dbl>
   casual                     january                         38.42857
   member                     january                         99.00000
   casual                     february                        30.28571
   member                     february                        61.71429
   casual                     march                          261.42857
   member                     march                          198.57143
   casual                     april                          420.71429
   member                     april                          288.85714
   casual                     may                            804.85714
   member                     may                            394.00000
   casual                     june                          1078.14286
   member                     june                           513.00000
   casual                     july                          1205.28571
   member                     july                           530.57143
   casual                     august                        1076.71429
   member                     august                         538.85714
   casual                     september                      910.42857
   member                     september                      526.00000
   casual                     october                        582.28571
   member                     october                        454.14286
   casual                     november                       198.14286
   member                     november                       278.42857
   casual                     december                       125.71429
   member                     december                       190.85714
# Now that we have aggregated our data by month and ride duration, we can create our ggplot.

annual_trips %>% 
  group_by(month, member_casual) %>% # group by month and rider type
  summarise(average_duration = mean(duration)) %>% # average ride duration per group and month
  ggplot(aes(month, average_duration, fill = member_casual)) + # x = month, y = average duration
  geom_col(position = "dodge") + # column chart with groups side by side
  facet_grid(~member_casual) + # separate panel per rider type
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) # rotate x-axis labels

`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.



Extra Data manipulation.

The page "JSON and Python in PostgreSQL" is just extra data manipulation; the practice there is how to handle an API in the data...