Process

After storing my data, I need to check its integrity: define the steps that ensure the data is clean and ready to analyze, and document the process so that I can share my results.

In this step I will use different tools depending on the size of the dataset: if a file is too large I will manipulate it in SQL, but a small dataset can be handled in a spreadsheet.


I start by using spreadsheets to check each file. We need to remove or correct inconsistencies in every row; skipping this step can lead to misleading results and bias.

Null values

Delete empty rows
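When a file is loaded into the database instead of a spreadsheet, the same clean-up can be sketched in SQL. A minimal example, assuming a trips table and the column names ride_id, started_at and ended_at (my assumption, not the post's actual schema):

-- Remove rows that are missing the essential ride fields (assumed names).
DELETE FROM trips
WHERE ride_id IS NULL
   OR started_at IS NULL
   OR ended_at IS NULL;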
Sometimes a column has a large number of empty cells. In that case I use a filter to find the blank cells, then use find and replace to classify the empty values as NONE.
Function: filter blank cells
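When the same file sits in PostgreSQL, blank cells usually arrive as NULL or as empty strings, so the equivalent of this filter is a simple query. A sketch using an assumed start_station_name column:

-- Count blank or missing station names (assumed column name).
SELECT COUNT(*)
FROM trips
WHERE start_station_name IS NULL
   OR start_station_name = '';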
After replacing the blank cells and removing the empty rows, we need to remove duplicate values in the id column; spreadsheet add-ons can help with this work.
Find duplicates
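In SQL the duplicate check can be written with GROUP BY instead of an add-on. A minimal sketch, assuming ride_id is the identifier column:

-- List any ride_id that appears more than once (assumed column name).
SELECT ride_id, COUNT(*) AS occurrences
FROM trips
GROUP BY ride_id
HAVING COUNT(*) > 1;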
Create new columns to calculate the exact length of each ride and the day of the week it took place.
Function WEEKDAY
Function SUM()
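In the spreadsheet the new columns come from functions such as WEEKDAY(); when the data is in PostgreSQL instead, the same two columns can be derived straight from the timestamps. A sketch with assumed column names (started_at, ended_at):

-- Ride length and day of week derived from the start and end timestamps (assumed columns).
SELECT ride_id,
       ended_at - started_at      AS ride_length,
       TO_CHAR(started_at, 'Day') AS day_of_week
FROM trips;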
After creating the new columns, we search them for inconsistencies. We also create filters to catch negative ride lengths and rides longer than 24 hours, to avoid biased or unfair values.
Continue filtering to avoid bias
A rider can have at most a full day of use on a pass, so I applied a filter for values greater than 24 hours and found unfair values.
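The same sanity check expressed in SQL, again with assumed started_at and ended_at columns:

-- Rides with a negative length or longer than a full day (assumed columns).
SELECT *
FROM trips
WHERE ended_at < started_at
   OR ended_at - started_at > INTERVAL '24 hours';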
Next we will use SQL to count the rows and look for patterns and bias.
Ready to analyze
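A minimal sketch of that first look, assuming a member_casual column that distinguishes the two rider types:

-- Total number of rows, then rows per rider type (member_casual is an assumed column).
SELECT COUNT(*) FROM trips;

SELECT member_casual, COUNT(*) AS rides
FROM trips
GROUP BY member_casual;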
Starting with PostgreSQL and DBeaver.
Some datasets are almost impossible to prepare in a spreadsheet because of the sheer quantity of data; they are too heavy to process, so we need a different approach. In this case I use SQL: I import the remaining CSV files into the database and prepare them there.
After cleaning the data, we can use DBeaver to manage the database; it is a useful tool for working with PostgreSQL and other databases.
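Before importing, the target table has to exist. A minimal sketch of a table definition; the column names and types are my assumption about the ride data, not the post's actual schema:

-- Assumed schema for the trip data; adjust names and types to match the real CSV header.
CREATE TABLE trips (
    ride_id            TEXT,
    rideable_type      TEXT,
    started_at         TIMESTAMP,
    ended_at           TIMESTAMP,
    start_station_name TEXT,
    end_station_name   TEXT,
    member_casual      TEXT
);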
When we have large quantities of data, the best practice is to use psql to import the data into the database. In this case I use psql commands because I need to save time and prepare the data in a way that feels safe and low-risk. I use a shell script that runs psql to load my dataset into PostgreSQL.
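A minimal sketch of that import step; the table and file names are placeholders, not the ones used in the post. The same line can be wrapped in a shell script that calls psql -d <database> -c "..." so every monthly file loads in one run.

-- Run inside psql, once per monthly CSV file (placeholder names):
\copy trips FROM '202101-tripdata.csv' CSV HEADER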
A SELECT statement to view the data.
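For example, a quick preview of the imported rows (table name assumed):

-- Preview the first rows of the imported data.
SELECT *
FROM trips
LIMIT 10;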
Once all the data has been imported into the database, we can start manipulating it with DML statements.
Data prepared to analyze. We can update the columns to change NULL values to NONE, which gives a clearer, better-organized view of the data before analysis.
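A sketch of that update for one column; in practice it is repeated for every text column that still holds NULL (column name assumed):

-- Replace missing station names with the label 'NONE' (assumed column).
UPDATE trips
SET start_station_name = 'NONE'
WHERE start_station_name IS NULL;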
PostgreSQL database

At this point the data has been processed, prepared, cleaned and organized; now I can manipulate it and look for the answers to the questions I am asking.


Extra data manipulation.

The page "JSON and Python in PostgreSQL" is just extra data manipulation, the case study practice is how to handle API in the data...