Process

After storing my data, I need to check its integrity: define the steps that ensure the data is clean and ready to analyze, and document the process so that I can share my results.

In this step I will use different tools depending on the size of the dataset: if a file is too large I will manipulate it in SQL, but a small dataset can be handled in a spreadsheet.


I start by using spreadsheets to check each file. We need to remove or correct inconsistencies in every row; skipping this step can lead to misleading results and bias.

Null values

Delete empty rows
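When a file is loaded into the database instead of a spreadsheet, the same clean-up can be sketched in SQL. A minimal example, assuming a trips table and the column names ride_id, started_at and ended_at (my assumption, not the post's actual schema):

-- Remove rows that are missing the essential ride fields (assumed names).
DELETE FROM trips
WHERE ride_id IS NULL
   OR started_at IS NULL
   OR ended_at IS NULL;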
Sometimes a column has a large number of empty cells. In that case I use a filter to find the blank cells, then use find and replace to classify the empty values as NONE.
Function: filter blank cells
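When the same file sits in PostgreSQL, blank cells usually arrive as NULL or as empty strings, so the equivalent of this filter is a simple query. A sketch using an assumed start_station_name column:

-- Count blank or missing station names (assumed column name).
SELECT COUNT(*)
FROM trips
WHERE start_station_name IS NULL
   OR start_station_name = '';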
After replacing the blank cells and removing the empty rows, we need to remove duplicate values in the id column; spreadsheet add-ons can help with this work.
Find duplicates
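In SQL the duplicate check can be written with GROUP BY instead of an add-on. A minimal sketch, assuming ride_id is the identifier column:

-- List any ride_id that appears more than once (assumed column name).
SELECT ride_id, COUNT(*) AS occurrences
FROM trips
GROUP BY ride_id
HAVING COUNT(*) > 1;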
Create new columns to calculate the exact length of each ride and the day of the week it took place.
Function WEEKDAY
Function SUM()
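In the spreadsheet the new columns come from functions such as WEEKDAY(); when the data is in PostgreSQL instead, the same two columns can be derived straight from the timestamps. A sketch with assumed column names (started_at, ended_at):

-- Ride length and day of week derived from the start and end timestamps (assumed columns).
SELECT ride_id,
       ended_at - started_at      AS ride_length,
       TO_CHAR(started_at, 'Day') AS day_of_week
FROM trips;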
After creating the new columns, we search them for inconsistencies. We also create filters to catch negative ride lengths and rides longer than 24 hours, to avoid biased or unfair values.
Continue filtering to avoid bias
A rider can have at most a full day of use on a pass, so I applied a filter for values greater than 24 hours and found unfair values.
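The same sanity check expressed in SQL, again with assumed started_at and ended_at columns:

-- Rides with a negative length or longer than a full day (assumed columns).
SELECT *
FROM trips
WHERE ended_at < started_at
   OR ended_at - started_at > INTERVAL '24 hours';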
Next we will use SQL to count the rows and look for patterns and bias.
Ready to analyze
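A minimal sketch of that first look, assuming a member_casual column that distinguishes the two rider types:

-- Total number of rows, then rows per rider type (member_casual is an assumed column).
SELECT COUNT(*) FROM trips;

SELECT member_casual, COUNT(*) AS rides
FROM trips
GROUP BY member_casual;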
Starting with PostgreSQL and DBeaver.
Some datasets are almost impossible to prepare in a spreadsheet because of the sheer quantity of data; they are too heavy to process, so we need a different approach. In this case I use SQL: I import the remaining CSV files into the database and prepare them there.
After cleaning the data, we can use DBeaver to manage the database; it is a useful tool for working with PostgreSQL and other databases.
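Before importing, the target table has to exist. A minimal sketch of a table definition; the column names and types are my assumption about the ride data, not the post's actual schema:

-- Assumed schema for the trip data; adjust names and types to match the real CSV header.
CREATE TABLE trips (
    ride_id            TEXT,
    rideable_type      TEXT,
    started_at         TIMESTAMP,
    ended_at           TIMESTAMP,
    start_station_name TEXT,
    end_station_name   TEXT,
    member_casual      TEXT
);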
When we have large quantities of data, the best practice is to use psql to import the data into the database. In this case I use psql commands because I need to save time and prepare the data in a way that feels safe and low-risk. I use a shell script that runs psql to load my dataset into PostgreSQL.
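A minimal sketch of that import step; the table and file names are placeholders, not the ones used in the post. The same line can be wrapped in a shell script that calls psql -d <database> -c "..." so every monthly file loads in one run.

-- Run inside psql, once per monthly CSV file (placeholder names):
\copy trips FROM '202101-tripdata.csv' CSV HEADER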
A SELECT statement to view the data.
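For example, a quick preview of the imported rows (table name assumed):

-- Preview the first rows of the imported data.
SELECT *
FROM trips
LIMIT 10;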
Once all the data has been imported into the database, we can start manipulating it with DML statements.
Data prepared to analyze. We can update the columns to change NULL values to NONE, which gives a clearer, better-organized view of the data before analysis.
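A sketch of that update for one column; in practice it is repeated for every text column that still holds NULL (column name assumed):

-- Replace missing station names with the label 'NONE' (assumed column).
UPDATE trips
SET start_station_name = 'NONE'
WHERE start_station_name IS NULL;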
PostgreSQL database

At this point the data has been processed, prepared, cleaned and organized; now I can manipulate it and look for the answers to the questions I am asking.


Extra data manipulation.

The page "JSON and Python in PostgreSQL" is just extra data manipulation, the case study practice is how to handle API in the data...