Maven Taxi Challenge
Introduction
This time we will be analyzing a massive data set, containing records from over 27 million taxi trips in New York City.
Our assignment is to wrangle this massive data set, perform a thorough QA and cleaning of the data, and present a dashboard that the Lead Dispatcher can use to accurately understand historical trip information.
The Dataset
We have 5 type of dataset, Trip dataset, Map zone, Calendar, Zone and Dictionary.
For the first, we will see trip data for 4 years, 2017,2018,2019, and 2020. We can see the trip data is very big, with around 27 million records.
Second is Map data in taxi_zones_map.json file, we will use this file for create specific map in Power BI.
Next, Fiscal calendar data, we can see in the data like the picture above, there are Date, FiscalYear, DayName, etc.
And the last is Dictionary Data, we can use this data to understand the data.
Data Preparation & Data Cleaning
The first issue, the data is quite big with around 27 million records, so is very difficult for Power BI to process and will take a long time to process. So In this case, I have to import the data to some tool to make it easier and faster to process the data. And I choose SQL to solve this issue.
SSIS
SQL Server Integration Services is a platform for building enterprise-level data integration and data transformations solutions. Use Integration Services to solve complex business problems by copying or downloading files, loading data warehouses, cleansing and mining data, and managing SQL Server objects and data.
In SSIS, Wee needs to create a project first and after that, we can create a package with a simple process. We can right-click on SSIS Packages, and we choose & apply for SSIS Import and Export Wizard… like the picture above.
Next, we choose the Data source, we can import 1 by 1 or we can create to import all, but for this case, I create packages for import 1 by 1.
After we choose a data source, we need to choose a destination source also like the picture above.
Before finishing, we can preview the data to make sure no errors while we execute the packages after this.
For Execute the package, we can choose & right-click the package and click Execute Package.
And the last, after we finished and success imported the data, in our database we will see the table like the picture above.
Merge Data and Prepare Data for Power BI
After we finish importing the data we need to merge the data into 1 table to reduce the process in Power BI.
After merging the data trip, I found the data very big, and once again, to reduce the process in Power BI, I decided to join Trip data and Calendar data into 1 table.
The query joins like the picture above.
Visualization in Power BI
For the first, we need to import data from SQL into Power BI, we need to create a connection between SQL and Power BI. And don’t forget to import Zone data.
Look at this, at the first, we have 5 datasets (Trip data, Map data, Zone data, Fiscal calendar data, and Dictionary data) but in Power BI we already reduce the process in SQL, so we only make 3 datasets in Power BI and this is very helpful.
We only create 1 relationship table (Data table and Zone table). After we create a relationship and enjoy to create creativity with all chart or graph that Power BI provide to us.
Map Visualization
For this Challenge, I am very excited because I can use the Map Shapefile function in Power BI, Finally … hahaha…
First, we choose and apply the Shape map graph, and drag and drop LocationID Field into Location.
Next, We need to focus the map into a map that we will display depending on the dataset.
And last we need to add pick-up and drop-off count to the color saturation. and finally, we have created the map visualization.
Result and Conclusion
Power BI is a Powerful data visualization tool, many features we can use in Power BI and I’m very excited to explore more and more.
Reference
SQL Server Integration Services — SQL Server Integration Services (SSIS) | Microsoft Docs