Maven Taxi Challenge

Arif Zainurrohman
Nerd For Tech
Published in
5 min readOct 24, 2021

--

Taxi

Introduction

This time we will be analyzing a massive data set, containing records from over 27 million taxi trips in New York City.

Our assignment is to wrangle this massive data set, perform a thorough QA and cleaning of the data, and present a dashboard that the Lead Dispatcher can use to accurately understand historical trip information.

The Dataset

The Dataset

We have 5 type of dataset, Trip dataset, Map zone, Calendar, Zone and Dictionary.

Trip Data

For the first, we will see trip data for 4 years, 2017,2018,2019, and 2020. We can see the trip data is very big, with around 27 million records.

Map Data

Second is Map data in taxi_zones_map.json file, we will use this file for create specific map in Power BI.

Fiscal Calendar Data

Next, Fiscal calendar data, we can see in the data like the picture above, there are Date, FiscalYear, DayName, etc.

Dictionary Data

And the last is Dictionary Data, we can use this data to understand the data.

Data Preparation & Data Cleaning

The first issue, the data is quite big with around 27 million records, so is very difficult for Power BI to process and will take a long time to process. So In this case, I have to import the data to some tool to make it easier and faster to process the data. And I choose SQL to solve this issue.

SSIS

SQL Server Integration Services is a platform for building enterprise-level data integration and data transformations solutions. Use Integration Services to solve complex business problems by copying or downloading files, loading data warehouses, cleansing and mining data, and managing SQL Server objects and data.

Create Package

In SSIS, Wee needs to create a project first and after that, we can create a package with a simple process. We can right-click on SSIS Packages, and we choose & apply for SSIS Import and Export Wizard… like the picture above.

Choose Data Source

Next, we choose the Data source, we can import 1 by 1 or we can create to import all, but for this case, I create packages for import 1 by 1.

Choose Destination Source

After we choose a data source, we need to choose a destination source also like the picture above.

Preview Data

Before finishing, we can preview the data to make sure no errors while we execute the packages after this.

Execute Package

For Execute the package, we can choose & right-click the package and click Execute Package.

Result Import Data

And the last, after we finished and success imported the data, in our database we will see the table like the picture above.

Merge Data and Prepare Data for Power BI

Merge Data

After we finish importing the data we need to merge the data into 1 table to reduce the process in Power BI.

Dataset Ready for Power BI

After merging the data trip, I found the data very big, and once again, to reduce the process in Power BI, I decided to join Trip data and Calendar data into 1 table.

Query Dataset for Power BI

The query joins like the picture above.

Visualization in Power BI

Import Data from SQL to Power BI

For the first, we need to import data from SQL into Power BI, we need to create a connection between SQL and Power BI. And don’t forget to import Zone data.

Create Relation between Data and Zone

Look at this, at the first, we have 5 datasets (Trip data, Map data, Zone data, Fiscal calendar data, and Dictionary data) but in Power BI we already reduce the process in SQL, so we only make 3 datasets in Power BI and this is very helpful.

We only create 1 relationship table (Data table and Zone table). After we create a relationship and enjoy to create creativity with all chart or graph that Power BI provide to us.

Map Visualization

For this Challenge, I am very excited because I can use the Map Shapefile function in Power BI, Finally … hahaha…

Map Visualization

First, we choose and apply the Shape map graph, and drag and drop LocationID Field into Location.

Add Map

Next, We need to focus the map into a map that we will display depending on the dataset.

Pick-Up and Drop-Off Area

And last we need to add pick-up and drop-off count to the color saturation. and finally, we have created the map visualization.

Result and Conclusion

NYC Taxi Trip Analysis Dashboard

Power BI is a Powerful data visualization tool, many features we can use in Power BI and I’m very excited to explore more and more.

Reference

SQL Server Integration Services — SQL Server Integration Services (SSIS) | Microsoft Docs

--

--

Arif Zainurrohman
Nerd For Tech

Corporate Data Analytics. Enthusiast in all things data, personal finance, and Fintech.