Data Wrangling — Data Cleanup: Investigation, Matching, and Formatting

Arif Zainurrohman
Published in Nerd For Tech · May 30, 2021

Data Handling Process

Not All Data Is Created Equal

Although we’d like to believe in the veracity and quality of every dataset we see, not all datasets will measure up to our expectations. Even datasets we currently use could prove to be ineffective and inefficient sources after further research. As we explore automated solutions to the data wrangling problems we face, we will find that Python’s tools can help us distinguish good data from bad and assess the viability of our data.

Readability, Cleanliness, and Longevity

We can use Python to help us read illegible data, but the illegibility may mean the data doesn’t come from a good source. If it is massive and generated by a computer, that is one thing — database dumps are never pretty. However, if the data you have is illegible and from a human source, it may point to an issue of data cleanliness and variability.

Another issue we face is whether our data has already been cleaned. We can determine this by asking more about how the data is collected, reported, and updated.

Storing Your Data: When, Why, and How?

After we’ve located our data, we need a place to store it. Sometimes, we’ll have received data in a clean, easy-to-access, and machine-readable format. Other times, we might want to find a different way to store it. We’ll review some data storage tools to use when we first extract our data from a CSV or PDF, or we can wait and store our data once it’s fully processed and cleaned.

Data Cleanup: Investigation, Matching, and Formatting

Cleaning up our data is not the most glamorous of tasks, but it’s an essential part of data wrangling. Becoming a data cleaning expert requires precision and a healthy knowledge of our area of research or study. Knowing how to properly clean and assemble our data will set us miles apart from others in our field.

Python is well designed for data cleanup; it helps us build functions around patterns, eliminating repetitive work. As we’ve already seen in our code so far, learning to fix repetitive problems with scripts and code can turn hours of manual work into a script we run once.

Why Clean Data?

Some data may come to us properly formatted and ready to use. If this is the case, we can consider ourselves lucky. But most data, even if it is cleaned, has some formatting inconsistencies or readability issues (e.g., acronyms or mismatched description headers). This is especially true if we are using data from more than one dataset. It’s unlikely our data will properly join and be useful unless we spend time formatting and standardizing it.

If we’d like to present findings and publish our data, we’ll want to publish the cleaned version. This gives other data wranglers the opportunity to easily import and analyze the data. We can also publish the raw data alongside our finished dataset with notations on what steps we took to clean and normalize it.

Identifying Values for Data Cleanup

We begin our data clean-up with a simple review of the fields we find and any visual inconsistencies we can see. If we start our data cleanup by making our data look cleaner, we will have a good idea of the initial problems we must conquer as we normalize our data.

Identifying Values

A first pass over this dataset reveals some weird values in inctot (9999999), ages that cover quite a wide range, and some n/a values in empstat.
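A quick review like this can be scripted with pandas. The small DataFrame below is a hypothetical stand-in for the census-style fields mentioned above (inctot, age, empstat), just to show the inspection pattern:

```python
import pandas as pd

# Hypothetical sample mirroring the fields discussed above:
# inctot (total income), age, and empstat (employment status).
df = pd.DataFrame({
    "age": [34, 3, 70, 25, 101],
    "inctot": [32000, 9999999, 18000, 9999999, 41000],
    "empstat": ["employed", "n/a", "retired", "employed", "n/a"],
})

# Summary statistics expose the 9999999 sentinel in inctot
# and the wide age range.
print(df[["age", "inctot"]].describe())

# Frequency counts surface the "n/a" placeholder in empstat.
print(df["empstat"].value_counts())

# Flag suspicious records for follow-up.
suspicious = df[(df["inctot"] == 9999999) | (df["empstat"] == "n/a")]
print(len(suspicious))  # 3
```

`describe()` and `value_counts()` are usually enough to spot sentinel values and placeholder strings before any deeper cleanup begins.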

Formatting Data

One of the most common forms of data cleanup is converting unreadable or hard-to-read data and data types into a properly readable format. Especially if we need to create reports or downloadable files from the data, we’ll want to make sure it goes from being machine-readable to human-readable. And if our data needs to be used alongside APIs, we might need specially formatted data types.
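As a small sketch of this machine-readable-to-human-readable step, assume a raw record with a terse income field and a compact date string (both hypothetical):

```python
import datetime

# Hypothetical raw record: terse field names, unformatted values.
raw = {"inctot": 32000, "survey_date": "20210530"}

# Format the income as currency and parse the compact date.
income = "${:,.2f}".format(raw["inctot"])
date = datetime.datetime.strptime(raw["survey_date"], "%Y%m%d").date()

report = "Total income: {} (surveyed {})".format(income, date.isoformat())
print(report)  # Total income: $32,000.00 (surveyed 2021-05-30)
```

The same idea scales up: decide on one target format per field, then apply it across the whole dataset in a loop or a vectorized operation rather than by hand.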


Finding Outliers and Bad Data

Identifying outliers and bad data in our dataset is probably one of the most difficult parts of data cleanup, and it takes time to get right. Even if we have a deep understanding of statistics and how outliers might affect our data, it’s always a topic to explore cautiously.
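One common, cautious starting point is the interquartile-range (IQR) rule. The sketch below uses hypothetical income values in which 9999999 again plays the role of a sentinel that shows up as an extreme outlier:

```python
import statistics

# Hypothetical income values; 9999999 is a placeholder/sentinel.
values = [18000, 21000, 24000, 27000, 30000, 32000,
          35000, 38000, 41000, 44000, 9999999]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [9999999]
```

Whether a flagged value is truly bad data is a judgment call; the code only surfaces candidates for review, which is exactly the caution the paragraph above recommends.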


Finding Duplicates

If we are using more than one dataset with the same survey data or if we have used raw data that may have duplicate entries, removing duplicate data will be an important step in ensuring our data can be accurately used.
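With pandas, exact duplicates are easy to find and drop. The records below are hypothetical survey rows combined from two sources, with one full-row duplicate:

```python
import pandas as pd

# Hypothetical survey records merged from two datasets;
# the second and third rows are exact duplicates.
df = pd.DataFrame({
    "respondent": ["A101", "A102", "A102", "A103"],
    "age": [34, 25, 25, 70],
})

# duplicated() marks rows where every column repeats an earlier row.
print(df.duplicated())

# drop_duplicates() keeps the first occurrence of each record.
deduped = df.drop_duplicates()
print(len(deduped))  # 3
```

For near-duplicates (typos, inconsistent spellings), exact matching is not enough, which is where the fuzzy matching discussed next comes in.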


Fuzzy Matching

If we are using more than one dataset or unclean, unstandardized data, we might use fuzzy matching to find and combine duplicates. Fuzzy matching allows us to determine if two items (usually strings) are “the same.” While not as in-depth as using natural language processing or machine learning to determine a match on big language datasets, fuzzy matching can help us relate “My dog & I” and “me and my dog” as having similar meaning.
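One minimal way to score string similarity is the standard library’s difflib; dedicated libraries such as thefuzz offer richer scoring, but the idea is the same. A sketch:

```python
import difflib

def similarity(a, b):
    """Return a 0-1 ratio of how alike two strings are, ignoring case."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# The pair discussed above scores well above an unrelated string.
print(similarity("My dog & I", "me and my dog"))
print(similarity("My dog & I", "quarterly revenue"))
```

In practice we would pick a threshold (say, 0.6) above which two records are treated as candidate duplicates, then review the matches before merging.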


RegEx Matching

Regular expressions allow computers to match, find, or eliminate patterns in strings or data defined in the code. Regular expressions, or regex, are often feared by developers since they can become complex and can be difficult to read and understand. However, they can be quite useful, and a basic introduction to them can help you read, write, and understand when regex can help solve your problem.
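A small, self-contained illustration of regex in cleanup work: normalizing inconsistently formatted phone numbers (the pattern and sample values here are hypothetical):

```python
import re

# Capture a US-style phone number in several common notations.
pattern = re.compile(r"\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})")

raw_numbers = ["(555) 123-4567", "555.123.4567", "5551234567"]
cleaned = []
for number in raw_numbers:
    match = pattern.search(number)
    if match:
        # Rejoin the captured groups in one standard format.
        cleaned.append("-".join(match.groups()))

print(cleaned)  # ['555-123-4567', '555-123-4567', '555-123-4567']
```

Capturing groups, as used here, are the feature that makes regex so useful for cleanup: they let us pull the meaningful pieces out of messy input and reassemble them in one consistent format.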


Conclusion

In this discussion, we learned the basics of data cleanup and why it’s an essential step in our data wrangling process.

Cleaning our data makes for easier storage, search, and reuse. It’s much easier to store our data in proper models if it’s cleaned first.

If we can standardize what we expect to see and clean or remove records that don’t fit, then we ensure our data’s consistency and eliminate hard work later when we need to query assets in our dataset.

Reference

Kazil, Jacqueline, and Katharine Jarmul. Data Wrangling with Python. O’Reilly Media, Inc., February 2016. ISBN: 9781491948811.

Pattern Matching in Python with Regex (tutorialspoint.com)

Find Duplicate Rows in a Dataframe Based on All or Selected Columns (GeeksforGeeks)

Detect and Remove the Outliers Using Python (GeeksforGeeks)

Fuzzy String Matching in Python (DataCamp)
