Data Exploration and Transformation

In this chapter, you will learn more about data transformation in pandas and become comfortable manipulating data to produce a dataset that is ready for analysis.

By the end of this chapter, you will understand how to deal with messy or missing data and how to summarize it for the purpose of your analysis.

In this chapter, we will cover the following topics:

  • Dealing with messy data
  • Dealing with missing data
  • Summarizing data
  • Activity 7.01 -- data analysis using pivot tables

Introduction to data transformation

In data science work, it is important to make sure that messy data has been cleaned up and that missing data has been handled correctly before you analyze a dataset. Otherwise, you could get unexpected results when summarizing the dataset and deriving insights. For example, missing data is sometimes represented by an arbitrary placeholder number, such as -999. If you calculate an average without cleaning up those placeholders, the -999 values are included in the calculation and the result is incorrect. Knowing that convention (with -999 representing missing data) allows you to exclude that number from any calculation and avoid reporting incorrect aggregations. A good understanding of how to handle messy and missing data in pandas will improve the accuracy of your analysis and your confidence in it.
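
The effect of a placeholder such as -999 on an average can be illustrated with a minimal sketch. The DataFrame, its temperature column, and the values in it are hypothetical and exist only for illustration. Replacing the placeholder with NaN is one common approach, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; -999 is assumed to be the placeholder for "missing"
df = pd.DataFrame({"temperature": [21.5, 22.0, -999, 23.5, -999]})

# Naive average: the -999 placeholders are treated as real measurements
naive_mean = df["temperature"].mean()

# Replace the placeholder with NaN so that pandas treats it as missing;
# mean() skips NaN values by default (skipna=True)
cleaned = df["temperature"].replace(-999, np.nan)
clean_mean = cleaned.mean()

print("Average including -999:", naive_mean)
print("Average after replacing -999 with NaN:", clean_mean)
```

In this sketch, the naive average is pulled far below any real reading, while the cleaned average is computed from the three valid values only.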