Grouping for Aggregation, Filtration, and Transformation
This chapter covers the groupby method, which lets you split your data into groups by one or more keys, apply a function independently to each group, and return the results as a single dataset.
In this chapter, we will cover the following topics:
- Defining an aggregation
- Grouping and aggregating with multiple columns and functions
- Removing the MultiIndex after grouping
- Customizing an aggregation function
- Customizing aggregating functions with *args and **kwargs
- Examining the groupby object
- Filtering for states with a minority majority
- Transforming through a weight loss bet
- Calculating weighted mean SAT scores per state with apply
- Grouping by continuous variables
- Counting the total number of flights between cities
- Finding the longest streak of on-time flights
One of the most fundamental tasks in data analysis is splitting data into independent groups before performing a calculation on each group. This methodology has been around for quite some time but has more recently been referred to as split-apply-combine.
Hadley Wickham coined the term split-apply-combine to describe the common data analysis pattern of breaking up data into independent, manageable chunks, independently applying functions to these chunks, and then combining the results back together. More details can be found in his paper, The Split-Apply-Combine Strategy for Data Analysis.
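To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with a single groupby call. The DataFrame and its column names (state, sales) are hypothetical and invented purely for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA', 'NY'],
    'sales': [10, 20, 5, 15, 8],
})

# Split: build one sub-DataFrame per unique value of the grouping column
pieces = {state: df[df['state'] == state] for state in df['state'].unique()}

# Apply: compute a statistic on each piece independently
totals = {state: piece['sales'].sum() for state, piece in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(totals, name='sales').sort_index()
manual.index.name = 'state'

# groupby carries out the same three steps in one chained expression
grouped = df.groupby('state')['sales'].sum()

print(manual)
print(grouped)
```

Both approaches should yield the same totals per state. The manual version is only meant to expose the pattern; in practice, the groupby form is shorter and avoids the explicit Python loop.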
Before we get started with the recipes, we will need to know just a little terminology. All basic groupby operations have grouping columns, and each unique combination of values in these columns represents an independent grouping of the data. The syntax looks as follows:
>>> df.groupby(['list', 'of', 'grouping', 'columns'])
>>> df.groupby('single_column')  # when grouping by a single column
This operation returns a groupby object, which is the engine that drives all the calculations in this chapter. Pandas does very little when creating the groupby object; it merely validates that grouping is possible. To get any results, you must chain methods onto the groupby object.
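As a rough illustration of this behavior, the sketch below creates a groupby object, shows that an invalid grouping column is rejected at creation time, and then chains methods to produce actual results. The airline and delay columns are hypothetical names chosen for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    'airline': ['AA', 'AA', 'UA', 'UA', 'UA'],
    'delay': [5, 15, 0, 30, 60],
})

# Creating the groupby object computes no statistics yet
grouped = df.groupby('airline')

# Validation happens up front: a nonexistent column raises a KeyError
try:
    df.groupby('not_a_column')
    validation_failed = False
except KeyError:
    validation_failed = True

# Results only appear once a method is chained onto the groupby object
mean_delay = grouped['delay'].mean()   # average delay per airline
group_sizes = grouped.size()           # number of rows per airline

print(mean_delay)
print(group_sizes)
```

The same groupby object can be reused for several different calculations, which is convenient when exploring more than one statistic for the same grouping.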
Technically, the operation returns either a DataFrameGroupBy or a SeriesGroupBy object, but for simplicity, it is referred to as the groupby object throughout this chapter.
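The following short sketch shows when each of the two types arises. It inspects the class names rather than importing the classes, since their import location is an internal detail that may differ between pandas versions. The team, points, and assists columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    'team': ['a', 'a', 'b'],
    'points': [1, 2, 3],
    'assists': [4, 5, 6],
})

# Grouping a DataFrame yields a DataFrameGroupBy
frame_groups = df.groupby('team')

# Selecting a single column from it yields a SeriesGroupBy
series_groups = df.groupby('team')['points']

# Selecting with a list of columns keeps it a DataFrameGroupBy
still_frame = df.groupby('team')[['points']]

frame_type = type(frame_groups).__name__
series_type = type(series_groups).__name__
still_frame_type = type(still_frame).__name__

print(frame_type, series_type, still_frame_type)
```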