Friday, February 1, 2019

Applying functions to a groupby of a pandas dataframe

In this article, I'm giving a few examples on applying functions as part of the "groupby" in pandas.
To start, think of the groupby function as the equivalent to a Pivot in Excel. In the Excel, a pivot allows you to summarize data, usually with a COUNT, SUM or MEAN function.
In pandas, the groupby function also allows you to perform some sort of aggregation function on segments or groups in your data.
So, let's first begin by importing the usual necessary libraries:
Next, for demonstration purposes, we will import a dataset about cars. The mpg dataset is a well-known dataset used for exploration and learning purposes. We can find a description of the dataset at the UCI Machine Learning repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg. Note that we can download the set from the UCI site, but the dataset does not include headers. We would have to utilize a separate file that has the column names.
So, for ease, we will import this dataset from the seaborn library. You can browse the datasets included in the seaborn package here: https://github.com/mwaskom/seaborn-data. But since we don't actually need the seaborn library in this notebook, we will import directly from their github repo's data folder.
Let's take a peek at the first few rows to get a feel for what this data set looks like.
The model_year is a numeric field, but could also be used as a categoric field. Let's see what the distribution of value counts look like:
Likewise, we see that the cylinders field is also a discrete numeric field. Let's check that field:
The origin field is our last categoric field:
For our first groupby example, we group by originmodel_year, and cylinders, and summarize the weight of each group with the mean function.


Sometimes we need to apply more than one function to the groups. This can be accomplished with the agg function.
This example looks complicated, so let's break it down.
First we specify the columns in the groupby function, as we did in the last example: mpg['origin']mpg['model_year']mpg['cylinders'].
Then in the agg function, we pass in a dictionary with the field names as the keys, and the functions that we will apply as the values.
Finally, for clarity, we use the rename function to give meaningful names to our aggregated columns. Since we have performed different operations on the various columns, we append the function name to the column name. This is just a convention I like to use, so feel free to omit or modify as you like.


On some occasions, we want to use our own custom function, and the apply function allows us to do just that.
First, let's create a custom function, called reverse_cumsum. This is a rather contrived example, so just go along with it for the example.
Next I'm going to subset the dataset to only 4 cylinder cars from USA.
Now we can apply the custom reverse_cumsum function to our dataset with the apply function. Note that I'm applying this function to the mpg and weightfields in the dataset.


In this article, we demonstrated:
  1. how to use the groupby function to apply standard functions,
  2. how to use the agg function to apply multiple functions to different fields, and
  3. how to use the apply function to apply a custom function.
You can also take a look at the notebook.

No comments: