Saturday, February 2, 2019

Using cut to bin data in a pandas DataFrame


In this tutorial, we will explore how to bin data in a pandas DataFrame using the cut function.


What is binning?


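Binning is a way to group data into a small number of containers called bins. For example, in surveys, an age question might collect answers as ranges rather than exact values. An example of age bins might be: 0 - 24, 25 - 34, 35 - 49, 50 - 69, 70+. Note that the ranges do not overlap, so every value falls into exactly one bin.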

Tutorial


Let's see how we can bin data using the pandas cut function.

First we import the pandas and numpy libraries. We'll use numpy to generate some sample data.
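
The imports would look something like the following. The np and pd aliases are the usual convention, assumed here.

```python
# Conventional aliases for the two libraries used in this tutorial
import numpy as np
import pandas as pd
```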


Now let's generate 1000 random samples using the random.normal function. This will be the data that we plan to bin. In the normal function, we pass in three arguments:

  • loc - the mean of the normal distribution
  • scale - the standard deviation of the normal distribution
  • size - the number of samples in the returned numpy array

In our case, we want 1000 samples, but we'll only print out the first 10 samples as a sanity check of the data.
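
Here is a minimal sketch of that step. The mean of 0, the standard deviation of 1, and the fixed seed are assumptions chosen for illustration; use whatever values fit your data.

```python
import numpy as np

# Assumption: a fixed seed so that repeated runs give the same numbers
np.random.seed(0)

# Assumption: mean 0 and standard deviation 1 (illustrative values only)
samples = np.random.normal(loc=0.0, scale=1.0, size=1000)

# Sanity check: look at the first 10 samples only
print(samples[:10])
```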


Next we generate some fake labels to pretend we have a binary classification problem. We will not bin the labels; they are only here so that we have something to aggregate later. Again, we'll print only the first 10 values as a sanity check.
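
One way to produce such labels is numpy's random.randint, sketched below. The choice of randint and the seed are assumptions; any source of 0/1 values would do.

```python
import numpy as np

# Assumption: a fixed seed so that repeated runs give the same labels
np.random.seed(1)

# The upper bound is exclusive, so every label is either 0 or 1
labels = np.random.randint(0, 2, size=1000)

# Sanity check: look at the first 10 labels only
print(labels[:10])
```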


Let's put these two arrays together to form a pandas DataFrame:
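
A sketch of building the DataFrame from a dict of arrays follows. The data is regenerated here so the snippet runs on its own; the distribution parameters and seed are assumptions.

```python
import numpy as np
import pandas as pd

# Assumption: illustrative data, regenerated so this snippet is self-contained
np.random.seed(0)
samples = np.random.normal(loc=0.0, scale=1.0, size=1000)
labels = np.random.randint(0, 2, size=1000)

# Each dict key becomes a column name
df = pd.DataFrame({"samples": samples, "labels": labels})

print(df.head(10))
```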


To create our bins, we use the linspace function in numpy to generate an array of evenly spaced numbers. The arguments for linspace are:

  • start - the first number in our array
  • stop - the last number in our array (included by default)
  • num - how many numbers are in our array
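
As a sketch, the bin boundaries could be generated like this. The range of -4 to 4 and the 9 boundaries are assumptions that suit a standard normal sample. Keep in mind that 9 boundaries define 8 bins.

```python
import numpy as np

# Assumption: 9 evenly spaced boundaries from -4 to 4, giving 8 bins of width 1
hist_bins = np.linspace(-4, 4, 9)

print(hist_bins)
```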


Now we can indicate which sample is in which bin with the cut function in pandas. We will bin the samples column, using the hist_bins values as the bin boundaries. Note that I am also passing in right=False, so each bin includes its left boundary and excludes its right boundary. If you want the right side to be closed and the left side to be open, pass in right=True, which is the default.
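
The call might look like the sketch below. The data, the boundaries, and the name of the new bins column are assumptions made so the snippet runs on its own. Any sample that falls outside the boundaries gets NaN instead of a bin.

```python
import numpy as np
import pandas as pd

# Assumption: illustrative data and boundaries, so this snippet is self-contained
np.random.seed(0)
df = pd.DataFrame({"samples": np.random.normal(loc=0.0, scale=1.0, size=1000)})
hist_bins = np.linspace(-4, 4, 9)

# right=False makes each interval closed on the left and open on the right,
# so a value equal to a boundary goes into the bin that starts at that boundary
df["bins"] = pd.cut(df["samples"], bins=hist_bins, right=False)

print(df.head(10))
```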


Binning becomes useful when we want to apply some operation to the binned data. For our example, let's do a groupby operation on the bins and then aggregate the labels data by performing a count operation on it. Then we can do a cumsum (cumulative sum) on the labels count.
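
Here is one way the groupby, count, and cumsum steps could fit together. The data, the boundaries, and the bins column name are assumptions, and observed=False is passed so that empty bins still show up with a count of 0.

```python
import numpy as np
import pandas as pd

# Assumption: illustrative data and boundaries, so this snippet is self-contained
np.random.seed(0)
df = pd.DataFrame({
    "samples": np.random.normal(loc=0.0, scale=1.0, size=1000),
    "labels": np.random.randint(0, 2, size=1000),
})
hist_bins = np.linspace(-4, 4, 9)
df["bins"] = pd.cut(df["samples"], bins=hist_bins, right=False)

# Count the labels that fall in each bin; observed=False keeps empty bins
counts = df.groupby("bins", observed=False)["labels"].count()

# Running total of the counts, from the lowest bin to the highest
cumulative = counts.cumsum()

print(pd.DataFrame({"count": counts, "cumsum": cumulative}))
```

The last value of the cumulative sum equals the number of samples that landed inside the bin boundaries, which makes for a quick sanity check.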

This is all a contrived example, of course, to give you ideas for your specific use case.

Summary


In this article, we learned how to bin our data using the pandas cut function, so that we could later perform some aggregate operations on the data. I hope you found this tutorial useful.
