Saturday, February 9, 2019

Quick and dirty histogram with Unix commands: cut, sort, uniq

Introduction

There are many occasions, often daily, when I need to know the distribution of values in a field of a CSV file. I can calculate this with many tools, such as Excel, R, or Python. One approach you might not readily think of is using Unix command-line tools. This is handy because sometimes I don't need to write a script; I just want a quick and dirty way to look at the distribution. In this article, I'll show you how to do this.

Cut command

The first step in this task is to isolate the field we are interested in. Let's look at an example CSV file, with three fields:


$ cat example.csv
ID,fruit,yes_no
1,apple,yes
2,apple,yes
3,banana,yes
4,apple,yes
5,banana,no
6,orange,no
7,apple,no
8,orange,no
9,apple,no
10,apple,no
11,apple,yes
12,apple,no
13,banana,no
14,orange,no
15,apple,no

The field I'm interested in examining is the fruit field. Let's first isolate that with the cut command. The -f option specifies the field number; here we use 2 for the second column (cut numbers columns starting at 1). The -d option specifies the delimiter, and since this is a CSV file, we use ','.


$ cut -f 2 -d ',' example.csv 
fruit
apple
apple
banana
apple
banana
orange
apple
orange
apple
apple
apple
apple
banana
orange
apple

Tail command 

This has one problem: we don't need the column header, so let's skip it with the tail command. Tail gives us the end of a file, and since we want to start on line 2, we pass -n +2. We then pipe ('|') the results from tail into cut. Now we have the values in the fruit column, but without the fruit header.


$ tail -n +2 example.csv | cut -f 2 -d ','
apple
apple
banana
apple
banana
orange
apple
orange
apple
apple
apple
apple
banana
orange
apple

Sort command

Now let's sort the values with the sort command:


$ tail -n +2 example.csv | cut -f 2 -d ',' | sort
apple
apple
apple
apple
apple
apple
apple
apple
apple
banana
banana
banana
orange
orange
orange

Uniq command

Then we can use the uniq command with the -c option to count the unique instances:


$ tail -n +2 example.csv | cut -f 2 -d ',' | sort | uniq -c
   9 apple
   3 banana
   3 orange

Saturday, February 2, 2019

Using cut to bin data in a pandas dataframe


In this tutorial, we will explore how to bin data in a pandas DataFrame using the cut function.


What is binning?


Binning is a way to group data into smaller containers called bins. For example, in surveys, an age question might collect data into ranges. An example of age bins might be: 0 - 24, 25 - 34, 35 - 49, 50 - 69, 70+.

Tutorial


Let's see how we can bin data using the pandas cut function.

First we import the pandas and numpy libraries. We'll use numpy to generate some sample data.
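A minimal sketch of those imports, using the standard aliases:

import numpy as np
import pandas as pd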


Now let's generate 1000 random samples using the random.normal function. This will be the data that we plan to bin. In the normal function, we pass in three arguments:

  • loc - the mean of the normal distribution
  • scale - the standard deviation of the distribution
  • size - the number of samples in the returned numpy array

In our case, we want 1000 samples, but we'll only print out the first 10 samples as a sanity check of the data.
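Something like the following, where a mean of 50 and a standard deviation of 10 are just illustrative choices:

# 1000 samples from a normal distribution (loc and scale values are illustrative)
samples = np.random.normal(loc=50, scale=10, size=1000)
print(samples[:10])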


Next we generate some fake labels to pretend we have a binary classification. We will not be binning this data; it is just here for the example. Again, we'll print out the first 10 samples only to sanity check it.
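A sketch of the fake labels, assuming random 0/1 values drawn with numpy's randint:

# fake binary labels: 0 or 1 (randint's upper bound is exclusive)
labels = np.random.randint(0, 2, size=1000)
print(labels[:10])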


Let's put these two arrays together to form a pandas DataFrame:
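For example, with column names samples and labels:

df = pd.DataFrame({'samples': samples, 'labels': labels})
print(df.head())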


To create our bins, we use the linspace function in numpy to generate an array of evenly spaced numbers (see the sketch after this list). The arguments for linspace are:

  • start - first number in our array
  • stop - last number in our array
  • num - how many numbers in our array
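A sketch of the bin edges, assuming we want edges from 0 to 100, which is wide enough to cover our samples:

# 11 evenly spaced edges give 10 bins of width 10
hist_bins = np.linspace(start=0, stop=100, num=11)
print(hist_bins)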


Now we can indicate which sample is in which bin with the cut function in pandas. We will bin the samples column, using the hist_bins values to indicate the actual bin boundaries. Note that I am also passing in right=False to indicate that the right boundary of each bin is open. If you want the right side to be closed and the left side to be open, pass in right=True (the default).
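Roughly:

# label each sample with its bin; right=False makes bins closed on the left, open on the right
df['bins'] = pd.cut(df['samples'], bins=hist_bins, right=False)
print(df.head())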


Where binning becomes useful is when we want to apply some operation to the binned data. For our example, let's do a groupby operation on the bins and then aggregate the labels data by performing a count operation on it. Then we can do a cumsum (cumulative sum) on the label counts.
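A sketch of that chain:

# count the labels in each bin, then take a running total across the bins
label_counts = df.groupby('bins')['labels'].count()
print(label_counts.cumsum())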

This is all a contrived example, of course, to give you ideas for your specific use case.

Summary


In this article, we learned how to bin our data using the pandas cut function, so that we could later perform some aggregate operations on the data. I hope you found this tutorial useful.

Friday, February 1, 2019

Applying functions to a groupby of a pandas dataframe

In this article, I'm giving a few examples of applying functions as part of a groupby in pandas.
To start, think of the groupby function as the equivalent of a pivot table in Excel. In Excel, a pivot table allows you to summarize data, usually with a COUNT, SUM, or AVERAGE function.
In pandas, the groupby function similarly allows you to perform an aggregation function on segments or groups in your data.
So, let's first begin by importing the usual necessary libraries:
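For this notebook, pandas is the one we strictly need:

import pandas as pd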
Next, for demonstration purposes, we will import a dataset about cars. The mpg dataset is a well-known dataset used for exploration and learning purposes. You can find a description of it at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg. Note that we could download the dataset from the UCI site, but that version does not include headers; we would have to use a separate file that has the column names.
So, for ease, we will use the copy in the seaborn library's data repository. You can browse the datasets included with the seaborn package here: https://github.com/mwaskom/seaborn-data. Since we don't actually need the seaborn library itself in this notebook, we will read the CSV directly from the repo's data folder.
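A sketch of the load, using the raw GitHub URL for the mpg.csv file (the exact path is an assumption based on the repo above):

mpg = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')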
Let's take a peek at the first few rows to get a feel for what this data set looks like.
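For example:

mpg.head()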
The model_year field is numeric, but could also be treated as a categoric field. Let's see what the distribution of value counts looks like:
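Something like:

mpg['model_year'].value_counts()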
Likewise, we see that the cylinders field is also a discrete numeric field. Let's check that field:
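Same idea:

mpg['cylinders'].value_counts()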
The origin field is our last categoric field:
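And again:

mpg['origin'].value_counts()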
For our first groupby example, we group by origin, model_year, and cylinders, and summarize the weight of each group with the mean function.
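A sketch of that groupby:

# mean weight for each (origin, model_year, cylinders) group
mpg.groupby(['origin', 'model_year', 'cylinders'])['weight'].mean()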


Sometimes we need to apply more than one function to the groups. This can be accomplished with the agg function.
This example looks complicated, so let's break it down.
First we specify the columns in the groupby function, as we did in the last example: mpg['origin'], mpg['model_year'], and mpg['cylinders'].
Then in the agg function, we pass in a dictionary with the field names as the keys, and the functions that we will apply as the values.
Finally, for clarity, we use the rename function to give meaningful names to our aggregated columns. Since we have performed different operations on the various columns, we append the function name to the column name. This is just a convention I like to use, so feel free to omit or modify as you like.
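A sketch of the whole chain, where taking the mean of weight and the max of mpg is just an assumed pair of aggregations:

summary = (mpg
           .groupby([mpg['origin'], mpg['model_year'], mpg['cylinders']])
           .agg({'weight': 'mean', 'mpg': 'max'})  # dict: field name -> function
           .rename(columns={'weight': 'weight_mean', 'mpg': 'mpg_max'}))  # append function names
print(summary.head())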


On some occasions, we want to use our own custom function, and the apply function allows us to do just that.
First, let's create a custom function called reverse_cumsum. This is a rather contrived example, so just go along with it.
Next, I'm going to subset the dataset to only 4-cylinder cars from the USA.
Now we can apply the custom reverse_cumsum function to our dataset with the apply function. Note that I'm applying this function to the mpg and weight fields in the dataset.
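A sketch of all three steps, where grouping by model_year before applying is an assumption:

# a cumulative sum that runs from the last row of each group back to the first
def reverse_cumsum(x):
    return x[::-1].cumsum()[::-1]

# subset: 4-cylinder cars from the USA
usa_4cyl = mpg[(mpg['origin'] == 'usa') & (mpg['cylinders'] == 4)]

# apply the custom function to the mpg and weight fields, per model year
usa_4cyl.groupby('model_year')[['mpg', 'weight']].apply(reverse_cumsum)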


In this article, we demonstrated:
  1. how to use the groupby function to apply standard functions,
  2. how to use the agg function to apply multiple functions to different fields, and
  3. how to use the apply function to apply a custom function.
You can also take a look at the notebook.