Saturday, February 9, 2019

Quick and dirty histogram with Unix commands: cut, sort, uniq

Introduction

There are many occasions, usually daily, when I need to know a distribution of a field in a CSV. I can calculate this with many tools, such as Excel, R, or Python. One such way that you might not readily think of is doing this with Unix command line tools. This is useful because sometimes I don't need to write a script, I just want a quick and dirty way to look at the distribution. In this article, I'll show you how to do this.

Cut command

The first step in this task is to isolate the field we are interested in. Let's look at an example CSV file, with three fields:


$ cat example.csv
ID,fruit,yes_no
1,apple,yes
2,apple,yes
3,banana,yes
4,apple,yes
5,banana,no
6,orange,no
7,apple,no
8,orange,no
9,apple,no
10,apple,no
11,apple,yes
12,apple,no
13,banana,no
14,orange,no
15,apple,no

The field I'm interested in examining is the fruit field. Let's first isolate that with the cut command. The -f option is the field number, and in this case we use 2 for the second (1-based indexing of the columns). The -d option is the delimiter, and since this is a CSV file, we use ','. 


$ cut -f 2 -d ',' example.csv 
fruit
apple
apple
banana
apple
banana
orange
apple
orange
apple
apple
apple
apple
banana
orange
apple

Tail command 

This has one problem: we don't need the column header, so let's skip that with the tail command. Tail gets us the end of the file, and we want to start on line 2, so we pass in -n +2. Then we pass those results from tail to cut using the pipe ('|'). Now we have the values in the fruit column, but without the fruit header.


$ tail -n +2 example.csv | cut -f 2 -d ','
apple
apple
banana
apple
banana
orange
apple
orange
apple
apple
apple
apple
banana
orange
apple

Sort command

Now let's sort the values with the sort command:


$ tail -n +2 example.csv | cut -f 2 -d ',' | sort
apple
apple
apple
apple
apple
apple
apple
apple
apple
banana
banana
banana
orange
orange
orange

Uniq command

Then we can use the uniq command with the -c option to count the unique instances:


$ tail -n +2 example.csv | cut -f 2 -d ',' | sort | uniq -c
   9 apple
   3 banana
   3 orange

No comments: