Introduction
There are many occasions, usually daily, when I need to know a distribution of a field in a CSV. I can calculate this with many tools, such as Excel, R, or Python. One such way that you might not readily think of is doing this with Unix command line tools. This is useful because sometimes I don't need to write a script, I just want a quick and dirty way to look at the distribution. In this article, I'll show you how to do this.
Cut command
The first step in this task is to isolate the field we are interested in. Let's look at an example CSV file, with three fields:
$ cat example.csv
ID,fruit,yes_no
1,apple,yes
2,apple,yes
3,banana,yes
4,apple,yes
5,banana,no
6,orange,no
7,apple,no
8,orange,no
9,apple,no
10,apple,no
11,apple,yes
12,apple,no
13,banana,no
14,orange,no
15,apple,no
The field I'm interested in examining is the fruit field. Let's first isolate that with the cut command. The -f option is the field number, and in this case we use 2 for the second (1-based indexing of the columns). The -d option is the delimiter, and since this is a CSV file, we use ','.
$ cut -f 2 -d ',' example.csv
fruit
apple
apple
banana
apple
banana
orange
apple
orange
apple
apple
apple
apple
banana
orange
apple
Tail command
This has one problem: we don't need the column header, so let's skip that with the tail command. Tail gets us the end of the file, and we want to start on line 2, so we pass in -n +2. Then we pass those results from tail to cut using the pipe ('|'). Now we have the values in the fruit column, but without the fruit header.
$ tail -n +2 example.csv | cut -f 2 -d ','
apple
apple
banana
apple
banana
orange
apple
orange
apple
apple
apple
apple
banana
orange
apple
Sort command
Now let's sort the values with the sort command:
$ tail -n +2 example.csv | cut -f 2 -d ',' | sort
apple
apple
apple
apple
apple
apple
apple
apple
apple
banana
banana
banana
orange
orange
orange
Uniq command
Then we can use the uniq command with the -c option to count the unique instances:
$ tail -n +2 example.csv | cut -f 2 -d ',' | sort | uniq -c
9 apple
3 banana
3 orange