Geeky Thought Bubbles: Quick and dirty histogram with Unix commands: cut, sort, uniq

Introduction

There are many occasions, usually daily, when I need to know a distribution of a field in a CSV. I can calculate this with many tools, such as Excel, R, or Python. One such way that you might not readily think of is doing this with Unix command line tools. This is useful because sometimes I don't need to write a script, I just want a quick and dirty way to look at the distribution. In this article, I'll show you how to do this.

Cut command

The first step in this task is to isolate the field we are interested in. Let's look at an example CSV file, with three fields:

$ cat example.csv

ID,fruit,yes_no

1,apple,yes

2,apple,yes

3,banana,yes

4,apple,yes

5,banana,no

6,orange,no

7,apple,no

8,orange,no

9,apple,no

10,apple,no

11,apple,yes

12,apple,no

13,banana,no

14,orange,no

15,apple,no

The field I'm interested in examining is the fruit field. Let's first isolate that with the cut command. The -f option is the field number, and in this case we use 2 for the second (1-based indexing of the columns). The -d option is the delimiter, and since this is a CSV file, we use ','.

$ cut -f 2 -d ',' example.csv

fruit

apple

banana

apple

banana

orange

apple

orange

apple

banana

orange

apple

Tail command

This has one problem: we don't need the column header, so let's skip that with the tail command. Tail gets us the end of the file, and we want to start on line 2, so we pass in -n +2. Then we pass those results from tail to cut using the pipe ('|'). Now we have the values in the fruit column, but without the fruit header.

$ tail -n +2 example.csv | cut -f 2 -d ','

apple

banana

apple

banana

orange

apple

orange

apple

banana

orange

apple

Sort command

Now let's sort the values with the sort command:

$ tail -n +2 example.csv | cut -f 2 -d ',' | sort

apple

banana

orange

Uniq command

Then we can use the uniq command with the -c option to count the unique instances:

$ tail -n +2 example.csv | cut -f 2 -d ',' | sort | uniq -c

9 apple

3 banana

3 orange

Geeky Thought Bubbles

Saturday, February 9, 2019

Quick and dirty histogram with Unix commands: cut, sort, uniq

Introduction

Cut command

Tail command

Sort command

Uniq command

No comments:

Popular Posts