Saturday, February 9, 2019

Quick and dirty histogram with Unix commands: cut, sort, uniq

Introduction

There are many occasions, usually daily, when I need to know the distribution of a field in a CSV file. I can calculate this with many tools, such as Excel, R, or Python. One approach you might not readily think of is using Unix command-line tools. This is useful when I don't need to write a script and just want a quick and dirty way to look at the distribution. In this article, I'll show you how to do this.

Cut command

The first step in this task is to isolate the field we are interested in. Let's look at an example CSV file, with three fields:


$ cat example.csv
ID,fruit,yes_no
1,apple,yes
2,apple,yes
3,banana,yes
4,apple,yes
5,banana,no
6,orange,no
7,apple,no
8,orange,no
9,apple,no
10,apple,no
11,apple,yes
12,apple,no
13,banana,no
14,orange,no
15,apple,no

The field I'm interested in examining is the fruit field. Let's first isolate that with the cut command. The -f option is the field number; in this case we use 2 for the second column (columns are 1-based). The -d option is the delimiter, and since this is a CSV file, we use ','.


$ cut -f 2 -d ',' example.csv 
fruit
apple
apple
banana
apple
banana
orange
apple
orange
apple
apple
apple
apple
banana
orange
apple

Tail command 

This has one problem: we don't need the column header, so let's skip that with the tail command. Tail gets us the end of the file, and we want to start on line 2, so we pass in -n +2. Then we pass those results from tail to cut using the pipe ('|'). Now we have the values in the fruit column, but without the fruit header.


$ tail -n +2 example.csv | cut -f 2 -d ','
apple
apple
banana
apple
banana
orange
apple
orange
apple
apple
apple
apple
banana
orange
apple

Sort command

Now let's sort the values with the sort command:


$ tail -n +2 example.csv | cut -f 2 -d ',' | sort
apple
apple
apple
apple
apple
apple
apple
apple
apple
banana
banana
banana
orange
orange
orange

Uniq command

Then we can use the uniq command with the -c option to count the unique instances:


$ tail -n +2 example.csv | cut -f 2 -d ',' | sort | uniq -c
   9 apple
   3 banana
   3 orange

Saturday, February 2, 2019

Using cut to bin data in pandas dataframe


In this tutorial, we will explore how to bin data in a pandas DataFrame using the cut function.


What is binning?


Binning is a way to group data into smaller containers called bins. For example, in surveys, an age question might collect responses in ranges rather than exact ages. An example of age bins might be: 0-24, 25-34, 35-49, 50-69, 70+.

Tutorial


Let's see how we can bin data using the pandas cut function.

First we import the pandas and numpy libraries. We'll use numpy to generate some sample data.
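The import cell is missing from this archived copy of the post; it would have looked something like this:

```python
import numpy as np   # for generating the sample data
import pandas as pd  # for the DataFrame and the cut function
```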


Now let's generate 1000 random samples using the random.normal function. This will be the data that we plan to bin. In the normal function, we pass in three arguments:

  • loc - the mean of the normal distribution
  • scale - standard deviation
  • size - number of samples in the returned numpy array

In our case, we want 1000 samples, but we'll only print out the first 10 samples as a sanity check of the data.
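The code cell is missing here; a sketch of the described step follows. The mean, standard deviation, and seed are my assumptions, since the original values aren't preserved in this copy:

```python
import numpy as np

np.random.seed(0)  # seed for reproducibility (assumed; the original may not have seeded)
# loc = mean, scale = standard deviation, size = number of samples
samples = np.random.normal(loc=100, scale=15, size=1000)
print(samples[:10])  # sanity-check the first 10 values
```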


Next we generate some fake labels to pretend we have a binary classification. We will not be binning this data, but it is just for example. Again, we'll print out the first 10 samples only to sanity check it.
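The labels cell is also missing; generating fake 0/1 labels might have looked like this (values and seed are illustrative):

```python
import numpy as np

np.random.seed(1)
# fake binary labels (0 or 1) to mimic a classification -- just for example
labels = np.random.randint(0, 2, size=1000)
print(labels[:10])
```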


Let's put these two arrays together to form a pandas DataFrame:
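A sketch of assembling the DataFrame (the column names are my guess at the original notebook's):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
samples = np.random.normal(loc=100, scale=15, size=1000)  # assumed parameters
labels = np.random.randint(0, 2, size=1000)

df = pd.DataFrame({"samples": samples, "labels": labels})
print(df.head())
```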


To create our bins, we use the linspace function in numpy to generate an array of evenly spaced numbers. The arguments for linspace are:

  • start - first number in our array
  • stop - last number in our array
  • num - how many numbers in our array
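The linspace cell is missing; a sketch with assumed edge values (40 to 160 covers the assumed distribution's mean plus or minus four standard deviations; 11 edges gives 10 bins):

```python
import numpy as np

# 11 evenly spaced edges from 40 to 160 -> 10 bins (edge values are assumptions)
hist_bins = np.linspace(start=40, stop=160, num=11)
print(hist_bins)
```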


Now we can indicate which sample is in which bin with the cut function in pandas. We will bin the samples column, using the hist_bins values as the actual bin boundaries. Note that I am also passing in right=False to indicate that the right boundary of each bin is open. If you want the right side closed and the left side open instead, pass in right=True (the default).
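The cut cell itself is missing; a minimal sketch, assuming the column names and bin edges used above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({"samples": np.random.normal(100, 15, 1000)})  # assumed data
hist_bins = np.linspace(40, 160, 11)  # assumed bin edges

# right=False: each bin is closed on the left and open on the right, e.g. [40, 52)
df["bins"] = pd.cut(df["samples"], bins=hist_bins, right=False)
print(df.head())
```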


Where binning becomes useful is when we want to apply some operation on it. For our example, let's do a groupby operation on the bins and then aggregate the labels data by performing a count operation on it. Then we can do a cumsum (cumulative sum) on the labels count.
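The groupby/count/cumsum cell is missing; the described steps can be sketched like this (data parameters and column names are assumptions carried over from the earlier steps):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    "samples": np.random.normal(100, 15, 1000),  # assumed parameters
    "labels": np.random.randint(0, 2, 1000),
})
df["bins"] = pd.cut(df["samples"], bins=np.linspace(40, 160, 11), right=False)

# count the labels in each bin, then take the running (cumulative) total
counts = df.groupby("bins", observed=False)["labels"].count()
print(counts.cumsum())
```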

This is all a contrived example, of course, to give you ideas for your specific use case.

Summary


In this article, we learned how to bin our data using the pandas cut function, so that we could later perform some aggregate operations on the data. I hope you found this tutorial useful.

Friday, February 1, 2019

Applying functions to a groupby of a pandas dataframe

In this article, I'm giving a few examples on applying functions as part of the "groupby" in pandas.
To start, think of the groupby function as the equivalent of a pivot table in Excel. In Excel, a pivot table allows you to summarize data, usually with a COUNT, SUM, or MEAN function.
In pandas, the groupby function also allows you to perform some sort of aggregation function on segments or groups in your data.
So, let's first begin by importing the usual necessary libraries:
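The import cell isn't preserved in this copy; the "usual necessary libraries" for this kind of notebook would be:

```python
import numpy as np
import pandas as pd
```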
Next, for demonstration purposes, we will import a dataset about cars. The mpg dataset is a well-known dataset used for exploration and learning purposes. We can find a description of the dataset at the UCI Machine Learning repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg. Note that we can download the set from the UCI site, but the dataset does not include headers. We would have to utilize a separate file that has the column names.
So, for ease, we will import this dataset from the seaborn library. You can browse the datasets included in the seaborn package here: https://github.com/mwaskom/seaborn-data. But since we don't actually need the seaborn library in this notebook, we will import directly from their github repo's data folder.
Let's take a peek at the first few rows to get a feel for what this data set looks like.
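The loading cell is missing; reading the CSV directly from the seaborn-data repo might look like this (the raw-file URL is my reconstruction of what the notebook likely used):

```python
import pandas as pd

# load mpg.csv straight from the seaborn-data GitHub repo's data folder
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
mpg = pd.read_csv(url)
print(mpg.head())  # peek at the first few rows
```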
The model_year is a numeric field, but could also be used as a categoric field. Let's see what the distribution of value counts look like:
Likewise, we see that the cylinders field is also a discrete numeric field. Let's check that field:
The origin field is our last categoric field:
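The value-counts cells aren't shown; checking each of the three discrete/categoric fields would look like this (assuming the mpg DataFrame loaded from the seaborn-data repo):

```python
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
mpg = pd.read_csv(url)

# distribution of value counts for each discrete / categoric field
print(mpg["model_year"].value_counts())
print(mpg["cylinders"].value_counts())
print(mpg["origin"].value_counts())
```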
For our first groupby example, we group by origin, model_year, and cylinders, and summarize the weight of each group with the mean function.
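The groupby cell isn't preserved; a sketch of the described aggregation (using column names in a list rather than the Series form shown later in the post, which is equivalent):

```python
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
mpg = pd.read_csv(url)

# mean weight per (origin, model_year, cylinders) group
mean_weight = mpg.groupby(["origin", "model_year", "cylinders"])["weight"].mean()
print(mean_weight.head())
```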


Sometimes we need to apply more than one function to the groups. This can be accomplished with the agg function.
This example looks complicated, so let's break it down.
First we specify the columns in the groupby function, as we did in the last example: mpg['origin'], mpg['model_year'], mpg['cylinders'].
Then in the agg function, we pass in a dictionary with the field names as the keys, and the functions that we will apply as the values.
Finally, for clarity, we use the rename function to give meaningful names to our aggregated columns. Since we have performed different operations on the various columns, we append the function name to the column name. This is just a convention I like to use, so feel free to omit or modify as you like.
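The agg cell is missing from this copy; which functions were applied to which fields is my assumption, but the shape of the code and the renaming convention (appending the function name to the column name) follow the description:

```python
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
mpg = pd.read_csv(url)

summary = (
    mpg.groupby(["origin", "model_year", "cylinders"])
       # dictionary: field names as keys, functions to apply as values
       .agg({"weight": "mean", "mpg": "max"})
       # rename with the function name appended, for clarity
       .rename(columns={"weight": "weight_mean", "mpg": "mpg_max"})
)
print(summary.head())
```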


On some occasions, we want to use our own custom function, and the apply function allows us to do just that.
First, let's create a custom function, called reverse_cumsum. This is a rather contrived example, so just go along with it for the example.
Next I'm going to subset the dataset to only 4 cylinder cars from USA.
Now we can apply the custom reverse_cumsum function to our dataset with the apply function. Note that I'm applying this function to the mpg and weight fields in the dataset.
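The cells defining and applying reverse_cumsum are missing; this is a sketch of what the post describes (the exact body of the original function is my reconstruction of a "reverse cumulative sum"):

```python
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
mpg = pd.read_csv(url)

def reverse_cumsum(col):
    # cumulative sum running from the bottom row up (contrived, as the post says)
    return col[::-1].cumsum()[::-1]

# subset to only 4-cylinder cars from the USA
subset = mpg[(mpg["cylinders"] == 4) & (mpg["origin"] == "usa")]

# apply the custom function to the mpg and weight fields
result = subset[["mpg", "weight"]].apply(reverse_cumsum)
print(result.head())
```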


In this article, we demonstrated:
  1. how to use the groupby function to apply standard functions,
  2. how to use the agg function to apply multiple functions to different fields, and
  3. how to use the apply function to apply a custom function.
You can also take a look at the notebook.

Monday, February 5, 2018

Updating Mac OS X messed up my git ssh settings

I recently updated to Mac OS High Sierra (10.13). Yeah, I know, it's been out since last year. But I finally got around to updating my laptop.

And then when I tried to git push, I got an error:


git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
This was weird, because I was able to create the remote repo. I just couldn't push to it.

I investigated my ssh key situation, and it appears the Mac OS upgrade removed my GitHub key from the ssh agent. Luckily, I still had the key in my ~/.ssh directory and was able to rectify the situation with:

ssh-add ~/.ssh/github_user-GitHub
Then all was good again. Whew!

Sunday, January 16, 2011

Beta testing with Android on AT&T? Not!

HTC Aria Android Phone (AT&T) 
I recently was asked to do some beta testing of an Android application. I had never done mobile testing before, so I eagerly accepted the assignment. However I did not know that AT&T had blocked loading applications that were not from the Android Marketplace onto my HTC Aria (affiliate link). This article from Computerworld describes this in more detail. After two hours of trying to load the beta application (which was an APK file) and researching why I couldn’t, I concluded that the only way I could do this would be to root my phone. Since this voids the warranty, and I’ve only had the phone for about 6 months, I decided against this. So, sadly, I had to decline the beta testing.

Other than this problem, I’ve been pretty happy with my HTC Aria. I like the integration with my Google mail and calendar, which is how I manage my day-to-day life. I’m just disappointed to not get any mobile testing experience with it.

Sunday, December 26, 2010

Review of the Kindle 3G + Wi-Fi Wireless Reading Device

Kindle 3G Wireless Reading Device, Free 3G + Wi-Fi, 3G Works Globally, Graphite, 6" Display with New E Ink Pearl Technology

I’ve always favored reading actual books versus reading books on a computer. I have resisted e-readers in general, because I don’t buy very many books, but rather trade them, usually on paperbackswap.com. But in my current quest to reduce clutter, I thought I would try reading books in Kindle format.


Kindle software for non-Kindle devices

To test if I would even like reading digital books, I first installed the Kindle readers on my Mac and on my Android phone. I bought one book via the web, and tested out reading on each platform.

Despite the small size, reading Kindle on the Android is not that bad for short amounts of time. If I have a few minutes to wait for something, I’ll often take my phone out and read a few pages from my phone. It was very convenient for me to capture small slices of time I would normally be doing nothing.

When you close the book, Kindle will save your place to synchronize with other devices with the same book. Amazon calls this WhisperSync. This feature is really nice, in that, later in the day, I can read the same book on a different device, and it will synchronize and ask if I want to continue in the new location.

The actual Kindle reader

Being thrilled with this new way of reading books, I decided to bite the bullet and buy the latest Kindle. I went with the 3G + Wifi version of the Kindle (affiliate link). For $50 more than the WiFi-only version, I get free 3G access anywhere and anytime. I thought that this was a good bargain.

When the Kindle arrived, I was amazed at the e-Ink technology. It doesn’t really look like the device is on because the display does not have any glare or glow. The screen really does resemble a printed page more than a computer display. Since there is no backlight or glow, you do need to have a light source to read the Kindle.

The page turning on the Kindle reader seemed a little slow, and the placement of the buttons is a little awkward at first. In the beginning, I would accidentally turn the page while just shifting the reader in my hands. I’m getting better at holding the Kindle so as not to accidentally turn the page. And I’ve ordered a book-like cover for it, which I imagine will help with this awkwardness.

After downloading a few more books (mostly free), I noticed that I could categorize my books into folders within the Kindle. After I get more books loaded onto it, I think this will be more helpful in locating a particular book. It would be nice, however, to create sub-folders, but I can live with only one level of folders for now.

The web browser

I checked out the Kindle’s experimental web browser. It is also a bit awkward at first to move the cursor with the arrow keys. I was able to view Google, as well as my Google Mail account, though. I think if you needed to check your mail quickly without having to wait to boot up a computer, this method would suffice quite well.

Importing PDFs

Recently, I’ve purchased a few e-books that are only available as PDFs. The Kindle uses its own format, though: books you purchase from the Kindle Store are in AZW format. The Kindle can also read MOBI files.

One way to convert PDF to MOBI is to use Amazon’s conversion service. You email the PDF document to a Kindle email account you set up on Amazon.com. I’ve tried this out on a few documents myself with mixed results. Usually the PDFs that I have downloaded are formatted with text and graphics meant to be read in landscape mode. The books that I converted and sent to my Kindle lost the formatting (of course) with some books even containing blank pages. But the content was still there, and for me, this was acceptable. Of the 4 PDF books I tried to convert, I had one book that did not convert well. There were sentences or paragraphs missing. This is not acceptable to me.

The conversion method I have yet to try is outlined by Benny Lewis in his article How to convert PDFs/daily news/anything to ePUB/mobi for your eReader/Kindle.

Summary

Overall, I’ve been thrilled with the Kindle. I find that I am actually reading a lot more now, and am reading several books at a time. It is nice to flip back and forth between various books to suit my mood. And it is great to be able to save a few trees. Even though I’m a bit late to the party, I’m now finding that reading regular books is so last century and the Kindle is the way to go.

Friday, December 17, 2010

How to configure your blogspot blog to use a custom domain from Namecheap

I recently noticed that the geekythoughtbubbles.com domain name was available. It was a similar tech blog to this. So, I decided to buy the domain with Namecheap (affiliate link).

Reading the Blogger documentation, there are no specific instructions for configuring your Namecheap domain so that Blogger can use your new custom domain. Here is how to do it.

  1. Log into your Namecheap account.
  2. Go to My Account > Manage Domains
  3. Select your new domain.
  4. Under Host Management, select URL Forwarding
  5. Change the row with www by changing the text field to contain ghs.google.com (leave Record Type as CNAME)
  6. Click Save Changes
  7. Follow the rest of the instructions under Update Your Blogger Settings: How do I use a custom domain name on my blog?

Tuesday, December 7, 2010

Enabling virtualization support for Dell Latitude E6400

At work, I have a Dell Latitude E6400 laptop running Windows XP (32-bit). In order to run 64-bit VirtualBox machines, I needed to enable virtualization support. Here’s how to do this:

1. When booting up, press F12
2. Select BIOS setup
3. In the Virtualization Support tree:
    - Virtualization, select Enable, Apply
    - VT for Direct I/O Access, select Enable, Apply
    - Trusted Execution, do not select Enable
4. Select Exit to continue booting into Windows.

Friday, December 3, 2010

How to share a folder from host to guest in VirtualBox

This is documented quite well in the VirtualBox documentation. But I’ve made my own notes here since it’s a task I do often enough, but sometimes still forget.


  1. With the guest machine turned off, open the Settings dialog.
  2. Go to the Shared Folders tab.
  3. Click on the Add Shared Folders button.
  4. Browse for the directory you want to share, and give it a name. In my case, I called the share work and I selected c:\work on my host’s filesystem. Click OK.
  5. Restart your guest machine.
  6. Now you need to mount this shared folder in the guest OS.


In Windows, the share is always named \\vboxsrv\sharename. You can navigate to this via My Network Places. You could also use the net command:

net use q: \\vboxsrv\sharename

In Linux, you use the mount command:

mount -t vboxsf sharename mountpoint

e.g.,

mount -t vboxsf work /mnt/work

Friday, February 12, 2010

How to install NumPy on Solaris

I've been working with NumPy (the Numerical Python module) on Solaris, and now have a set of steps to install it and get it working. Note that this is the very basic installation of NumPy and doesn't include additional libraries like BLAS or LAPACK. You can read more about installing those at the NumPy building instructions page. These instructions are to help you get all of the prerequisites installed and a basic build of NumPy going.

First you will need to install a Fortran compiler. You can download the Sun Studio 12u1 and put it in /tmp. Then:

cd /tmp
bzcat SunStudio12u1-SunOS-x86-tar-ML.tar.bz2 | /bin/tar -xf -
cd SunStudio12u1-SunOS-x86-tar-ML
./SunStudio12u1-SunOS-x86-tar.sh --accept-license
mv sunstudio12.1/ /opt
export PATH=/usr/ccs/bin:/opt/sunstudio12.1/bin:$PATH

From sunfreeware.com, download and install python-2.6.2-sol10-x86-local.gz to /tmp as well. Unpack and install:

gunzip python-2.6.2-sol10-x86-local.gz
pkgadd -d python-2.6.2-sol10-x86-local

Also, similarly download and install libiconv-1.13.1-sol10-x86-local.gz, gcc-3.4.6-sol10-x86-local.gz, libgcc-3.4.6, and openssl-0.9.8l. When you get to installing libgcc, you will get warnings about overwriting existing libraries. This is okay; just say yes to overwrite them.

Add the following to /root/.bashrc as well as your current shell:

export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

Then with these new settings:

cd /tmp
svn co http://svn.scipy.org/svn/numpy/trunk numpy # (if svn is not installed, then svn co on another machine and scp -r to copy it to /tmp)
cd numpy
python setup.py build --fcompiler=sun
python setup.py install

To test the installation, create a file called /tmp/test.py with:

import numpy

Then test that it works. You should get no import errors:

python test.py

Monday, January 11, 2010

I've switched to iTerm

I was using GNU screen for a while. I liked having only one window open for multiple sessions in Mac OS X. However, setting up the configuration is not an easy feat. In recent months, however, I've switched to iTerm. I was against a tabbed command line environment because I didn't want to use the mouse to change tabs. But in iTerm you can cycle through the tabs with Command-arrow. You can get iTerm from their sourceforge page. I know that screen can do much more than iTerm, but I'm not really using those features, so for now iTerm will suffice.

Monday, January 4, 2010

Install NumPy for Python 2.6 on Snow Leopard (Mac OS X 10.6) from source

Here are the problems I was facing with using NumPy with Python 2.6.1 on Mac OS X 10.6.2 (Snow Leopard). This might be a bit convoluted, but bear with me.

First, in order to call STAF from a Python script, I built PYSTAF.so from source. However, the only way to get it to work is to run Python in 32-bit mode:

export VERSIONER_PYTHON_PREFER_32_BIT=yes

This then caused problems with NumPy:

Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/numpy-1.4.0.dev7542-py2.6-macosx-10.6-universal.egg/numpy/__init__.py", line 130, in <module>
import add_newdocs
File "/Library/Python/2.6/site-packages/numpy-1.4.0.dev7542-py2.6-macosx-10.6-universal.egg/numpy/add_newdocs.py", line 9, in <module>
from lib import add_newdoc
File "/Library/Python/2.6/site-packages/numpy-1.4.0.dev7542-py2.6-macosx-10.6-universal.egg/numpy/lib/__init__.py", line 4, in <module>
from type_check import *
File "/Library/Python/2.6/site-packages/numpy-1.4.0.dev7542-py2.6-macosx-10.6-universal.egg/numpy/lib/type_check.py", line 8, in <module>
File "/Library/Python/2.6/site-packages/numpy-1.4.0.dev7542-py2.6-macosx-10.6-universal.egg/numpy/core/__init__.py", line 8, in <module>
import numerictypes as nt
File "/Library/Python/2.6/site-packages/numpy-1.4.0.dev7542-py2.6-macosx-10.6-universal.egg/numpy/core/numerictypes.py", line 737, in <module>
_typestr[key] = empty((1,),key).dtype.str[1:]
ValueError: array is too big.

According to Numpy Ticket #1221 (http://projects.scipy.org/numpy/ticket/1221) this is fixed in revision 7793. However, I was trying to use r7542 as was installed by the SciPy SuperPack installer (http://macinscience.org/?page_id=6).

Still following along? Okay, so here's how I think I resolved my issues. I built and installed NumPy using the instructions on HyperJeff's blog: http://blog.hyperjeff.net/?p=160

First we need to move old versions of Numpy. In my case:

sudo mv /System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/numpy \
/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/numpy_APPLE_DEFAULT
sudo mv /Library/Python/2.6/site-packages/numpy-1.4.0.dev7542-py2.6-macosx-10.6-universal.egg \
/Library/Python/2.6/site-packages/numpy-1.4.0.dev7542-py2.6-macosx-10.6-universal.egg-OLD

Then set the environment, build and install.

export MACOSX_DEPLOYMENT_TARGET=10.6
export CFLAGS="-arch i386 -arch x86_64"
export FFLAGS="-arch i386 -arch x86_64"
export LDFLAGS="-Wall -undefined dynamic_lookup -bundle -arch i386 -arch x86_64"
cd ~/tmp
svn co http://svn.scipy.org/svn/numpy/trunk numpy
cd numpy
python setup.py build --fcompiler=gnu95
sudo python setup.py install

If this doesn't work for you, you might want to run through all of the steps in HyperJeff's page above. I probably had much of the prerequisites from previously running the SciPy SuperPack installer.

Wednesday, October 14, 2009

Simple flashcard perl script

I think my friend Dan gave me a perl script a long time ago for Spanish flashcards. I have no idea where that is nowadays, so I decided to whip up a simple script. I think it is way simpler than Dan's version. There is no "grading" in this script; that is, there is no comparison between the answer you provide and the answer in the file. All you need to do to use it is create a flashcard file. Each row contains the question and answer separated by a pipe.
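The script itself isn't included in this copy of the post. A minimal sketch of the described behavior (written here in Python rather than perl, purely for illustration; the function names and prompts are my own) might look like:

```python
import random

def load_flashcards(text):
    # each row: question and answer separated by a pipe
    cards = []
    for line in text.splitlines():
        if "|" in line:
            question, answer = line.split("|", 1)
            cards.append((question.strip(), answer.strip()))
    return cards

def quiz(cards):
    # no grading: just show the question, wait, then show the answer
    random.shuffle(cards)
    for question, answer in cards:
        input("Q: %s  (press Enter for the answer) " % question)
        print("A: %s\n" % answer)

# quiz() is interactive, so here we only demonstrate the parser
cards = load_flashcards("hola|hello\nadios|goodbye")
print(cards)
```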