Saturday, January 14, 2017

Discrete histogram

A histogram is usually a graphical representation of the distribution of numerical data, which implies binning the values into regular intervals. What if you want the distribution over discrete elements instead, such as days of the week, months, or arbitrary words?

Here is an example.
  • How many coins are gold and how many are silver?
  • How many coins do we have from each country?
  • How many coins do we have from each year?
   gold     1    1986  USA                 American Eagle
   gold     1    1908  Austria-Hungary     Franz Josef 100 Korona
   silver  10    1981  USA                 ingot
   gold     1    1984  Switzerland         ingot
   gold     1    1979  RSA                 Krugerrand
   gold     0.5  1981  RSA                 Krugerrand
   gold     0.1  1986  PRC                 Panda
   silver   1    1986  USA                 Liberty dollar
   gold     0.25 1986  USA                 Liberty 5-dollar piece
   silver   0.5  1986  USA                 Liberty 50-cent piece
   silver   1    1987  USA                 Constitution dollar
   gold     0.25 1987  USA                 Constitution 5-dollar piece
   gold     1    1988  Canada              Maple Leaf

Here is a rather general solution. It relies on a little-known awk feature that turns out to be rather useful.

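#!/bin/awk -f
# Count how often each distinct value occurs in column "col" (default: 1).
BEGIN {
  if (!col) col = 1
}

{
  counts[$col]++   # occurrences of each distinct value
  total++          # number of records read
}

END {
  # Print the value, its count, and its percentage of all records.
  for (v in counts) {
    printf "%s %d %f\n", v, counts[v], 100 * counts[v] / total
  }
}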

Simply running the script over the file with the data (coins.txt) will calculate the distribution over the first column.

$ ./histogram-discrete.awk coins.txt
silver 4 30.769231
gold   9 69.230769

The output shows 4 silver and 9 gold entries in the file. The last column is the percentage of the total.

Now, for the trick (special thanks to my colleague Douglas Scofield for introducing me to it). If you add col=4 before the name of the input file, awk interprets it as a variable assignment rather than as a file name (a file that really is named col=4 would have to be given as ./col=4 ;-)). Here is the result, with nothing changed in the script.

$ ./histogram-discrete.awk col=4 coins.txt
Switzerland     1  7.692308
Canada          1  7.692308
Austria-Hungary 1  7.692308
PRC             1  7.692308
RSA             2 15.384615
USA             7 53.846154
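A detail worth knowing about this trick: an assignment given among the file names is carried out only when awk reaches it in the argument list, that is, after the BEGIN block has already run, whereas -v assigns before BEGIN. The short sketch below illustrates the difference. The variable name x and the use of /dev/null as an empty input are just choices made for this illustration.

```shell
# Assignment placed among the file names: not yet visible in BEGIN,
# but set by the time the (empty) input has been processed.
awk 'BEGIN { print "BEGIN: [" x "]" } END { print "END: [" x "]" }' x=5 /dev/null

# Assignment with -v: already visible in BEGIN.
awk -v x=5 'BEGIN { print "BEGIN: [" x "]" } END { print "END: [" x "]" }' /dev/null
```

The first command prints an empty value in BEGIN and 5 in END, while the second prints 5 in both. This is also why the default in the script's BEGIN block does not get in the way: col is first set to 1 and then overridden by col=4 before the first line of coins.txt is read.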

The documented way to set a variable before the program starts is -v col=4 (or --assign=col=4 in GNU awk), so keep this in mind. Notice that the output is in no particular order, because awk does not define the order in which for (v in counts) visits the array elements. Recent versions of GNU awk provide a way to force a particular order, but this is something I leave to you. You can always pipe the output through sort.
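As a minimal sketch of the sort approach, the commands below recreate the sample data and order the country histogram by count, largest first. The counting logic is repeated inline only so that the example is self-contained. The sort keys are one possible choice, not the only sensible one.

```shell
# Recreate the sample data shown earlier in this post.
cat > coins.txt <<'EOF'
gold     1    1986  USA                 American Eagle
gold     1    1908  Austria-Hungary     Franz Josef 100 Korona
silver  10    1981  USA                 ingot
gold     1    1984  Switzerland         ingot
gold     1    1979  RSA                 Krugerrand
gold     0.5  1981  RSA                 Krugerrand
gold     0.1  1986  PRC                 Panda
silver   1    1986  USA                 Liberty dollar
gold     0.25 1986  USA                 Liberty 5-dollar piece
silver   0.5  1986  USA                 Liberty 50-cent piece
silver   1    1987  USA                 Constitution dollar
gold     0.25 1987  USA                 Constitution 5-dollar piece
gold     1    1988  Canada              Maple Leaf
EOF

# Count the values in column 4, then sort by the count (field 2),
# numerically and in descending order; ties are broken by name (field 1).
awk -v col=4 '
  { counts[$col]++; total++ }
  END {
    for (v in counts)
      printf "%s %d %f\n", v, counts[v], 100 * counts[v] / total
  }' coins.txt |
  sort -k2,2nr -k1,1
```

With this data, USA (7) comes first, followed by RSA (2), and then the four countries with a single entry in alphabetical order.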


This solution and example are an edited copy from my awk workshop material online.
