# Quick stats on a stream of values in the console

I often find myself `grep`-ing for information in system or application log files. By chaining pipes, I frequently end up with a stream of values that is hard to interpret at a glance.

In this post I'll show you a quick-and-dirty but handy solution to get basic statistical quantities from a UNIX text stream of values.

Here is the `bash` function, built on `awk`, to put in your `.bashrc`:

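``````
stats () { # usage: ... | stats [--no-header], e.g. stats --no-header | awk '{print $3}'
    # Print the column names unless --no-header is given
    [ "$1" = "--no-header" ] || printf "%-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s\n" \
        1-SUM 2-COUNT 3-MEAN 4-STD_DEV 5-MIN 6-TP01 7-TP10 8-MEDIAN 9-TP90 10-TP99 11-MAX
    # Sort numerically so that a[] is ordered; empty lines and lines starting with # are ignored
    sort -n | awk 'BEGIN{n=0;sum=0;mean=0;M2=0}
        # Single-pass (online) update of the mean and of the sum of squared deviations M2
        /^[^#]/{a[n++]=$1;sum+=$1;delta=$1-mean;mean+=delta/n;M2+=delta*($1-mean)}
        # Value found at a given ratio of the sorted array (e.g. .9 for TP90)
        function tp(ratio){i=n*ratio-1;if(i<0){return a[0];}else{return a[int(i)];}}
        END{if(n==0){exit 1}  # no value read: nothing to print
            unbiased_variance=(n>1)?M2/(n-1):0;  # avoid dividing by zero on a single value
            std_dev=sqrt(unbiased_variance);
            if((n%2)==1){median=a[int(n/2)];}
            else{median=(a[n/2]+a[n/2-1])/2;}
            printf "%-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s\n",
                sum,n,mean,std_dev,a[0],tp(.01),tp(.1),median,tp(.9),tp(.99),a[n-1]}'
}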
``````

Now, a very basic example:

``````
# seq 1 10 | stats
1-SUM      2-COUNT    3-MEAN     4-STD_DEV  5-MIN      6-TP01     7-TP10     8-MEDIAN   9-TP90     10-TP99    11-MAX
55         10         5.5        3.02765    1          1          1          5.5        9          9          10
``````
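It may look odd that TP90 and TP99 are both 9 here. This follows from the index rule used by `tp()`: on the sorted, zero-indexed array it picks the element at `int(n*ratio-1)`, so with only 10 values both ratios land on the ninth element. The sketch below isolates that rule in a stand-alone helper. The name `tp_of` is made up for illustration and is not part of the function above.

```shell
# Hypothetical helper (not part of stats): prints the value that the
# tp() index rule selects for a given ratio on a numeric stream.
tp_of () {
    sort -n | awk -v ratio="$1" '
        BEGIN { n = 0 }
        { a[n++] = $1 }                  # a[0..n-1] holds the sorted values
        END {
            if (n == 0) exit 1           # nothing to report on empty input
            i = n * ratio - 1            # same index rule as tp()
            print (i < 0 ? a[0] : a[int(i)])
        }'
}

seq 1 10  | tp_of 0.9    # prints 9
seq 1 10  | tp_of 0.99   # prints 9
seq 1 100 | tp_of 0.99   # prints 99
```

With more samples the percentiles spread out as expected, which is why they are mostly useful on streams of a few hundred values or more.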

Then, what about the min & max round-trip time (in ms) of the hops on the way to google.com?

``````
# traceroute google.com | sed '1d;/\*/d' | awk '{print $(NF-1)}' | stats
1-SUM      2-COUNT    3-MEAN     4-STD_DEV  5-MIN      6-TP01     7-TP10     8-MEDIAN   9-TP90     10-TP99    11-MAX
377.292    9          41.9213    14.8838    3.101      3.101      3.101      46.556     48.472     48.472     52.235
``````

Finally, let's look at the sizes (in bytes) of all the files in /var/log:

``````
# stat -c '%s' /var/log/* | stats --no-header
8129626    119        68316.2    201958     0          0          218        5140       168528     704151     1693462
``````

Of course, this little function is not meant to be used in production, where the Python pandas library would do a far better job.

Inspiration taken from a StackOverflow answer by Bruce Ediger and from the online algorithm for computing variance described on Wikipedia.
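For reference, the single-pass update borrowed from that algorithm can be isolated as follows. This is only a minimal sketch: the helper name `online_var` is hypothetical, and it prints the mean and the unbiased (sample) variance rather than the full set of columns.

```shell
# Hypothetical helper (name is illustrative): single-pass mean and
# sample variance of a numeric stream, one value per line.
online_var () {
    awk 'BEGIN { n = 0; mean = 0; M2 = 0 }
        {
            n++
            delta = $1 - mean            # distance to the previous mean
            mean += delta / n            # updated running mean
            M2 += delta * ($1 - mean)    # running sum of squared deviations
        }
        END {
            if (n < 2) exit 1            # sample variance needs 2+ values
            printf "%.4f %.4f\n", mean, M2 / (n - 1)
        }'
}

seq 1 10 | online_var   # prints 5.5000 9.1667
```

The square root of 9.1667 is about 3.0277, which matches the 4-STD_DEV column of the `seq 1 10` example. The point of this formulation is that no second pass over the data is needed to get the variance.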

EDIT [22/09/2014] : I found two very useful existing commands that do similar things: `ministat` and `tinystat` written in Go.

EDIT [8/12/2014] : Another great one : `csvstat` (install it with `pip install csvkit`).