I often find myself grep-ing for information in system or application log files. And often, by chaining pipes, I end up with a stream of values that is difficult to interpret.
In this post I'll show you a quick-and-dirty but handy solution to get basic statistical quantities from a UNIX text stream of values.
Here is the bash function, built on awk, to put in your .bashrc:
stats () { # usage: ... | stats [--no-header]   e.g. ... | stats --no-header | awk '{print $3}' keeps only the mean
    [ "$1" = "--no-header" ] || printf "%-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s\n" \
        1-SUM 2-COUNT 3-MEAN 4-STD_DEV 5-MIN 6-TP01 7-TP10 8-MEDIAN 9-TP90 10-TP99 11-MAX
    sort -n | awk 'BEGIN{n=0;sum=0;mean=0;M2=0}
    # skip lines starting with "#"; store the sorted values and update the running mean and M2
    /^[^#]/{a[n++]=$1;sum+=$1;delta=$1-mean;mean+=delta/n;M2+=delta*($1-mean)}
    # value found at the given ratio of the sorted array, clamped to the first element
    function tp(ratio){i=n*ratio-1;if(i<0){return a[0];}else{return a[int(i)];}}
    END{if(n==0){exit;}  # nothing to summarize
    unbiased_variance=(n>1)?M2/(n-1):0;  # avoid dividing by zero when there is a single value
    std_dev=sqrt(unbiased_variance);
    if((n%2)==1){median=a[int(n/2)];}else{median=(a[n/2]+a[n/2-1])/2;}
    printf "%-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s\n",
    sum,n,mean,std_dev,a[0],tp(.01),tp(.1),median,tp(.9),tp(.99),a[n-1]}'
}
Now, a very basic example:
# seq 1 10 | stats
1-SUM 2-COUNT 3-MEAN 4-STD_DEV 5-MIN 6-TP01 7-TP10 8-MEDIAN 9-TP90 10-TP99 11-MAX
55 10 5.5 3.02765 1 1 1 5.5 9 9 10
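The TP columns come from a plain index lookup into the sorted values, which is why TP90 and TP99 both show 9 for ten inputs. Here is a minimal sketch of that lookup as a standalone function. The name `percentile` and its single ratio argument are hypothetical, introduced only for illustration; the index rule is the same one `tp()` uses.

```shell
# Hypothetical helper (not part of stats): percentile RATIO, reads numbers on stdin.
percentile () {
    sort -n | awk -v ratio="$1" '
        { a[n++] = $1 }              # values arrive already sorted
        END {
            if (n == 0) exit 1       # no input, nothing to report
            i = n * ratio - 1        # same index rule as tp() in stats
            if (i < 0) i = 0         # clamp to the first element
            print a[int(i)]          # truncate to an array index
        }'
}
# usage: seq 1 10 | percentile 0.9
```

Because the index is truncated rather than interpolated, small inputs give coarse percentiles; that is acceptable for a quick look at a log stream.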
Then, what about the min & max hop latency (in ms) on the route to google.com?
# traceroute google.com | sed '1d;/\*/d' | awk '{print $(NF-1)}' | stats
1-SUM 2-COUNT 3-MEAN 4-STD_DEV 5-MIN 6-TP01 7-TP10 8-MEDIAN 9-TP90 10-TP99 11-MAX
377.292 9 41.9213 14.8838 3.101 3.101 3.101 46.556 48.472 48.472 52.235
Finally, let's look at the sizes (in bytes) of all the files in /var/log:
# stat -c '%s' /var/log/* | stats --no-header
8129626 119 68316.2 201958 0 0 218 5140 168528 704151 1693462
Of course, this little function is not meant to be used in production, where the Python pandas library would do a far better job.
Inspiration taken from a StackOverflow answer by Bruce Ediger and from the online algorithm for computing variance described on Wikipedia.
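For reference, the online variance update can be isolated in a few lines of awk. This is only a sketch: `online_var` is a hypothetical name, and the fixed four-decimal output format is an arbitrary choice. It keeps a running mean and a running sum of squared deviations (M2), so the values never need to be stored.

```shell
# Hypothetical helper (not part of stats): prints "mean variance" for numbers on stdin.
online_var () {
    awk '
        /^[^#]/ {
            n++
            delta = $1 - mean            # distance from the previous mean
            mean += delta / n            # update the running mean
            M2 += delta * ($1 - mean)    # accumulate squared deviations
        }
        END {
            if (n < 2) exit 1            # unbiased variance needs at least 2 values
            printf "%.4f %.4f\n", mean, M2 / (n - 1)
        }'
}
# usage: seq 1 10 | online_var
```

Unlike the percentiles and the median, this part needs neither sorting nor an array, so it works in a single pass over streams of any length.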
EDIT [22/09/2014]: I found two very useful existing commands that do similar things: ministat and tinystat, written in Go.
EDIT [8/12/2014]: Another great one: csvstat (install it with pip install csvkit).