To: markl@PTT.LCS.MIT.EDU
Subject: Re: interpacket arrival variance and mean
In-reply-to: Your message of Mon, 08 Jun 87 12:31:33 EDT.
Date: Mon, 15 Jun 87 06:08:01 PDT
From: Van Jacobson

Mark -

I'm working on a long paper about transport protocol timers (models,
estimation, implementations, common implementation mistakes, etc.).
This has me all primed on the subject, so attached is a long-winded
explanation and example C code for estimating the mean and variance
of a series of measurements.

Though I know your application isn't tcp, I'm going to use a new tcp
retransmit timer algorithm as an example.  The algorithm illustrates
the method and, based on 3 months of experiment, is much better than
the algorithm in rfc793 (though it isn't as good as the algorithm I
talked about at IETF and am currently debugging).

Theory
------
You're probably familiar with the rfc793 algorithm for estimating the
mean round trip time.  This is the simplest example of a class of
estimators called "recursive prediction error" or "stochastic
gradient" algorithms.  In the past 20 years these algorithms have
revolutionized estimation and control theory, so it's probably
worthwhile to look at the rfc793 estimator in some detail.

Given a new measurement, Meas, of the rtt, tcp updates an estimate of
the average rtt, Avg, by

        Avg <- (1 - g) Avg + g Meas

where g is a "gain" (0 < g < 1) that should be related to the
signal-to-noise ratio (or, equivalently, variance) of Meas.  This
makes a bit more sense (and computes faster) if we rearrange and
collect terms multiplied by g to get

        Avg <- Avg + g (Meas - Avg)

We can think of "Avg" as a prediction of the next measurement, so
"Meas - Avg" is the error in that prediction, and the expression
above says we make a new prediction based on the old prediction plus
some fraction of the prediction error.  The prediction error is the
sum of two components: (1) error due to "noise" in the measurement
(i.e., random, unpredictable effects like fluctuations in competing
traffic) and (2) error due to a bad choice of "Avg".  Calling the
random error RE and the predictor error PE,

        Avg <- Avg + g RE + g PE

The "g PE" term gives Avg a kick in the right direction while the
"g RE" term gives it a kick in a random direction.  Over a number of
samples, the random kicks cancel each other out, so this algorithm
tends to converge to the correct average.  But g represents a
compromise: we want a large g to get mileage out of the PE term but a
small g to minimize the damage from the RE term.  Since the PE terms
move Avg toward the real average no matter what value we use for g,
it's almost always better to use a gain that's too small rather than
one that's too large.  Typical gain choices are .1 - .2 (though it's
always a good idea to take a long look at your raw data before
picking a gain).

It's probably obvious that Avg will oscillate randomly around the
true average and that the standard deviation of Avg will be something
like g sdev(Meas).  Also, Avg converges to the true average
exponentially with time constant 1/g, so making g smaller gives a
stabler Avg at the expense of taking a much longer time to get to the
true average.

        [My paper makes the preceding hand-waving a bit more
        rigorous, but I assume no one cares about rigor.  If you do
        care, check out almost any text on digital filtering or
        estimation theory.]
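Just to make the update concrete, here is the averaging step written
out as C.  This is a minimal floating-point sketch for illustration
only (the function name is mine, not from any tcp); the integer form
actually used for the timer is worked out below.

        /*
         * One step of the rfc793-style estimator:
         *      Avg <- Avg + g (Meas - Avg)
         * g is the gain, 0 < g < 1 (typically .1 - .2).
         */
        double
        ewma_update(double avg, double meas, double g)
        {
                return avg + g * (meas - avg);
        }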
If we want some measure of the variation in Meas, say to compute a
good value for the tcp retransmit timer, there are lots of choices.
Statisticians love variance because it has some nice mathematical
properties.  But variance requires squaring (Meas - Avg), so an
estimator for it will contain two multiplies and a large chance of
integer overflow.  Also, most of the applications I can think of want
variation in the same units as Avg and Meas, so we'll be forced to
take the square root of the variance to use it (i.e., probably at
least a divide, multiply and two adds).

A variation measure that's easy to compute, and has a nice, intuitive
justification, is the mean prediction error or mean deviation.  This
is just the average of abs(Meas - Avg).  Intuitively, it is an
estimate of how badly we've blown our recent predictions (which seems
like just the thing we'd want to set up a retransmit timer).
Statistically, standard deviation (= sqrt variance) goes like
sqrt(sum((Meas - Avg)^2)) while mean deviation goes like
sum(sqrt((Meas - Avg)^2)).  Thus, by the triangle inequality, mean
deviation should be a more "conservative" estimate of variation.

If you really want standard deviation for some obscure reason,
there's usually a simple relation between mdev and sdev.  E.g., if
the prediction errors are normally distributed, sdev = sqrt(pi/2)
mdev.  For most common distributions the factor to go from mdev to
sdev is near one (sqrt(pi/2) ~= 1.25).  I.e., mdev is a good
approximation of sdev (and is much easier to compute).

Practice
--------
So let's do estimators for Avg and mean deviation, Mdev.  Both
estimators compute means, so we get two instances of the rfc793
algorithm:

        Err = abs (Meas - Avg)
        Avg <- Avg + g (Meas - Avg)
        Mdev <- Mdev + g (Err - Mdev)

If we want to do the above fast, it should probably be done in
integer arithmetic.  But the expressions contain fractions (g < 1),
so we'll have to do some scaling to keep everything integer.  If we
choose g so that 1/g is an integer, the scaling is easy.  A
particularly good choice for g is a reciprocal power of 2 (i.e.,
g = 1/2^n for some n).  Multiplying through by 1/g gives

        2^n Avg <- 2^n Avg + (Meas - Avg)
        2^n Mdev <- 2^n Mdev + (Err - Mdev)

To minimize round-off error, we probably want to keep the scaled
versions of Avg and Mdev rather than the unscaled versions.  If we
pick g = .125 = 1/8 (close to the .1 that rfc793 suggests) and
express all the above in C:

        Meas -= (Avg >> 3);
        Avg += Meas;
        if (Meas < 0)
                Meas = -Meas;
        Meas -= (Mdev >> 3);
        Mdev += Meas;

I've been using a variant of this to compute the retransmit timer for
my tcp.  It's clearly faster than the two floating point multiplies
that 4.3bsd uses and gives you twice as much information.  Since the
variation is estimated dynamically rather than using the wired-in
beta of rfc793, the retransmit performance is also much better:
faster retransmissions on low variance links and fewer spurious
retransmissions on high variance links.

It's not necessary to use the same gain for Avg and Mdev.
Empirically, it looks like a retransmit timer estimate works better
if there's more gain (bigger g) in the Mdev estimator.  This forces
the timer to go up quickly in response to changes in the rtt.
(Although it may not be obvious, the absolute value in the
calculation of Mdev introduces an asymmetry in the timer: because
Mdev has the same sign as an increase and the opposite sign of a
decrease, more gain in Mdev makes the timer go up fast and come down
slow, "automatically" giving the behavior Dave Mills suggests in
rfc889.)
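To make the separate-gain idea concrete, here is the same scaled
integer update packaged as a function with the two gain shifts pulled
out as constants.  The packaging and names are mine, purely for
illustration; the state is kept scaled exactly as described above,
and the actual timer code I use follows below.

        #define AVG_SHIFT       3       /* g = 1/8 for Avg */
        #define MDEV_SHIFT      2       /* g = 1/4 for Mdev */

        /*
         * One update of the scaled estimators: *avgp holds
         * 2^AVG_SHIFT * Avg and *mdevp holds 2^MDEV_SHIFT * Mdev.
         */
        void
        rtt_update(int meas, int *avgp, int *mdevp)
        {
                int err;

                err = meas - (*avgp >> AVG_SHIFT);      /* Meas - Avg */
                *avgp += err;                           /* scaled Avg */
                if (err < 0)
                        err = -err;                     /* abs(Meas - Avg) */
                err -= (*mdevp >> MDEV_SHIFT);
                *mdevp += err;                          /* scaled Mdev */
        }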
Using a gain of .25 on the deviation and computing the retransmit
timer, rto, as Avg + 2 Mdev, my tcp timer code looks like:

        Meas -= (Avg >> 3);
        Avg += Meas;
        if (Meas < 0)
                Meas = -Meas;
        Meas -= (Mdev >> 2);
        Mdev += Meas;
        rto = (Avg >> 3) + (Mdev >> 1);

Hope this helps.

 - Van