From van@helios.ee.lbl.gov  Tue Aug  9 08:45:45 1988
Posted-Date: Tue, 09 Aug 88 08:44:24 PDT
Received-Date: Tue, 9 Aug 88 08:45:45 PDT
Received: from helios.ee.lbl.gov by venera.isi.edu (5.54/5.51)
	id AA06244; Tue, 9 Aug 88 08:45:45 PDT
Received: by helios.ee.lbl.gov (5.59/s2.2)
	id AA08542; Tue, 9 Aug 88 08:44:25 PDT
Message-Id: <8808091544.AA08542@helios.ee.lbl.gov>
To: end2end-interest@venera.isi.edu
Subject: notes on quenching and gateway congestion control algorithms
Date: Tue, 09 Aug 88 08:44:24 PDT
From: Van Jacobson <van@helios.ee.lbl.gov>
Status: R

These are extracts from a message about the work I'm doing on
the gateway side of congestion control.  I don't know if the
original message was circulated and I apologize if you're seeing
this twice.  (The original message contained a desparate plea
for the NSF funding I hope will support this work.  I've deleted
it & some other irrelevancies.  In case it's not clear from the
context, "Dave" refers to Dave Mills.  He has a different point
of view on this stuff so remember that you're only getting my
(highly biased) views here.)

This is pretty vague but it's going to take a lot of staring
at gateway data to make it any more specific.  It would be nice
if you could point out the obvious (or non-obvious) mistakes 
before I get so much invested that I'm committed to defending
them.  (Hmm, that reminds me of a quote I was going to suggest
for the Host Requirements RFC:
	"Some mistakes are now mistakes,
	 others are still virtues."
from a poem whose title I forget by the Czech poet Miroslav Holub.)

As you can probably tell, it's late & I'm getting incoherent.
Sorry about that.

 - Van

--------------------------------------

 ...
What I remember from IETF is saying that, based on trace data of
conversations using Arpanet and Milnet gateways, a source quench
was preferentially delivered to the "wrong guy" -- to a conversation
using a small window or a small average bandwidth.  Dave asked
if I'd looked at traffic using NSFnet gateways and I said no --
most of the congestion happens at fast-to-slow transitions and
I didn't have access to an ethernet directly off to the
NSFnet backbone.  But I thought I'd said that there was no question
that the data would have been different -- the fuzzball software
goes to great lengths to deliver quenches and drops to the biggest
user.  Assuming Dave didn't make a coding error, quenches should
go to the right guy.  I don't see how the experiment would say
much of interest about random vs. selective preemption.

That's not to say I'm not interested in getting some NSFnet
trace data.  My self-organization model says that the Arpanet
dies quite easily under load because of the reliable delivery in
the subnet.  Arpanet-like oscillations shouldn't show up until a
much higher load in the pure-datagram NSFnet.  And the evolution
should be completely different:  The Arpanet made a very sudden
transition from random, uniform traffic to periodic, globally
organized traffic.  A large datagram network should make a much
slower transition and there should be structure (something like
period doubling bifurcations) along the route.

 ...

I brought up the quench data in the talk because

  a) It wasn't the way things were designed to work & I thought
     the "self-organization" model helped explain it.

  b) I've heard people say that the current "drop incoming packet
     if queue full" policy is equivalent to random preemption.  This
     isn't true -- Randomness is something you have to work for in
     a congested system.  Deterministic policies that rely on
     source randomness are guaranteed to work non-randomly and badly
     (in Mike O'Dell's phrase, "Nature abhors consistency").

  c) It was yet another example of why source quench is broken and
     should be deprecated.  Now, Dave may counter with "No, it's
     another example of why gateways should be fixed to generate
     SQ and preempt the way fuzzies do."  So, I'll counter this
     anticipated objection...

I'm not sure if I want to get into a battle over random vs.
selective preemption.  My model of what a gateway should do
isn't either.  I'm worried that any policy that involves
inspecting a gateway's queue won't scale well:  We keep
increasing the backbone trunk bandwidth but can't do anything
about the speed of light so the delay-bandwidth product will
continue to go up.  That means connections will be using larger
windows to fill the pipe.  Typically, a window's worth of data
from each connection ends up in the queue when there's a
downstream blockage.  Say you've got a 30KB queue.  Right now,
that means a queue-based policy should see traffic from 15
different connections (saying 56Kbs @ 200ms ~= 2KB).  That's a
large enough sample to make "fair" decisions.  But with the new
backbone you lose:  If you keep the queue the same size, your
horizon is only one connection (saying 1.2Mbs @ 200ms ~= 30KB)
and you can't possibly make a fair decision.  If you increase
the queue size, you'll spend more time scanning the queue to
make a decision.  Since we're going to be gateway-cpu limited
for the foreseeable future, this means the capacity will
decrease under load, a destablizing positive feedback.  And
there will be more "stored energy" (queue) available to drive
the 2nd order resonance effects that are inevitable in a
backbone -- another destablizing positive feedback.

I think we can do something that doesn't depend on having a
queue.  A queue represents an imbalance between the arrival and
departure processes at the gateway.  It should be possible to
construct time-series models describing the processes directly,
rather than working indirectly through an effect (the queue).
The fact that our protocols do reliable delivery (implying some
sort of acknowledgment loop) means the arrival process gets
coupled to the "inertia" of the network.  This suggests that,
for a large network, low order time-series models should
describe the processes (since the high-order, source dependent
terms get washed out by the network's damping).

In particular, preliminary data suggests that a 2nd order model
of arrivals per departure (i.e., feed the average interarrival
time since the last departure through a 2nd order lattice filter
at each departure) gives a good prediction of gateway load and
the coefficients seem to have a "physical" interpretation:  the
DC term is related to the average window used by currently
active connections, the linear term is related to rho and the
quadratic term is related to the period of the downstream
blocking (the amount of global "self organization").  These
things are exactly the inputs I want to feed a gateway
congestion control policy.  For example, the DC term is the
"horizon" needed to insure that all active connections are
considered by the policy.  Using fewer than this many packets
when deciding on a "fair" bandwidth allocation will result in
aliasing and unfair allocation a la the Arpanet.

Things are too preliminary to say much about the gateway policy
but I'll wave my hands:  The rho term says how badly you're
overloaded (after the 2nd order term is factored out since you
don't want to penalize your traffic for long timescale global
oscillation).  The DC term sets the response time:  At least
this many packets will go by before you see any effect from your
congestion control actions.  The model residuals are the
intrinsic variation in the system:  reduce the average queue
below the mean of the residuals and you risk starving the
bottleneck and reducing the system throughput.

Using packet drops as a congestion control signal, my "policy"
is a simple function of the above terms whose output is how
frequently you would have to drop packets to reduce the current
load to an acceptable level.  The inverse of this frequency is
the average number of packets between drops.  I compute a random
number uniformly distributed between 1 and twice this number,
forward that many packets, then drop one.  (Translate "drop" to
"set the DEC bit" if you prefer that signalling scheme).  The
reason for using a random number is to avoid dropping in phase
with any structure in the traffic.  If the number is uniformly
distributed, the drops (or "congestion control signal") are
guaranteed to be fair because the traffic "carries its own
distribution."  I.e., a connection using twice the resources of
all others is twice as likely to get hit by the drop.

Note that neither the load estimator nor the control policy rely
on a queue.  And the amount of computation per packet is
constant so there's no non-linear degradation with load.  And
there's effectively no state in the gateway.  And the gateway
doesn't need to look inside packets to make a fair allocation so
we're free to use new transport protocols.  And it slices, dices,
cures acne and baldness, ... (oops, sorry.  got carried away).

So, I think Dave's selective preemption is the best thing currently
implemented.  I think it may have horizon problems when scaled
up to the new backbone.  I think we'll be cpu limited in the new
backbone and, while selective preemption with a big enough queue
will make slightly better decisions than random preemption on the
same size queue, I suspect the cost in cpu cycles and stability
will out weigh the benefit.  And all the selective preemption
and fair queuing schemes I've seen force a gateway to snoop the
transport layer (otherwise you do stupid things like give my
workstation the same bandwidth allocation as CSNET-RELAY).  This
may make the transport changes that we all see coming harder to
implement.  Finally, without some sort of estimator, selective
preemption will get fooled by the long-term, downstream blockages
we see in the Arpanet (which have to eventually appear in the
NSFnet).  At best this will do congestion control in the wrong
place and lower system stability.  At worst it will pump up the
resonance & help drive the overall throughput to zero.

My scheme, of course, has none of these problems :-).

 ...

