From van@lbl-csam.arpa Mon Nov 2 01:27:39 1987
Posted-Date: Mon, 02 Nov 87 01:23:01 PST
Received-Date: Mon, 2 Nov 87 01:27:24 PST
Received: from GATEWAY.MITRE.ORG by venera.isi.edu (5.54/5.51)
        id AA13456; Mon, 2 Nov 87 01:27:24 PST
Return-Path:
Received: by gateway.mitre.org (5.54/SMI-2.2)
        id AA05906; Mon, 2 Nov 87 04:24:58 EST
Received: by lbl-csam.arpa (5.58/1.18)
        id AA18606; Mon, 2 Nov 87 01:23:03 PST
Message-Id: <8711020923.AA18606@lbl-csam.arpa>
To: gross@gateway.mitre.org, stine@gateway.mitre.org
Cc: Jain%Erlang.DEC@decwrl.dec.com, ietf@gateway.mitre.org
Subject: Re: IETF mtg and Congestion Control/Performance Working Group
In-Reply-To: Your message of Tue, 20 Oct 87 08:01:19 EDT.
Date: Mon, 02 Nov 87 01:23:01 PST
From: Van Jacobson
Status: R

Phill and Bob -

Sorry to be sending this off so late.  I hope it gets to you
before the IETF meeting ends.  I have to teach a class Tuesday
morning.  If I can get a plane out Tuesday afternoon, I'll try to
be in Boulder for the Wednesday sessions.

I wanted to get IETF feedback on some new congestion control
stuff Mike Karels and I have put into TCP.  We hope to be
distributing new 4bsd network code shortly and, if Mike is
successful in changing the copyright, the code should receive
wide distribution and be easy to redistribute.  I have sort of a
vague master plan for congestion control and the new tcp reflects
part of it.  It would be nice if the IETF could take potshots at
what we've done so far and where we might go with it.

The most recent change we've made is to implement a congestion
avoidance, dynamic window scheme for tcp very much like Raj
Jain's DEC-TR-506 proposal.  I should explain that, because there
may be some confusion between the 'DEC bit' in ISO 8473 and the
overall congestion avoidance scheme.  As I understand it, Jain's
scheme has two separate pieces: (1) a method of detecting that
congestion exists along the path (the 'congestion experienced'
bit in ISO 8473), and (2) an algorithmic modification of the
sender's window depending on whether or not congestion is
experienced.

We replaced (1) with an estimator that uses lost packets to
indicate "congestion experienced".  I have several reasons for
preferring packet loss as a congestion indicator over a new bit
in the packet, but the major reason is that the congestion
control code can be deployed and start working incrementally and
immediately: no modifications need to be made to the gateways (or
even to the receiving tcp's).  Of course, gateway modifications
will help the new algorithm (e.g., a gateway algorithm along the
lines of fair queuing or Dave Mills' preemption), but they aren't
necessary and they can be done incrementally: large gains in
performance could come from fixing just a few bottleneck
gateways.  (The other nice thing about using packet loss is that
the same mechanism that lets a gateway signal a new tcp helps it
deal with overload from an old, broken tcp.)

I don't think we changed the window algorithm in (2) at all (I'm
not sure of this because I haven't received a copy of the DEC
report -- I'm basing this on the presentation Raj gave at the
Boston IETF meeting): we follow the same multiplicative decrease
/ additive increase scheme on congestion experienced / not
experienced.  This isn't an accident.  During the Boston
presentation, it hit me that this was the only scheme guaranteed
to converge for an arbitrary, first-order linear system (i.e.,
for an arbitrary traffic distribution and topology), and the
optimal control equations follow directly from the equation
describing the system.  (I have since found a couple of
references supporting this and I'm sure there are similar proofs
in the DEC paper.)
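In case the words are opaque, here is a sketch of the update rule
(illustrative only -- these are not the 4bsd variable names and
this is not the actual four lines of code; cwnd and both routines
are invented for the example):

    /*
     * Sketch of the multiplicative decrease / additive increase
     * window adjustment.  cwnd is a congestion window measured in
     * packets, kept beside the normal send window; the sender
     * offers min(cwnd, receiver's advertised window) to the net.
     */
    static int cwnd = 1;        /* congestion window, in packets */

    /* A round trip went by with no loss: probe for more bandwidth. */
    void
    no_congestion()
    {
            cwnd += 1;          /* additive increase: +1 packet per RTT */
    }

    /* A packet was lost -- the "congestion experienced" signal. */
    void
    congestion_experienced()
    {
            cwnd /= 2;          /* multiplicative decrease */
            if (cwnd < 1)
                    cwnd = 1;   /* never close the window completely */
    }

(Taking the min with the receiver's advertised window means the
congestion control never overrides normal flow control.)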
The algorithm added one new state variable and four lines of code
to TCP (Mike was sanguine about the new code but the new variable
hurt -- we're down to two free bytes in the tcpcb).  As we
currently have the algorithm tuned, it converges to a loss rate
of .1 to .5%.

I have run a lot of tests looking at fairness, stability and rate
of convergence: everything looks great (except fairness -- that's
hard to do at the endpoints).  For example, I fired up 8
simultaneous ftp's on 8 different machines, each ftp using a 16KB
(32 packet) window.  All the traffic was fed through our poor
Milnet gateway (which would allocate only 16 packets of buffer,
total, for all the ftp's, since they were all destined for hosts
gatewayed by ucbvax).  Even though the demand exceeded the
gateway capacity by 1600%, all the connections "learned" the
available capacity in just 5 round trip times and the loss rate
settled down to .5%.  (The loss rate is due to the algorithm
"testing" the path to see if, say, some other connection has
closed down and freed up more bandwidth.  You can make the loss
arbitrarily small, but you increase the time it takes a
connection to learn "good news".  We thought something around 1%
was a good tradeoff between bandwidth lost to retransmissions and
bandwidth lost to underestimating the window.)

All the tests have worked so well that we're thinking it's time
to put tcp on the back burner and start looking at gateway
algorithms.  I think fair queuing, combined with some cleverness
in figuring out when to drop packets and which ones to drop,
would be a workable algorithm.  But I think we can do things that
are a lot simpler.  I worry that fair queuing requires the
gateway to know something about the transport protocols
(something I think we should avoid, since there are several new
transport protocols on the horizon and it will be a lot of work
to keep gateway implementations current with the protocol mix),
and fair queuing requires a lot of state in the gateways
(something we should avoid in the next generation of packet
switch -- the state maintenance adds a lot to the packet
processing time, and the space used for end-to-end state could
probably be better used as packet buffers or routing cache).  I
have some "random" gateway algorithms that I think would do as
good a job for congestion control as fair queuing but require no
state and have negligible per-packet cost.  (If my NSF proposal
ever makes it through the LBL bureaucracy and out to Steve Wolff,
it asks for funding to simulate, then prototype and test, these
gateway algorithms.)
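To give a feel for what "stateless" buys you, here is one
possible scheme (just a sketch of the flavor -- it is not the
algorithm in the proposal, and everything in it is invented for
illustration): when the queue fills, drop a packet chosen
uniformly at random from the queue.  A connection's chance of
being hit is then proportional to its share of the buffer, which
is a rough sort of fairness, and the gateway keeps no
per-connection state and pays one random number per overflow.

    #include <stdlib.h>

    #define QLEN 16             /* gateway buffer, in packets */

    struct packet;              /* contents don't matter here */

    static struct packet *queue[QLEN];
    static int qcount;

    /*
     * Enqueue an arriving packet.  On overflow, pick a victim
     * uniformly from the queued packets and let the arrival take
     * its slot.  (A real version would preserve FIFO order and
     * might make the arrival a candidate victim too.)  Returns
     * the dropped packet, if any, so the caller can free it.
     */
    struct packet *
    enqueue(struct packet *p)
    {
            int i;
            struct packet *victim;

            if (qcount < QLEN) {        /* room -- nothing dropped */
                    queue[qcount++] = p;
                    return 0;
            }
            i = rand() % QLEN;          /* uniform random victim */
            victim = queue[i];
            queue[i] = p;               /* arrival takes its slot */
            return victim;
    }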
That's about all that's been happening here over the past couple
of months.  Oh, there's one other encouraging note: Keith Sklower
at Berkeley has ported all the tcp algorithms (timer stuff, slow
start, fast retransmit, dynamic window) to the 4.3bsd XNS
Sequenced Packet Protocol implementation.  He's just started
testing, but Friday he reported that the new code improved
throughput from a Sun 3/50 XNS client to an (unmodified) Xerox
fileserver by 50% -- 16KB/s to 24KB/s.  (I thought this was neat
because the algorithms are really intended for a long-haul net;
it's nice to see them making a big improvement on a local net.)
Since everything went into SPP pretty easily, it might bode well
for applying all this stuff to TP4 (or whatever ISO sticks us
with).

Hope to see you Wednesday.

 - Van