From van@lbl-csam.arpa Mon Nov 2 01:27:39 1987
Posted-Date: Mon, 02 Nov 87 01:23:01 PST
Received-Date: Mon, 2 Nov 87 01:27:24 PST
Received: from GATEWAY.MITRE.ORG by venera.isi.edu (5.54/5.51)
        id AA13456; Mon, 2 Nov 87 01:27:24 PST
Return-Path:
Received: by gateway.mitre.org (5.54/SMI-2.2)
        id AA05906; Mon, 2 Nov 87 04:24:58 EST
Received: by lbl-csam.arpa (5.58/1.18)
        id AA18606; Mon, 2 Nov 87 01:23:03 PST
Message-Id: <8711020923.AA18606@lbl-csam.arpa>
To: gross@gateway.mitre.org, stine@gateway.mitre.org
Cc: Jain%Erlang.DEC@decwrl.dec.com, ietf@gateway.mitre.org
Subject: Re: IETF mtg and Congestion Control/Performance Working Group
In-Reply-To: Your message of Tue, 20 Oct 87 08:01:19 EDT.
Date: Mon, 02 Nov 87 01:23:01 PST
From: Van Jacobson
Status: R

Phill and Bob -

Sorry to be sending this off so late.  I hope it gets to you
before the IETF meeting ends.  I have to teach a class Tuesday
morning.  If I can get a plane out Tuesday afternoon, I'll try to
be in Boulder for the Wednesday sessions.

I wanted to get IETF feedback on some new congestion control
stuff Mike Karels and I have put into TCP.  We hope to be
distributing new 4bsd network code shortly and, if Mike is
successful in changing the copyright, the code should receive
wide distribution and be easy to redistribute.  I have sort of a
vague master plan for congestion control and the new tcp reflects
part of it.  It would be nice if the IETF could take potshots at
what we've done so far and where we might go with it.

The most recent change we've made is to implement a congestion
avoidance, dynamic window scheme for tcp very much like Raj
Jain's DEC-TR-506 proposal.  I should explain that, because there
may be some confusion between the 'DEC bit' in ISO 8473 and the
overall congestion avoidance scheme.  As I understand it, Jain's
scheme has two separate pieces: (1) a method of detecting that
congestion exists along the path (the 'congestion experienced'
bit in ISO 8473), and (2) an algorithmic modification of the
sender's window depending on whether or not congestion is
experienced.

We replaced (1) with an estimator that uses lost packets to
indicate "congestion experienced".  I have several reasons for
preferring packet loss as a congestion indicator over a new bit
in the packet, but the major reason is that the congestion
control code can be deployed and start working incrementally and
immediately: no modifications need to be made to the gateways (or
even to the receiving tcp's).  Of course, gateway modifications
will help the new algorithm (e.g., a gateway algorithm along the
lines of fair queuing or Dave Mills' preemption), but they aren't
necessary and they can be done incrementally: large gains in
performance could come from fixing just a few bottleneck
gateways.  (The other nice thing about using packet loss is that
the same mechanism that lets a gateway signal a new tcp helps it
deal with overload from an old, broken tcp.)

I don't think we changed the window algorithm in (2) at all (I'm
not sure of this because I haven't received a copy of the DEC
report -- I'm basing this on the presentation Raj gave at the
Boston IETF meeting): we follow the same multiplicative decrease
/ additive increase scheme on congestion experienced / not
experienced.  This isn't an accident.  During the Boston
presentation, it hit me that this was the only scheme guaranteed
to converge for an arbitrary, first-order linear system (i.e.,
for an arbitrary traffic distribution and topology), and the
optimal control equations follow directly from the equation
describing the system.  (I have since found a couple of
references supporting this and I'm sure there are similar proofs
in the DEC paper.)
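In case the words are opaque, here is a sketch of the update rule
(illustrative only -- these are not the 4bsd variable names and
this is not the actual four lines of code; cwnd and both routines
are invented for the example):

    /*
     * Sketch of the multiplicative decrease / additive increase
     * window adjustment.  cwnd is a congestion window measured in
     * packets, kept beside the normal send window; the sender
     * offers min(cwnd, receiver's advertised window) to the net.
     */
    static int cwnd = 1;        /* congestion window, in packets */

    /* A round trip went by with no loss: probe for more bandwidth. */
    void
    no_congestion()
    {
            cwnd += 1;          /* additive increase: +1 packet per RTT */
    }

    /* A packet was lost -- the "congestion experienced" signal. */
    void
    congestion_experienced()
    {
            cwnd /= 2;          /* multiplicative decrease */
            if (cwnd < 1)
                    cwnd = 1;   /* never close the window completely */
    }

(Taking the min with the receiver's advertised window means the
congestion control never overrides normal flow control.)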
The algorithm added one new state variable and four lines of code
to TCP (Mike was sanguine about the new code but the new variable
hurt -- we're down to two free bytes in the tcpcb).  As we
currently have the algorithm tuned, it converges to a loss rate
of .1 to .5%.

I have run a lot of tests looking at fairness, stability and rate
of convergence: everything looks great (except fairness -- that's
hard to do at the endpoints).  For example, I fired up 8
simultaneous ftp's on 8 different machines, each ftp using a 16KB
(32 packet) window.  All the traffic was fed through our poor
Milnet gateway (which would allocate only 16 packets of buffer,
total, for all the ftp's, since they were all destined for hosts
gatewayed by ucbvax).  Even though the demand exceeded the
gateway capacity by 1600%, all the connections "learned" the
available capacity in just 5 round trip times and the loss rate
settled down to .5%.  (The loss rate is due to the algorithm
"testing" the path to see if, say, some other connection has
closed down and freed up more bandwidth.  You can make the loss
arbitrarily small, but you increase the time it takes a
connection to learn "good news".  We thought something around 1%
was a good tradeoff between bandwidth lost to retransmissions and
bandwidth lost to underestimating the window.)

All the tests have worked so well that we're thinking it's time
to put tcp on the back burner and start looking at gateway
algorithms.  I think fair queuing, combined with some cleverness
in figuring out when to drop packets and which ones to drop,
would be a workable algorithm.  But I think we can do things that
are a lot simpler.  I worry that fair queuing requires the
gateway to know something about the transport protocols
(something I think we should avoid, since there are several new
transport protocols on the horizon and it will be a lot of work
to keep gateway implementations current with the protocol mix),
and fair queuing requires a lot of state in the gateways
(something we should avoid in the next generation of packet
switch -- the state maintenance adds a lot to the packet
processing time, and the space used for end-to-end state could
probably be better used as packet buffers or routing cache).  I
have some "random" gateway algorithms that I think would do as
good a job for congestion control as fair queuing but require no
state and have negligible per-packet cost.  (If my NSF proposal
ever makes it through the LBL bureaucracy and out to Steve Wolff,
it asks for funding to simulate, then prototype and test, these
gateway algorithms.)
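To give a feel for what "stateless" buys you, here is one
possible scheme (just a sketch of the flavor -- it is not the
algorithm in the proposal, and everything in it is invented for
illustration): when the queue fills, drop a packet chosen
uniformly at random from the queue.  A connection's chance of
being hit is then proportional to its share of the buffer, which
is a rough sort of fairness, and the gateway keeps no
per-connection state and pays one random number per overflow.

    #include <stdlib.h>

    #define QLEN 16             /* gateway buffer, in packets */

    struct packet;              /* contents don't matter here */

    static struct packet *queue[QLEN];
    static int qcount;

    /*
     * Enqueue an arriving packet.  On overflow, pick a victim
     * uniformly from the queued packets and let the arrival take
     * its slot.  (A real version would preserve FIFO order and
     * might make the arrival a candidate victim too.)  Returns
     * the dropped packet, if any, so the caller can free it.
     */
    struct packet *
    enqueue(struct packet *p)
    {
            int i;
            struct packet *victim;

            if (qcount < QLEN) {        /* room -- nothing dropped */
                    queue[qcount++] = p;
                    return 0;
            }
            i = rand() % QLEN;          /* uniform random victim */
            victim = queue[i];
            queue[i] = p;               /* arrival takes its slot */
            return victim;
    }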
That's about all that's been happening here over the past couple
of months.  Oh, there's one other encouraging note: Keith Sklower
at Berkeley has ported all the tcp algorithms (timer stuff, slow
start, fast retransmit, dynamic window) to the 4.3bsd XNS
Sequenced Packet Protocol implementation.  He's just started
testing, but Friday he reported that the new code improved
throughput from a Sun 3/50 XNS client to an (unmodified) Xerox
fileserver by 50% -- 16KB/s to 24KB/s.  (I thought this was neat
because the algorithms are really intended for a long-haul net;
it's nice to see them making a big improvement on a local net.)
Since everything went into SPP pretty easily, it might bode well
for applying all this stuff to TP4 (or whatever ISO sticks us
with).

Hope to see you Wednesday.

 - Van