To: jain%erlang.DEC@decwrl.dec.com (Raj Jain, LKG1-2/A19, DTN: 226-7642)
Cc: ietf@gateway.mitre.org
Subject: Re: Your congestion scheme
In-Reply-To: Your message of 03 Nov 87 12:51:00 GMT.
Date: Mon, 16 Nov 87 06:03:29 PST
From: Van Jacobson

Raj,

Thanks for the note.  I hope you'll excuse my taking the liberty of
cc'ing this reply to the ietf: At the last meeting there was a great
deal of interest in your congestion control scheme and our adaptation
of it.

> I am curious to know what values of increase amount and decrease
> factor did you use.  In our previous study on congestion using
> timeout (CUTE scheme, IEEE JSAC, Oct 1986), we had found that the
> decrease factor should be small since packet losses are expensive.
> In fact, we found that a decrease factor of zero (decreasing to one)
> was the best.

We use .5 for the decrease factor and 1 for the increase factor.  We
also use something very close to CUTE (Mike Karels and I were behind
on our journal reading so we independently rediscovered the algorithm
and called it slow-start).  Since we use a lost packet as the
"congestion experienced" indicator, the CUTE algorithm and the
congestion avoidance algorithm (BTW, have you picked a name for it
yet?) get run together, even though they are logically distinct.

The sender keeps two state variables for congestion control: a
congestion window, cwnd, and a threshold size, ssthresh, to switch
between the two algorithms.  The sender's output routine sends the
minimum of cwnd and the window advertised by the receiver.  The rest
of the congestion control sender code looks like this: On a timeout,
record half the current window size as "ssthresh" (this is the
multiplicative decrease part of the congestion avoidance algorithm),
then set the congestion window to 1 packet and call the output
routine (this initiates slowstart/CUTE).  When new data is acked, the
sender does something like

    if (cwnd < ssthresh)        // if we're still doing slowstart
        cwnd += 1 packet        // open the window exponentially
    else
        cwnd += 1/cwnd          // otherwise do the Congestion
                                // Avoidance increment-by-1
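Putting the timeout and ack pieces together, the whole sender-side
update is roughly the sketch below.  (This is C-ish and counts whole
packets for clarity -- the real code keeps everything in bytes and
differs in detail, and the names and the little driver are just for
illustration.)

    #include <stdio.h>

    static double cwnd = 1.0;       /* congestion window, in packets */
    static double ssthresh = 64.0;  /* threshold between the two
                                     * algorithms; starts out large  */

    /* retransmit timer went off: multiplicative decrease, then
     * restart slow-start/CUTE from a one-packet window */
    static void on_timeout(void)
    {
        ssthresh = cwnd / 2.0;      /* remember half the current window */
        if (ssthresh < 2.0)
            ssthresh = 2.0;
        cwnd = 1.0;
        /* ...then retransmit and call the output routine... */
    }

    /* an ack for new data arrived */
    static void on_new_ack(void)
    {
        if (cwnd < ssthresh)
            cwnd += 1.0;            /* slow-start */
        else
            cwnd += 1.0 / cwnd;     /* congestion avoidance: +1 per rtt */
        /* the output routine still sends min(cwnd, receiver's window) */
    }

    /* toy driver: pretend a packet is dropped on every 20th ack */
    int main(void)
    {
        for (int i = 1; i <= 60; i++) {
            if (i % 20 == 0)
                on_timeout();
            else
                on_new_ack();
            printf("%2d: cwnd = %5.2f  ssthresh = %5.2f\n",
                   i, cwnd, ssthresh);
        }
        return 0;
    }

Since one ack comes back per packet, the 1-packet-per-ack rule in the
first branch doubles the window every round-trip time, which is where
the exponential opening comes from.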
Notice that our variation of CUTE opens the window exponentially in
time, as opposed to the linear scheme in your JSAC article.  We looked
at a linear scheme but were concerned about the performance hit on
links with a large bandwidth-delay product (i.e., satellite links).
An exponential builds up fast enough to accommodate any
bandwidth-delay product, and our testing showed no more packet drops
with exponential opening than with linear opening.  (My model of what
slowstart is doing -- starting an ack "clock" to meter out packets --
suggested that there should be no loss-rate difference between the two
policies.)

The reason for the 1/2 decrease, as opposed to the 7/8 you use, was
the following hand-waving: When a packet is dropped, you're either
starting (or restarting after a drop) or steady-state sending.  If
you're starting, you know that half the current window size "worked",
i.e., that a window's worth of packets were exchanged with no drops
(slowstart guarantees this).  Thus on congestion you set the window to
the largest size that you know works, then slowly increase the size.
If the connection is steady-state running and a packet is dropped,
it's probably because a new connection started up and took some of
your bandwidth.  We usually run our nets with rho <= .5, so it's
probable that there are now exactly two conversations sharing the
bandwidth.  I.e., you should reduce your window by half because the
bandwidth available to you has been reduced to half.

And, if there are more than two conversations sharing the bandwidth,
halving your window is conservative -- and being conservative at high
traffic intensities is probably wise.

Obviously, this large decrease term is accounting for the high cost of
our "congestion experienced" indicator compared to yours -- a dropped
packet vs. a bit in the ack.  If the DEC bit were available, the
preceding "logic" should be ignored.  But I wanted tcp congestion
avoidance that we could deploy immediately and incrementally, without
adding code to the hundreds of Internet gateways, so using dropped
packets seemed like the only choice.  And, in system terms, the cost
isn't that high: Currently packets are dropped only when a large queue
has formed.  If we had a bit to force senders to reduce their windows,
we'd still be stuck with the queue, since we'd still be running the
bottleneck at 100% utilization and there would be no excess bandwidth
available to dissipate the queue.  If we toss a packet, a sender will
shut up for 2 rtt, exactly the time we need to empty the queue (in the
usual case).  If that sender restarts with the correct window size,
the queue won't reform.  Thus we've reduced the delay to a minimum
without the system losing any bottleneck bandwidth.

The 1-packet increase has less justification than the .5 decrease.  In
fact, it's almost certainly too large.  If the algorithm converges to
a window size of w, you get O(w^2) packets between drops with an
additive increase policy (there's a toy calculation of this tacked on
at the end of this note).  We were shooting for an average drop rate
of <1% and found that on the Arpanet (the worst case of the 4 networks
we tested), windows converged to 8 - 12 packets.  This yields 1-packet
increments for a 1% average drop rate.

But, since we've done nothing in the gateways, the window we converge
to is the maximum the gateway can accept without dropping packets.
I.e., in the terms you used, we are just to the left of the cliff
rather than just to the right of the knee.  If we now fix the gateways
so they start dropping packets when the queue gets pushed past the
knee, our increment will be much too aggressive and we'll have to drop
it by at least a factor of 4 (since all my measurements on an unloaded
Arpanet or Milnet place their "pipe size" at 4-5 packets).

Mike and I have talked about a second-order control loop to adaptively
determine the appropriate increment to use for a path (there doesn't
seem to be much need to adjust the decrement).  It looks trivial to
implement such a loop: since changing the increment affects the
frequency of oscillation but not the mean window size, the loop would
affect rate of convergence but not convergence and (first-order)
stability.  But we think 2nd-order stuff should wait until we've spent
some time on the 1st-order part of the algorithm for the gateways.

I'm tired and probably no longer making sense.  I think I'll head home
and get some sleep.  Hope to hear from you soon.

Cheers.

 - Van
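P.S.  In case the "O(w^2) packets between drops" claim above seems
like more hand-waving, here's a toy way to see it.  This just walks
the idealized sawtooth -- halve on a drop, then add "incr" packets per
round trip until the path limit w is hit again -- so take the
constants with a grain of salt; only the quadratic growth matters.

    #include <stdio.h>

    /* Count the packets sent in one congestion-avoidance cycle: the
     * window starts at w/2 (just after a halving), grows by "incr"
     * packets per round trip, and the next drop happens when it
     * reaches the path limit w.  One window's worth of packets goes
     * out per round trip. */
    static long packets_per_drop(int w, int incr)
    {
        long pkts = 0;
        for (int cwnd = w / 2; cwnd <= w; cwnd += incr)
            pkts += cwnd;
        return pkts;
    }

    int main(void)
    {
        for (int w = 4; w <= 32; w *= 2)
            printf("w = %2d, incr = 1: ~%ld packets between drops\n",
                   w, packets_per_drop(w, 1));
        return 0;
    }

The count roughly quadruples every time w doubles, i.e., it grows like
w^2/incr, which is why an increment that is right for one window size
is too aggressive for a path with a smaller one.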