To: tcp-ip@sri-nic.arpa
Subject: Dynamic Congestion Avoidance / Control (long message)
Date: Thu, 11 Feb 88 22:17:04 PST
From: Van Jacobson

A dozen people forwarded Phil Karn's question about TCP congestion control to me, usually with pithy subject lines like "how much longer till you publish something?". I do have three RFCs and two papers in various stages of preparation, but innate laziness combined with this semester's unexpectedly large teaching load means it will probably be late spring before anything gets finished. In the interim, here's a sort-of overview of our congestion control work.

I should point out these algorithms are joint work with Mike Karels of UC Berkeley (I guess my name got stuck on things because I make the presentations while Mike is off fighting the battles on EGP or routing or domains or ...). I should also mention that there's not a whole lot that's original. In particular, the two most important algorithms are quite close to (prior) work done by DEC's Raj Jain. This was by accident in one case and deliberate in the other.

This stuff has been discussed on the ietf and end2end lists (Phil participated in some of those discussions and was, in fact, one of the early beta testers for our new tcp -- I have this nagging suspicion I was set up). I've appended a couple of those mail messages.

Mike and I have put a number (six, actually) of new algorithms into the 4bsd tcp. Our own measurements and the reports of our beta testers suggest that the result is quite good at dealing with current, congested conditions on the Internet. The various algorithms were developed and presented at different times over the past year and it may not be clear to anyone but the developers how, or if, the pieces relate.

Most of the changes spring from one observation: The flow on a TCP connection (or ISO TP-4 or XNS SPP connection) should, by the design of the protocol, obey a `conservation of packets' principle. And, if this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control turns into finding places where conservation is violated and fixing them.

By `conservation of packets' I mean the following: If you look at the connection "in equilibrium", i.e., running stably with a full window of data in transit, you notice that the flow is what a physicist would call "conservative": A new packet isn't put into the network until an old packet leaves. Two scientific disciplines that deal with flow, hydrodynamics and thermodynamics, predict that systems with this conservation property should be extremely robust in the face of congestion. Observation of the Arpanet or NSFNet suggests that they were not particularly robust in the face of congestion. Whence the discrepancy?

[Someone asked me for a simple explanation of why a conservative flow system should be stable under load. I've been trying to come up with one, thus far without success -- absolutely nothing in hydrodynamics admits to simple explanations. The shortest explanation is that the inhomogeneous forcing terms in the Fokker-Planck equation vanish and the remaining terms are highly damped. But I don't suppose this means much to anyone (it doesn't mean much to me). My intuition is that conservation means there's never an `energy difference' between the outside world and the system that the system could use to `pump up' an instability to a state of collapse.
Thus the only mechanism the system can use for self-destruction is to re-arrange its internal energy and create a difference large enough to break something. But entropy (or diffusion) always tries to erase large internal energy differences so collapse via these mechanisms is improbable. Possible, but improbable, at least until the load gets so outrageous that collective, non-ergodic effects start to dominate the overall behavior.]

Packet conservation can fail three ways:

  1) the connection doesn't get to equilibrium, or

  2) a sender injects a packet before an old packet has exited, or

  3) the equilibrium can't be reached because of resource limits along the path.

(1) has to be from a connection starting or restarting after a packet loss. Another way to look at the conservation property is to say that the sender uses acks as a "clock" to strobe new packets into the network. Since the receiver can generate acks no faster than data packets can get through the network, the protocol is "self clocking" (an ack can't go in until a packet comes out, a packet can't go in until an ack comes out). Self-clocking systems automatically adjust to bandwidth and delay variations and have a wide dynamic range (an important issue when you realize that TCP spans everything from 800 Mbit/sec Cray channels to 1200 bit/sec packet radio links). But the same thing that makes a self-clocked system stable when it's running makes it hard to get started -- to get data flowing you want acks to clock packets out but to get acks you have to have data flowing.

To get the `clock' started, we came up with an algorithm called slow-start that gradually increases the amount of data flowing. Although we flatter ourselves that the design of this algorithm is quite subtle, the implementation is trivial -- 3 lines of code and one new state variable in the sender:

  1) whenever you're starting or restarting after a loss, set your "congestion window" to 1 packet.

  2) when you get an ack for new data, increase the congestion window by one packet.

  3) when you send, send the minimum of the receiver's advertised window and the congestion window.

(This is quite similar to Raj Jain's CUTE algorithm described in IEEE Transactions on Communications, Oct. '86, although we didn't know about CUTE at the time we were developing slow-start.)

Actually, the slow-start window increase isn't all that gradual: The window opening takes time proportional to log2(W) where W is the window size in packets. This opens the window fast enough to have a negligible effect on performance, even on links that require a very large window. And the algorithm guarantees that a connection will source data at a rate at most twice the maximum possible on the path. (Without slow-start, by contrast, when 10Mb ethernet hosts talk over the 56Kb Arpanet via IP gateways, the gateways see a window's worth of packets (8-16) delivered at 200 times the path bandwidth.)
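To make the three rules concrete, here's a minimal sketch in C of the sender-side logic, with windows counted in packets. The names (snd_cwnd, snd_wnd and so on) are illustrative and don't necessarily match the 4bsd variables:

    /* Sketch of the slow-start rules, windows counted in packets.
     * Names are illustrative, not the actual 4bsd variables. */

    unsigned int snd_cwnd;  /* congestion window (the one new state variable) */
    unsigned int snd_wnd;   /* window advertised by the receiver */

    /* Rule 1: on start, or on restart after a loss, close the window. */
    void
    tcp_slowstart_reset(void)
    {
            snd_cwnd = 1;
    }

    /* Rule 2: each ack for new data opens the window by one packet. */
    void
    tcp_ack_newdata(void)
    {
            snd_cwnd += 1;
    }

    /* Rule 3: never have more than min(advertised window, cwnd) in flight. */
    unsigned int
    tcp_send_window(void)
    {
            return (snd_cwnd < snd_wnd) ? snd_cwnd : snd_wnd;
    }

Since each ack both clocks out an old packet and opens the window by one new packet, the amount of data in flight roughly doubles every round trip time, which is where the log2(W) start-up time comes from.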
Once you can reliably start data flowing, problems (2) and (3) have to be addressed. Assuming that the protocol implementation is correct, (2) must represent a failure of the sender's retransmit timer. A good round trip time estimator (the core of the retransmit timer) is the single most important feature of any protocol implementation that expects to survive heavy load. And it almost always gets botched.

One mistake seems to be not estimating the variance of the rtt. From queuing theory we know that rtt and rtt variation increase very quickly with load. Measuring the load by rho (the ratio of average arrival rate to average departure rate), the rtt goes up like 1/(1-rho) and the variation in rtt goes like 1/(1-rho)^2. To make this concrete, if the network is running at 75% of capacity (as the Arpanet was in last April's collapse), you should expect rtt to vary by a factor of 16 around its mean value.

The RFC793 parameter "beta" accounts for rtt variation. The suggested value of "2" can adapt to loads of at most 30%. Above this point, a connection will respond to load increases by retransmitting packets that have only been delayed in transit. This makes the network do useless work (wasting bandwidth on duplicates of packets that have been or will be delivered) at a time when it's known to be having trouble with useful work. I.e., this is the network equivalent of pouring gasoline on a fire.

We came up with a cheap method for estimating variation (see the first of the attached msgs) and the resulting retransmit timer essentially eliminates spurious retransmissions. A pleasant side effect of estimating "beta" rather than using a fixed value is that low load as well as high load performance improves, particularly over high delay paths such as satellite links.

Another timer mistake seems to be in the backoff after a retransmit: If we have to retransmit a packet more than once, how should the retransmits be spaced? It turns out there's only one scheme that's likely to work, exponential backoff, but proving this is a bit involved (one of the two papers alluded to in the opening goes through a proof). We might finesse a proof by noting that a network is, to a very good approximation, a linear system. (That is, it is composed of elements that behave like linear operators -- integrators, delays, gain stages, etc.) Linear system theory says that if a system is stable, the stability is exponential. This suggests that if we have a system that is unstable (a network subject to random load shocks and prone to congestive collapse), the way to stabilize it is to add some exponential damping (read: exponential timer backoff) to its primary excitation (read: senders, traffic sources).
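The cheap estimator itself is spelled out in the first attached message; purely to give the flavor, here is a sketch in C of one way a mean/mean-deviation rtt estimator and an exponentially backed-off timeout could look. The gains (1/8 and 1/4), the multiplier on the deviation term and the clamp values are illustrative assumptions, not necessarily what the attached message or the 4bsd code uses:

    /* Sketch of a mean/mean-deviation rtt estimator plus exponential
     * backoff.  Gains, multiplier and clamps are illustrative. */

    static double srtt;     /* smoothed round trip time estimate */
    static double rttvar;   /* smoothed mean deviation of the rtt */
    static double rto;      /* current retransmit timeout (seconds) */

    #define RTO_MIN   1.0   /* illustrative clamps */
    #define RTO_MAX  64.0

    /* Feed in a new rtt measurement m (seconds). */
    void
    rtt_update(double m)
    {
            double err;

            if (srtt == 0.0) {          /* first sample: prime the filters */
                    srtt = m;
                    rttvar = m / 2.0;
            }
            err = m - srtt;
            srtt += err / 8.0;          /* low-gain average of the rtt */
            if (err < 0.0)
                    err = -err;
            rttvar += (err - rttvar) / 4.0;  /* low-gain average of |error| */

            rto = srtt + 2.0 * rttvar;  /* timeout covers mean plus spread */
            if (rto < RTO_MIN)
                    rto = RTO_MIN;
    }

    /* On each retransmission of the same packet, back the timer off
     * exponentially so an overloaded net sees geometrically less traffic. */
    void
    rtt_backoff(void)
    {
            rto *= 2.0;
            if (rto > RTO_MAX)
                    rto = RTO_MAX;
    }

The point of tracking the deviation is that the timeout follows the spread of the rtt as well as its mean, so a connection stops mistaking delayed packets for lost ones once the load climbs past the roughly 30% that a fixed beta of 2 can cover.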
Once the timers are in good shape, you can state with some confidence that a timeout really means a lost packet and not a busted timer. At this point you can do something about (3). Packets can get lost for two reasons: they are damaged in transit or the network is congested and somewhere on the path there was insufficient buffer capacity. On most of the network paths we use, loss due to damage is rare (<<1%) so it is very probable that a packet loss is due to congestion in the network.

Say we try to develop a `congestion avoidance' strategy like the one Raj Jain, et al., propose in DEC TR-506 and ISO 8473. It must have two components: The network must be able to signal the transport endpoints that congestion is occurring (or about to occur). And the endpoints must have a policy that decreases utilization if this signal is received and increases utilization if the signal isn't received.

If packet loss is (almost) always due to congestion and if a timeout is (almost) always due to a lost packet, we have a good candidate for the `network is congested' signal. Particularly since this signal is delivered automatically by all existing networks, without special modification (as opposed to, say, ISO 8473, which requires a new bit in the packet headers and a mod to *all* existing gateways to set this bit).

[As a (lengthy) aside: Using `lost packets' to communicate information seems to bother people. The second of the two papers mentioned above is devoted to analytic and simulation investigations of our algorithm under various network conditions. I'll briefly mention some of the objections we've heard and, I think, addressed.

There have been claims that signaling congestion via dropping packets will adversely affect total throughput (it's easily proved that the opposite is true) or that this signal will be `too slow' compared to alternatives. (The fundamental frequency of the system is set by its average round trip time. From control theory and Nyquist's theorem, any control strategy that attempts to respond in less than two round trip times will be unstable. A packet loss is detected in at most two rtt, the minimum stable response time.)

There have also been claims that this scheme will result in unfair or sub-optimal resource allocation (this is untrue if equivalent implementations of ISO and our schemes are compared) or that mistaking damage for congestion will unnecessarily throttle the endpoints on some paths with a high intrinsic loss rate (this is mostly untrue -- the scheme we propose is analytically tractable so we've worked out its behavior under random loss. It turns out that window flow control schemes just don't handle high loss rates well and throughput of a vanilla TCP under high, random loss is abysmal. Adding our congestion control scheme makes things slightly worse but the practical difference is negligible.)

As an example (that we have calculated and Jon Crowcroft at UCL has measured), it takes 40 256-byte packets to fill the Satnet pipe. Satnet currently shows a random, 1% average packet loss. The maximum throughput a vanilla tcp could achieve at this loss rate is 56% of the 40kbs channel bandwidth. The maximum throughput our TCP can achieve at this loss rate is 49% of the channel bandwidth.

In theory, the use of dynamic congestion control should allow one to achieve much better throughput under high loss than is possible with normal TCP -- performance comparable to, say, NETBLT over the same lossy link. The reason is that regular TCP tries to communicate two different things with one number (the window field in the packet): the amount of buffer the receiver has available and the delay-bandwidth product of the pipe. Our congestion control scheme separates these two numbers. The sender estimates the pipe size so the receiver window need only describe the receiver's buffer size. As long as the receiver advertises a sufficiently large buffer (twice the delay-bandwidth product), a 1% loss rate would translate into a 1% throughput effect, not a factor-of-two effect. Of course, we have yet to put this theory to the test.]

The other part of a congestion avoidance strategy, the endnode action, is almost identical in Jain's DEC/ISO scheme and our TCP. This is not an accident (we copied Jain's scheme after hearing his presentation at the Boston IETF and realizing that the scheme was, in a sense, universal): The endnode strategy follows directly from a first-order time-series model of the network.

Say we measure network `load' by average queue length over fixed intervals of some appropriate length (something near the rtt). If L(i) is the load at interval i, a network not subject to congestion can be modeled by saying L(i) changes slowly compared to the sampling time. I.e., L(i) = N (the average queue length doesn't change over time). If the network is subject to congestion, this zeroth-order model breaks down.
The average queue length becomes the sum of two terms, the N above that accounts for the average arrival rate of new traffic and a new term that accounts for the left-over traffic from the last time interval:

  L(i) = N + a L(i-1)

(As pragmatists, we extend the original model just enough to explain the new behavior. What we're doing here is taking the first two terms in a Taylor series expansion of L(t) when we find that just the first term is inadequate. There is reason to believe one would eventually need a three-term, second-order model, but not until the Internet has grown to several times its current size.)

When the network is congested, the `a' term must be large and the load (queues) will start increasing exponentially. The only way to stabilize the system is if the traffic sources throttle back at least as fast as the queues are growing. Since the way a source controls load in a window-based protocol is to adjust the size of the window, W, we end up with the sender policy

  On congestion:  W(i) = d W(i-1)   (d < 1)

I.e., a multiplicative decrease of the window size. (This will turn into an exponential decrease over time if the congestion persists.)

If there's no congestion, the `a' term must be near zero and the network load is about constant. You have to try to increase the bandwidth you're using to find out the current limit (e.g., you could have been sharing the path with someone else and converged to a window that gives you each half the available bandwidth. If he shuts down, 50% of the bandwidth will get wasted unless you make some effort to increase your window size.) What should the increase policy be?

The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, e.g., W(i) = b W(i-1), 1 < b <= 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput. The reason why is tedious to explain. It has to do with the fact that it is easy to drive the net into saturation but hard for the net to recover (what Kleinrock, vol. 2, calls the "rush-hour effect"). Thus overestimating the available bandwidth is costly. But an exponential, almost regardless of its time constant, increases so quickly that large overestimates are inevitable.

Without justification, I'll simply state that the best increase policy is to make small, constant changes to the window size (if you've had a control theory course, you've seen the justification):

  On no congestion:  W(i) = W(i-1) + u   (u << the path delay-bandwidth product)

I.e., an additive increase policy. This is exactly the policy that Jain, et al., suggest in DEC TR-506 and exactly the policy we've implemented in TCP. The only difference in our implementations is the choice of constants for `d' and `u'. We picked .5 and 1 for reasons that are partially explained in the second of the attached messages. A more complete analysis is in the second in-progress paper (and may be discussed at the upcoming IETF meeting).

All of the preceding has probably made it sound as if the dynamic congestion algorithm is hairy but it's not. Like slow-start, it turns out to be three lines of code and one new connection state variable (the combined slow-start/congestion control algorithm is described in the second of the attached msgs).
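To give a feel for what that amounts to when combined with slow-start, here is a rough C sketch assuming d = .5, u = 1 and windows counted in packets. The names and the bookkeeping are illustrative, not taken from the 4bsd sources; the real description is in the attached messages:

    /* Rough sketch of the combined slow-start / congestion avoidance
     * policy with d = .5 and u = 1, windows in packets. */

    unsigned int snd_cwnd = 1;          /* congestion window */
    unsigned int snd_ssthresh = 65535;  /* where exponential opening stops
                                         * and linear probing starts */
    unsigned int acks_this_window;      /* bookkeeping for additive increase */

    /* A retransmit timeout is taken as the `network is congested' signal:
     * remember half the window that got us into trouble (multiplicative
     * decrease, d = .5) and slow-start back up toward it. */
    void
    tcp_timeout(void)
    {
            unsigned int half = snd_cwnd / 2;

            snd_ssthresh = (half < 2) ? 2 : half;
            snd_cwnd = 1;
            acks_this_window = 0;
    }

    /* Each ack for new data opens the window: by one packet per ack while
     * below ssthresh (slow start -- the window doubles each rtt), and by
     * about one packet per round trip above it (additive increase, u = 1). */
    void
    tcp_ack_newdata(void)
    {
            if (snd_cwnd < snd_ssthresh) {
                    snd_cwnd += 1;
            } else if (++acks_this_window >= snd_cwnd) {
                    acks_this_window = 0;
                    snd_cwnd += 1;
            }
    }

The third rule of the slow-start sketch still applies: what actually goes out is the minimum of snd_cwnd and the receiver's advertised window, which is what lets the sender's pipe-size estimate and the receiver's buffer size be the two separate numbers mentioned in the aside above.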
That covers about everything that's been done to date. It's about 1/3 of what we think clearly needs to be done.

The next big step is to do the gateway `congestion detection' algorithms so that a signal is sent to the endnodes as early as possible (but not so early that the gateway ends up starved for traffic). The way we anticipate doing these algorithms, gateway `self protection' from a misbehaving host will fall out for free (that host will simply have most of its packets dropped as the gateway tries to tell it that it's using more than its fair share). Thus, like the endnode algorithm, the gateway algorithm will improve things even if no endnode is modified to do dynamic congestion avoidance. And nodes that do implement congestion avoidance will get their fair share of bandwidth and a minimum number of packet drops.

Since congestion grows exponentially, detecting it early is important. (If it's detected early, small adjustments to the senders' windows will cure it. Otherwise massive adjustments will be necessary to give the net enough spare capacity to pump out the backlog.) But, given the bursty nature of traffic, reliable detection is a non-trivial problem. There is a scheme proposed in DEC TR-506 but we think it might have convergence problems under high load and/or significant second-order dynamics in the traffic. We are thinking of using some of our earlier work on ARMAX models for rtt/queue length prediction as the basis of the detection. Preliminary results suggest that this approach works well at high load, is immune to second-order effects in the traffic and is cheap enough to compute that it wouldn't slow down thousand-packet-per-second gateways.

In addition to the gateway algorithms, we want to apply the endnode algorithms to connectionless traffic (e.g., domain server queries, RPC requests). We have the rate-based equivalent of the TCP window algorithm worked out (there should be a message describing it on the tcp list sometime in the near future). Sun Microsystems has been very interested, and supportive, during the development of the TCP congestion control (I believe Sun OS 4.0 will ship with our new TCP) and Sun has offered to cooperate in developing the RPC dynamic congestion control algorithms, using NFS as a test-bed (since NFS is known to have congestion problems).

The far future holds some work on controlling second-order, non-ergodic effects, adaptively determining the rate constants in the control algorithm, and some other things that are too vague to mention.

 - Van