Path: kinetics!zehntel!varian!ptsfa!ames!eos!aurora!labrea!decwrl!decvax!ucbvax!LBL-CSAM.ARPA!van
From: van@LBL-CSAM.ARPA (Van Jacobson)
Newsgroups: comp.protocols.tcp-ip
Subject: more on tcp congestion control
Message-ID: <8802230539.AA24469@lbl-csam.arpa>
Date: 23 Feb 88 05:39:37 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 257

Phil -

Thanks for correcting the CUTE reference (I've got to start putting JSAC and COM in separate piles) and thanks for your interesting message on changes to the ka9q package.  You bring up a lot of things that I should have covered so I'm cc'ing this reply to the tcp/ip list (in the unlikely event that anyone else is interested).

> I suggest rounding your computed RTO values up to the next
> highest number of ticks when setting the timer.  This is
> especially important when you calculate retransmission timeouts
> based on mean deviations, since they can be very small.
> Otherwise it's possible for truncation effects to set the timer
> LESS than the current SRTT.

Good point.  I should have edited that timer msg I inserted because it contained several rounding errors.  The line

	rto = (Avg >> 3) + (Mdev >> 1);

should have read

	rto = ((Avg >> 2) + Mdev + 3) >> 1;

which is how things read in the timer rfc & the bsd code -- Avg could also be rounded but it doesn't make much practical difference.  The mysterious "+ 3" is half a tick for rounding plus one tick because the send is phased randomly with respect to the clock (i.e., there's a +-1/2 tick uncertainty in when the timer goes off).  Also, because of the timer uncertainty, the minimum value for rto should be two ticks.  You can save an "if (rto < 2) rto = 2" step, at the cost of a half-tick bias in rto, if you make the above

	rto = ((Avg >> 2) + Mdev + 4) >> 1;

> 3. When you refer to "packets" in the context of setting the
> congestion window in TCP, what do you mean?  One MSS?

Yes, by "packet" without a qualifier (such as "arbitrary size packet") I always mean a one-MSS-sized packet.

> I tried that first, but I ran into problems.  When I have an
> entire SLIP link to myself, the mean deviation sits at a very
> low value.  Thus the RTO sits just above the SRTT.  But when
> slowstart opens up the window to send more, the round trip time
> immediately increases beyond the retransmission timeout and an
> unnecessary retransmission occurs.  The congestion window
> oscillates rapidly, with an unnecessary retransmission in each
> cycle.

I think I understand the problem (although I think you're confusing the slowstart & congestion control algorithms -- more on that in the next section) but I'm not sure I agree with the analysis.  I run tcp stuff, bind & NFS over a 2400 baud slip link from home to work and I've spent a bit of time investigating the link's behavior (waiting for the damn slow link gave me a lot of time when I couldn't do anything but investigate the link behavior :-).

Having the rtt variance in addition to the rtt is a big help when you're trying to start a bandwidth-limited link.  Since errors in the variance decay quickly, you can afford to pump it up whenever you're uncertain about how long a send is going to take.  In particular, when starting we set the variance to a `big number' (the default is 3 sec but this can be overridden on a per-route basis) and the rtt to either our best estimate (also kept in the route) or zero.  After the syn packet, we'll know the rtt for a zero length packet and rttvar will get decayed by at most 25%.
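To pull the timer arithmetic together, here's a rough sketch of the initialization and rto computation (illustrative names, not the actual bsd code; it assumes Avg holds the smoothed rtt scaled by 8 and Mdev the mean deviation scaled by 4, which is what makes the rto below come out to srtt + 2*mdev, with everything measured in clock ticks):

	#define TICKS_PER_SEC	2			/* illustrative, e.g. a 500 ms timer tick */
	#define BIG_VARIANCE	(3 * TICKS_PER_SEC)	/* default 3 sec; per-route override */

	int Avg;	/* smoothed rtt, scaled by 8 */
	int Mdev;	/* mean deviation, scaled by 4 */
	int rto;	/* retransmit timeout, in ticks */

	void
	set_rto()
	{
		/* srtt + 2*mdev, plus half a tick for rounding and one tick
		 * for the random phase of the send relative to the clock.
		 * Using "+ 4" instead of "+ 3" biases rto up by half a tick
		 * and makes the two-tick minimum automatic. */
		rto = ((Avg >> 2) + Mdev + 3) >> 1;
		if (rto < 2)
			rto = 2;
	}

	/* at connection start: rtt from the route's estimate (or zero), and
	 * the variance pumped up to a `big number' that will decay quickly */
	void
	rtt_init(int route_rtt)
	{
		Avg = route_rtt << 3;
		Mdev = BIG_VARIANCE << 2;
		set_rto();
	}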
As long as the rtt for the first data packet changes by < 2*.75*bigNumber, we won't get a spurious timeout.  We're free to overestimate bigNumber initially because even a factor of 10 error will be `forgotten' in 8 packet exchanges.

The same scheme applies for the 2nd and future packet exchanges.  If the link is delay limited, increasing the window from one to two packets will have no effect.  If the link is bandwidth limited, doubling the window size will double the rtt.  In general, given that we know the rtt, R(W), for a window size of W, the worst case rtt change for a window size of W+dW is R(W)*dW/W.  We don't know if we'll see the worst case but the above reflects our "uncertainty" in the rtt of the next packet if we increase the window size.  Thus it should get lumped into rttvar:

	rttvar += rtt * dW/W;

During slowstart, dW is the max segment size, mss, and W is snd.cwnd so the above becomes

	rttvar += rtt * mss/snd.cwnd;

before each snd.cwnd change.  (The above needs appropriate scaling and rounding but they're implementation dependent.  Also, it overestimates the possible rtt change when snd.cwnd is small but that turns out to be desirable.)

> So I'm trying something else.  I initialize the congestion window
> at one MSS, but then I increase the congestion window by the
> actual amount of data acked each time an ACK comes in.  This
> effectively slows the slow-start algorithm so the timer can have
> more time to adapt.  Or at least the congestion window
> oscillation occurs at a much lower rate.  What do you think?

If you mean open cwnd by MIN(amount.acked, mss), that's a real interesting idea and I'm going to think about it.  The above timer hacking takes care of spurious retransmits but there's another problem:  Slowstart is designed to prevent hitting a gateway with bursts of packets.  If you send a few small packets when the connection starts, their acks will open the window so that if you suddenly decide to send a bunch of data, you'll hit the gateway with a burst.  This small/large switch is exactly what happens in an SMTP conversation & it looks like your mod will cure the problem.

But, if you mean what you said, I don't think it's a good idea.  Consider what happens when you send 1,2,3,4 and 1 is dropped.  2-4 are cached on the other side so when you retransmit 1 and fill the hole in the sequence space, all four packets will be acked.  If you increment by the amount acked, cwnd will get fully opened and you'll blast out a window's worth of back-to-back packets, almost guaranteeing another drop.  (I've shown the IETF lots of pictures of this kind of stable failure mode.)
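To make the difference between the two readings concrete, here's roughly what the clamped version of the slow-start ack processing would look like (a sketch with illustrative names; the scaling and rounding mentioned above are omitted):

	long snd_cwnd;		/* congestion window, bytes */
	long mss;		/* max segment size, bytes */
	long srtt, rttvar;	/* smoothed rtt & mean deviation */

	/* called during slow start for each ack that covers new data;
	 * "acked" is the number of bytes the ack covered */
	void
	slowstart_ack(long acked)
	{
		long incr = acked;

		/* clamp the opening to one mss: an ack that fills a hole in
		 * the sequence space may cover several packets at once, and
		 * opening by the raw amount acked would let the next window
		 * go out as one back-to-back burst */
		if (incr > mss)
			incr = mss;

		/* pump up rttvar before each window change (see above) */
		rttvar += srtt * mss / snd_cwnd;

		snd_cwnd += incr;
	}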
> 4. I'm confused by the modified slow-start algorithm you
> described, the one that uses a threshold to change the amount
> the congestion window is increased.  Again, what does it mean to
> increase the congestion window by 1/cwind?  Is cwind in packets
> (what size?) or bytes (increase the window by fractions of a
> byte?)

We found it was important to distinguish startup behavior from steady-state behavior.  (It sounds as if you're lumping the two together.)  The thing we call "slow start" just has to do with starting data flowing.  The algorithm has a very short lifetime -- it typically runs for the first 3-5 rtt after you first start sending on a connection or restart after a drop.  When slow start shuts down, the link is in equilibrium (the pipe is full and you've exchanged a full window of data so you know the ack "clock" is working).  At this point another algorithm, congestion avoidance, takes over.  This algorithm is responsible for estimating how much data is needed to fill the pipe under the current (time-varying) load.  The congestion avoidance & slowstart are separate algorithms with completely different objectives.  They just happen to run consecutively and just happen to accomplish their individual objectives by twiddling the same state variable (cwnd).

The congestion avoidance algorithm is described well in DEC technical report DEC-TR-506, "Congestion Avoidance in Computer Networks with a Connectionless Network Layer" by Raj Jain, K. K. Ramakrishnan and Dah-Ming Chiu, August, 1987.  The idea is that when you try to use a window larger than the bandwidth*delay of the network path, the excess window won't fit in the pipe so it must show up in a queue somewhere.  In the usual case, "somewhere" will be the outbound queue of the slowest link in the path.  That link will probably be a bottleneck for a lot of other conversations and it's likely its queue will fill up and overflow.  We treat this as a signal that our window is too large and reduce it.  (Jain doesn't wait for an overflow -- his signal is a bit in the packet that the gateway can set to say it's congested.)

But the `bandwidth' in the delay-bandwidth product is *your share* of the fixed link bandwidth and your share can change as conversations using the same path come and go.  The gateway tells you when you're using too much but says nothing when you're using too little.  To find out if someone has shut down and your share is now larger, you continuously test by slowly increasing your window size until the gateway says you're using too much.  (Thus the window will oscillate around some mean value.  The algorithm is designed to do this.  But we've tried to set the adjustable parameters that determine the oscillation so that the bandwidth lost to testing the path is at most 1%.)

From a linear system argument, you can show that the best probe policy is to make your bandwidth vs. time obey dB/dt = C for some constant C.  Since B = W / R, where W is the window size and R is the round trip time, and R is constant, this says you want dW/dt = c for some other constant c.  In other words, in one rtt we want the window to go smoothly from size W to size W+c.  I want the algorithm to be "self clocked" so we go looking for how to do the above change using acks rather than time to drive the dW.  If the max packet size is mss and we have decent silly window code, a window of W will generate at most W/mss acks and it will generate them spread out over one rtt.  Thus if we increment by c/(W/mss) = c*mss/W per ack, we will have opened the window by c after one rtt.  An appropriate c for current arpanet conditions turns out to be one mss so the increment part of the congestion avoidance turns into

	snd.cwnd += mss*mss / snd.cwnd;

(You might have to worry about the fixed-point in this if you were trying to send small packets over a net with a large bandwidth delay product but it's not a problem in any net I know of.)
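Putting the two regimes side by side, the per-ack window opening comes out roughly like this (again a sketch with illustrative names, not the actual 4.3 code; ssthresh is the threshold your question 4 refers to, and the rttvar pumping and fixed-point details from above are omitted):

	long snd_cwnd;		/* congestion window, bytes */
	long snd_ssthresh;	/* threshold between the two regimes */
	long snd_wnd;		/* offered window, bytes */
	long mss;		/* max segment size, bytes */

	/* called for each ack that covers new data */
	void
	open_window()
	{
		if (snd_cwnd < snd_ssthresh)
			/* slow start: open by one packet per ack, so the
			 * window doubles every rtt until the pipe fills */
			snd_cwnd += mss;
		else
			/* congestion avoidance: open by c*mss/W per ack
			 * with c = one mss, i.e. about one packet per rtt */
			snd_cwnd += mss * mss / snd_cwnd;

		if (snd_cwnd > snd_wnd)		/* never exceed the offered window */
			snd_cwnd = snd_wnd;
	}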
> 5. You didn't specifically mention ICMP source quench, but I
> assume you mean that the congestion window should be cut in half
> each time you get one.  To make this meaningful I also limit the
> congestion window to be no more than the offered window.  One
> question: should the congestion window be allowed to decrease
> below one MSS?  I would say yes, but I'd like to hear your
> opinion.

I didn't mention source quench because it's a disaster and I keep hoping it will go away.  There must be at least one rtt of hysteresis or "dead time" for any control algorithm to be stable.  By using packet drops and timeouts to drive things, we get this automatically.  But it's very, very hard to get the right kind of hysteresis with source quench.  It's also possible to show that source quench leads to a very unfair bandwidth allocation -- SQ is preferentially delivered to the hosts using the smallest windows (I have lots of trace data showing this and can explain why it happens but the explanation is mathematical and quite involved).

So, since SQ currently happens only when one or more packets have been dropped, all we do is close snd.cwnd down to one packet (so no more will get dropped) but we don't touch snd.ssthresh (because the slowstart time is short, ssthresh is the crucial window control, not cwnd).  Ssthresh and the rest of the congestion control stuff will get handled correctly, and with the right "time constants", when we eventually take the timeout for the lost packet.

We limit cwnd to be at most the offered window (you violate rfc793 if you ever send more than the offered window).  I don't think it's a good idea to let cwnd get below one mss.  Dropping cwnd below mss forces you to send small packets, increasing the overhead per data byte xferred.  Thus you are using the network less efficiently at a time when it's heavily loaded and you really want to make the most efficient possible use of it.  This is an unstable positive feedback that can lead to collapse.  If mss is set correctly, it never makes sense to send a smaller packet.

> 6. It's interesting to note that the "set the congestion window
> to one packet after a timeout" rule automatically implies the
> "first-only retransmission" rule, thus simplifying my code
> somewhat.  In effect, a timeout is just a very drastic source
> quench.

Yes, there were several sort-of ad-hoc pieces of code, like the first-only rexmit and all the source quench handling, that were taken out because their job was handled in a more unified way by the slowstart/cong. avoidance framework -- it was very satisfying.  We removed something like 15 lines and replaced them with 6.  Of course, we then stuck in your clamped retransmit backoff algorithm, a fast retransmit algorithm (that got described at one of the IETF meetings) and a few more goodies, and ended up +3 lines from the original 4.3 code instead of the -9 we expected (not including, of course, dozens of lines of bug fixes).

I prefer to think of source quench as a poorly considered, ill behaved timeout :-).

 - Van