From van@lbl-csam.arpa Mon Mar 7 00:59:17 1988
Posted-Date: Mon, 07 Mar 88 00:57:36 PST
Received-Date: Mon, 7 Mar 88 00:59:17 PST
Received: from LBL-CSAM.ARPA by venera.isi.edu (5.54/5.51) id AA04737; Mon, 7 Mar 88 00:59:17 PST
Received: by lbl-csam.arpa (5.58/1.18) id AA03395; Mon, 7 Mar 88 00:57:38 PST
Message-Id: <8803070857.AA03395@lbl-csam.arpa>
To: Jon Crowcroft, KSEO@g.bbn.com
Cc: end2end-interest@venera.isi.edu
Subject: Re: Latest TCP measurements thoughts
In-Reply-To: Your message of Wed, 02 Mar 88 11:16:22 GMT.
Date: Mon, 07 Mar 88 00:57:36 PST
From: Van Jacobson
Status: R

Jon & Karen -

I was off at the ietf meeting for the past week but I've just ftp'd the jotun-ego tcpdump trace and looked at some of the behavior. I suspect that the mysterious congestion window & ack clumping effects have little to do with this trace. All the behavior I see could be explained by Satnet's ridiculously high error rate (on this test, the BER was at least 1.8e-6 errors/bit).

Like Jon, I see absolutely no sign of two acks getting stuck together by PODA. Just the opposite, in fact. The acks (and the packets they generate) get nicely spread out over the entire 2 second rtt. This is gratifying (to me) because (a) this is exactly what the congestion control mods are supposed to do, and (b) I can't think of a good reason for ack clumping on this test (the clumping we saw before had to do with an *echo test* where there's a peculiar phase relationship between the data packets and acks).

I see 4 events where the window increase policy has hit the bandwidth limit & been forced to back off. In all four cases, the bandwidth demand just before the loss was 4KBs (38Kbs on the wire -- I'll give user data throughput in KBytes/sec and the equivalent wire throughput in Kbits/sec) -- a number that seems awfully close to Claudio's theoretical max. In all four cases the average throughput loss-to-loss was 3.9KBs (36Kbs) and the retransmission timing shows that a large queue had built up prior to the loss. Since the channel always had a queue to work on, it stayed busy 100% of the time and the retransmission time had no effect on the average throughput. Thus the bandwidth lost to the congestion control code was the cost of sending one packet through the wire twice. The average distance between losses was ~200KB and a packet is ~.5KB so the cost of the congestion control `testing' was 1 part in 400 or .25% -- well under our design goal of <= 1%.

But, in addition to the 4 congestion control losses, there are 13 packet losses that are clearly random (i.e., unrelated to the bandwidth demand). There were 998KB xferred in the portion of the trace I could ftp (BTW Jon, do you realize that the tcp you're running on purple has a bug that will cause it to randomly truncate a send if a FIN has to be retransmitted?). 998KB @ 420 B/packet = 2400 packets. 17 losses for 2400 packets --> .7% avg loss rate or 144 packets between losses on the average. The Satnet pipe size is 2.2 sec * ~4KBs (user data bandwidth) = 8.8KB / 420 B/packet = 21 packets. Thus for every 144/21 = 7 windows of data we transfer, we let the pipe sit empty for 2 rtt while we recover from a loss. So the best throughput any tcp could get at this error rate is 7/9 = 78% of the available bandwidth (a window per rtt, so 7 rtt busy plus 2 rtt idle out of every 9). You lose another 25% of this because of the xtcp window shrinking so I would expect something like .75*.78 = 58% of the available bandwidth or ~2KBs. Overall you got 1.6KBs (there was a strange event at about 400 sec. that accounts for the 25% discrepancy -- the channel seemed to go crazy and dropped 8 packets in a 60 second interval. The drop spacing was just right to get the window closed down to minimum.)
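Purely for reference, here is the arithmetic above as a small C sketch. Every input is a figure quoted from the trace (998KB transferred, 420 B/packet, 17 losses, 2.2 sec rtt, ~4KBs user bandwidth, the 25% window-shrinking hit); the program just redoes the divisions, so it reproduces the numbers above to within rounding.

/* Back-of-the-envelope check of the Satnet numbers above.
 * All inputs are figures quoted from the trace.
 */
#include <stdio.h>

int main(void)
{
    double xfer_bytes    = 998e3;  /* bytes transferred in the trace  */
    double pkt_bytes     = 420.0;  /* avg user data per packet        */
    double losses        = 17.0;   /* 4 congestion + 13 random drops  */
    double rtt           = 2.2;    /* seconds                         */
    double user_bw       = 4e3;    /* ~4 KBytes/sec user data         */
    double window_shrink = 0.75;   /* xtcp window-shrinking factor    */

    double packets       = xfer_bytes / pkt_bytes;      /* ~2400        */
    double loss_rate     = losses / packets;            /* ~0.7%        */
    double pkts_per_loss = packets / losses;            /* ~144         */
    double pipe_pkts     = rtt * user_bw / pkt_bytes;   /* ~21 packets  */

    /* ~7 windows (7 rtt) of sending, then ~2 rtt idle recovering */
    double windows  = pkts_per_loss / pipe_pkts;
    double best     = windows / (windows + 2.0);        /* ~7/9 = 78%   */
    double expected = window_shrink * best * user_bw;   /* ~2 KBytes/s  */

    printf("loss rate      %.2f%%\n", 100.0 * loss_rate);
    printf("pipe size      %.0f packets\n", pipe_pkts);
    printf("best possible  %.0f%% of available bandwidth\n", 100.0 * best);
    printf("expected rate  %.1f KBytes/sec\n", expected / 1e3);
    return 0;
}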
I don't see anywhere where the packet re-ordering (and there's a lot of re-ordering) has made any difference. You could get a small improvement (~15%) by setting the sender's window size to the pipesize (8KB) rather than twice the pipesize (16KB). You could get a large improvement (~90%) by getting the error rate down from 1% to around .1%. Based on the loss statistics, it doesn't look like there would be any advantage in dropping the MTU to prevent IP fragmentation (the opposite, in fact: it looks like a smaller MTU would result in a 12% throughput loss -- BTW, the max IP packet size used to be 252 bytes. In this trace it's suddenly 244 bytes. Where did the other 8 bytes go?). From this data, I don't see where changing the congestion control or ack policies would buy anything.

If you're looking for lessons from Satnet, I can think of a few. One is that the error rate on a link with a large bandwidth-delay product *must* be small, at most .01/(bandwidth*delay), or you're going to take a big throughput hit (a transcontinental DS3 link has a 4Mb bandwidth*delay. It's not clear to me that people planning such links realize that a BER of 5e-9 is way too much). Another is that any bottleneck needs to have 2*bandwidth*delay worth of buffering *at the bottleneck* or it's going to be `fragile'. And a lot of buffer upstream, like in the Butterfly before a SIMP, might be worse than no buffer at all. (A quick numerical check of both of these rules is sketched at the end of this note.)

 - Van

ps- Karen, we don't ``rely on the receiving TCP implementation's ack policy''. If we happen to be talking through that brain-damaged Lisp machine that forwards packets in reverse order, we will send data as well as or better than any other TCP. If we happen to be talking to a ka9q PC that generates nothing for out-of-order packets, we will send data to it at least as well as any other TCP -- in fact, we have *measured* xtcp -> ka9q vs. ka9q -> ka9q and we ship more data with fewer retransmits. If we happen to be talking to a TCP that did obvious, sensible things in its implementation, like not gratuitously scrambling packets or waiting forever to spit back acks, I think we've shown that xtcp does very well indeed compared to other TCPs. Since 99+% of the TCPs in the world are running 4BSD TCP variants that made (mostly) reasonable design decisions, the xtcp algorithms tend to win far more often than they lose. In other words, we don't expect or require that the other end be sane. But if it is, we'll try to have an intelligent conversation with it. The alternative is treating everything as if it were mad. The Arpanet collapse suggests that this simply amplifies the madness.
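The quick numerical check promised above, again as a small C sketch, covers the .01/(bandwidth*delay) error-rate bound and the 2*bandwidth*delay buffering rule. The Satnet figures (38Kbs on the wire, 2.2 sec rtt, 1.8e-6 measured BER) are from this trace; for the DS3 case, 45 Mbit/s is the DS3 line rate and the ~90 ms rtt is an assumed figure chosen only to reproduce the 4Mb pipe mentioned above. `Delay' is taken to be the rtt, as in the pipe-size calculation earlier.

/* Check of the error-rate and buffering rules above:
 * tolerable BER <= .01 / (bandwidth*delay), with delay taken
 * as the rtt (as in the pipe-size calculation in this note).
 */
#include <stdio.h>

static double max_ber(double bits_per_sec, double rtt_sec)
{
    double bdp_bits = bits_per_sec * rtt_sec;  /* bits in flight            */
    return 0.01 / bdp_bits;                    /* <= 1% of a pipe-full lost */
}

int main(void)
{
    /* Satnet, from this trace: ~38 Kbit/s on the wire, ~2.2 sec rtt   */
    printf("satnet: max BER %.1e (measured 1.8e-06)\n",
           max_ber(38e3, 2.2));

    /* Transcontinental DS3: 45 Mbit/s at an assumed ~90 ms rtt gives  */
    /* roughly the 4Mb pipe mentioned above.                           */
    printf("ds3:    max BER %.1e (5e-09 is too much)\n",
           max_ber(45e6, 0.090));

    /* Bottleneck buffering rule: 2 * bandwidth * delay                */
    printf("ds3 bottleneck buffer: %.1f Mbit\n",
           2.0 * 45e6 * 0.090 / 1e6);
    return 0;
}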