From van@lbl-csam.arpa Mon Mar 7 00:59:17 1988
Posted-Date: Mon, 07 Mar 88 00:57:36 PST
Received-Date: Mon, 7 Mar 88 00:59:17 PST
Received: from LBL-CSAM.ARPA by venera.isi.edu (5.54/5.51) id AA04737; Mon, 7 Mar 88 00:59:17 PST
Received: by lbl-csam.arpa (5.58/1.18) id AA03395; Mon, 7 Mar 88 00:57:38 PST
Message-Id: <8803070857.AA03395@lbl-csam.arpa>
To: Jon Crowcroft, KSEO@g.bbn.com
Cc: end2end-interest@venera.isi.edu
Subject: Re: Latest TCP measurements thoughts
In-Reply-To: Your message of Wed, 02 Mar 88 11:16:22 GMT.
Date: Mon, 07 Mar 88 00:57:36 PST
From: Van Jacobson
Status: R

Jon & Karen -

I was off at the ietf meeting for the past week but I've just ftp'd the jotun-ego tcpdump trace and looked at some of the behavior. I suspect that the mysterious congestion window & ack clumping effects have little to do with this trace. All the behavior I see could be explained by Satnet's ridiculously high error rate (on this test, the BER was at least 1.8e-6 errors/bit).

Like Jon, I see absolutely no sign of two acks getting stuck together by PODA. Just the opposite, in fact. The acks (and the packets they generate) get nicely spread out over the entire 2 second rtt. This is gratifying (to me) because (a) this is exactly what the congestion control mods are supposed to do, and (b) I can't think of a good reason for ack clumping on this test (the clumping we saw before had to do with an *echo test* where there's a peculiar phase relationship between the data packets and acks).

I see 4 events where the window increase policy has hit the bandwidth limit & been forced to back off. In all four cases, the bandwidth demand just before the loss was 4KBs (38Kbs on the wire -- I'll give user data throughput in KBytes/sec and the equivalent wire throughput in Kbits/sec) -- a number that seems awfully close to Claudio's theoretical max. In all four cases the average throughput loss-to-loss was 3.9KBs (36Kbs) and the retransmission timing shows that a large queue had built up prior to the loss. Since the channel always had a queue to work on, it stayed busy 100% of the time and the retransmission time had no effect on the average throughput. Thus the bandwidth lost to the congestion control code was the cost of sending one packet through the wire twice. The average distance between losses was ~200KB and a packet is ~.5KB so the cost of the congestion control `testing' was 1 part in 400 or .25% -- well under our design goal of <= 1%.

But, in addition to the 4 congestion control losses, there are 13 packet losses that are clearly random (i.e., unrelated to the bandwidth demand). There were 998KB xferred in the portion of the trace I could ftp (BTW Jon, do you realize that the tcp you're running on purple has a bug that will cause it to randomly truncate a send if a FIN has to be retransmitted?). 998KB @ 420 B/packet = 2400 packets. 17 losses for 2400 packets --> .7% avg loss rate or 144 packets between losses on the average. The Satnet pipe size is 2.2 sec * ~4KBs (user data bandwidth) = 8.8KB / 420 B/packet = 21 packets. Thus for every 144/21 = 7 windows of data we transfer, we let the pipe sit empty for 2 rtt while we recover from a loss. So the best throughput any tcp could get at this error rate is 7/9 = 78% of the available bandwidth (a window per rtt, so 7 rtt busy plus 2 rtt idle out of every 9). You lose another 25% of this because of the xtcp window shrinking so I would expect something like .75*.78 = 58% of the available bandwidth or ~2KBs. Overall you got 1.6KBs (there was a strange event at about 400 sec. that accounts for the 25% discrepancy -- the channel seemed to go crazy and dropped 8 packets in a 60 second interval. The drop spacing was just right to get the window closed down to minimum.)
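Purely for reference, here is the arithmetic above as a small C sketch. Every input is a figure quoted from the trace (998KB transferred, 420 B/packet, 17 losses, 2.2 sec rtt, ~4KBs user bandwidth, the 25% window-shrinking hit); the program just redoes the divisions, so it reproduces the numbers above to within rounding.

/* Back-of-the-envelope check of the Satnet numbers above.
 * All inputs are figures quoted from the trace.
 */
#include <stdio.h>

int main(void)
{
    double xfer_bytes    = 998e3;  /* bytes transferred in the trace  */
    double pkt_bytes     = 420.0;  /* avg user data per packet        */
    double losses        = 17.0;   /* 4 congestion + 13 random drops  */
    double rtt           = 2.2;    /* seconds                         */
    double user_bw       = 4e3;    /* ~4 KBytes/sec user data         */
    double window_shrink = 0.75;   /* xtcp window-shrinking factor    */

    double packets       = xfer_bytes / pkt_bytes;      /* ~2400        */
    double loss_rate     = losses / packets;            /* ~0.7%        */
    double pkts_per_loss = packets / losses;            /* ~144         */
    double pipe_pkts     = rtt * user_bw / pkt_bytes;   /* ~21 packets  */

    /* ~7 windows (7 rtt) of sending, then ~2 rtt idle recovering */
    double windows  = pkts_per_loss / pipe_pkts;
    double best     = windows / (windows + 2.0);        /* ~7/9 = 78%   */
    double expected = window_shrink * best * user_bw;   /* ~2 KBytes/s  */

    printf("loss rate      %.2f%%\n", 100.0 * loss_rate);
    printf("pipe size      %.0f packets\n", pipe_pkts);
    printf("best possible  %.0f%% of available bandwidth\n", 100.0 * best);
    printf("expected rate  %.1f KBytes/sec\n", expected / 1e3);
    return 0;
}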
I don't see anywhere where the packet re-ordering (and there's a lot of re-ordering) has made any difference. You could get a small improvement (~15%) by setting the sender's window size to the pipesize (8KB) rather than twice the pipesize (16KB). You could get a large improvement (~90%) by getting the error rate down from 1% to around .1%. Based on the loss statistics, it doesn't look like there would be any advantage in dropping the MTU to prevent IP fragmentation (the opposite, in fact: it looks like a smaller MTU would result in a 12% throughput loss -- BTW, the max IP packet size used to be 252 bytes. In this trace it's suddenly 244 bytes. Where did the other 8 bytes go?). From this data, I don't see where changing the congestion control or ack policies would buy anything.

If you're looking for lessons from Satnet, I can think of a few. One is that the error rate on a link with a large bandwidth-delay product *must* be small, at most .01/(bandwidth*delay), or you're going to take a big throughput hit (a transcontinental DS3 link has a 4Mb bandwidth*delay. It's not clear to me that people planning such links realize that a BER of 5e-9 is way too much). Another is that any bottleneck needs to have 2*bandwidth*delay worth of buffering *at the bottleneck* or it's going to be `fragile'. And a lot of buffer upstream, like in the Butterfly before a SIMP, might be worse than no buffer at all. (A quick numerical check of both of these rules is sketched at the end of this note.)

 - Van

ps- Karen, we don't ``rely on the receiving TCP implementation's ack policy''. If we happen to be talking through that brain-damaged Lisp machine that forwards packets in reverse order, we will send data as well as or better than any other TCP. If we happen to be talking to a ka9q PC that generates nothing for out-of-order packets, we will send data to it at least as well as any other TCP -- in fact, we have *measured* xtcp -> ka9q vs. ka9q -> ka9q and we ship more data with fewer retransmits. If we happen to be talking to a TCP that did obvious, sensible things in its implementation, like not gratuitously scrambling packets or waiting forever to spit back acks, I think we've shown that xtcp does very well indeed compared to other TCPs. Since 99+% of the TCPs in the world are running 4BSD TCP variants that made (mostly) reasonable design decisions, the xtcp algorithms tend to win far more often than they lose. In other words, we don't expect or require that the other end be sane. But if it is, we'll try to have an intelligent conversation with it. The alternative is treating everything as if it were mad. The Arpanet collapse suggests that this simply amplifies the madness.
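The quick numerical check promised above, again as a small C sketch, covers the .01/(bandwidth*delay) error-rate bound and the 2*bandwidth*delay buffering rule. The Satnet figures (38Kbs on the wire, 2.2 sec rtt, 1.8e-6 measured BER) are from this trace; for the DS3 case, 45 Mbit/s is the DS3 line rate and the ~90 ms rtt is an assumed figure chosen only to reproduce the 4Mb pipe mentioned above. `Delay' is taken to be the rtt, as in the pipe-size calculation earlier.

/* Check of the error-rate and buffering rules above:
 * tolerable BER <= .01 / (bandwidth*delay), with delay taken
 * as the rtt (as in the pipe-size calculation in this note).
 */
#include <stdio.h>

static double max_ber(double bits_per_sec, double rtt_sec)
{
    double bdp_bits = bits_per_sec * rtt_sec;  /* bits in flight            */
    return 0.01 / bdp_bits;                    /* <= 1% of a pipe-full lost */
}

int main(void)
{
    /* Satnet, from this trace: ~38 Kbit/s on the wire, ~2.2 sec rtt   */
    printf("satnet: max BER %.1e (measured 1.8e-06)\n",
           max_ber(38e3, 2.2));

    /* Transcontinental DS3: 45 Mbit/s at an assumed ~90 ms rtt gives  */
    /* roughly the 4Mb pipe mentioned above.                           */
    printf("ds3:    max BER %.1e (5e-09 is too much)\n",
           max_ber(45e6, 0.090));

    /* Bottleneck buffering rule: 2 * bandwidth * delay                */
    printf("ds3 bottleneck buffer: %.1f Mbit\n",
           2.0 * 45e6 * 0.090 / 1e6);
    return 0;
}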