Date: 10 Mar 88 03:45:47 GMT
From: van@LBL-CSAM.ARPA (Van Jacobson)
Subject: Re: maximum Ethernet throughput
Article: 167 of comp.protocols.tcp-ip
Newsgroups: comp.protocols.tcp-ip
Sender: usenet@ucbvax.BERKELEY.EDU

I don't know what would constitute an "official" confirmation
but maybe I can put some rumors to bed.  We have done a TCP that
gets 8Mbps between Sun 3/50s (the lowest we've seen is 7Mbps,
the highest 9Mbps -- when using 100% of the wire bandwidth, the
ethernet exponential backoff makes throughput very sensitive to the
competing traffic distribution.)  The throughput limit seemed to
be the Lance chip on the Sun -- the CPU was showing 10-15% idle
time.  I don't believe the idle time number (I want to really
measure the idle time with a microprocessor analyzer) but the
interactive response on the machines was pretty good even while
they were shoving 1MB/s at each other so I know there was some
CPU left over.

Yes, I did crash most of our VMS vaxen while running throughput
tests and no, this has nothing to do with Sun violating
protocols -- the problem was that the DECNET designers failed to
use common sense.  I fired off a 1GB transfer to see if it would
really finish in 20 minutes (it took 18 minutes) and halfway
through I noticed that our VMS 780 was rebooting.  When I later
looked at the crash dump I found that it had run out of non-paged
pool because the DEUNA queue was full of packets.  It seems that
whoever did the protocols used a *linear* backoff on the
retransmit timer.  With 20 DECNET routers trying to babble the
state of the universe every couple of minutes, and my Suns
keeping the wire warm in the interim, any attempt to access the
ether was going to put a host into serious exponential backoff.
Under these circumstances, a linear transport timer just doesn't
cut it.  So I found 25 retransmissions in the outbound queue for
every active DECNET connection.  I know as little about VMS as
possible so I didn't investigate why the machine had crashed
rather than terminating the connections gracefully.  I should
also note that NFS on our other Sun workstations wasn't all that
happy about waiting for the wire:  As I walked around the building,
every Sun screen was filled with "server not responding" messages.
(But no Sun crashed -- I later shut most of them down to keep ND
traffic off the wire while I was looking for the upper bound on
xfer rate.)
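
To make the linear-versus-exponential point concrete, here is a minimal
sketch in C. It is not DECNET's or anyone's actual timer code: the base
timeout, the cap, and every function name below are assumptions chosen
for illustration. It counts how many copies of one packet a sender would
queue during a stretch in which no ack gets through.

```c
#include <assert.h>

/* Hypothetical timer constants in milliseconds; not taken from the post. */
enum { BASE_RTO_MS = 1000, MAX_RTO_MS = 64000 };

/* Linear backoff: the timeout grows by one fixed step per retry. */
static unsigned linear_rto(unsigned retry)
{
    return BASE_RTO_MS * (retry + 1);
}

/* Exponential backoff: the timeout doubles per retry, up to a cap. */
static unsigned exponential_rto(unsigned retry)
{
    unsigned rto = BASE_RTO_MS;

    while (retry-- > 0 && rto < MAX_RTO_MS)
        rto *= 2;
    return rto < MAX_RTO_MS ? rto : MAX_RTO_MS;
}

/* Count the retransmissions that fire within window_ms when no ack
 * ever arrives. rto(n) is the wait before retransmission number n. */
static unsigned retransmits_within(unsigned (*rto)(unsigned),
                                   unsigned window_ms)
{
    unsigned elapsed = 0, n = 0;

    assert(rto != 0);
    for (;;) {
        elapsed += rto(n);
        if (elapsed > window_ms)
            return n;
        n++;
    }
}
```

Calling retransmits_within() with linear_rto and then exponential_rto
over the same window shows the linear timer firing far more often, and
each firing is another packet waiting in the interface queue while the
wire is busy.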

I did run two simultaneous 100MB transfers between four 3/50s and
verified that they were gracious about sharing the wire.  The
total throughput was 7Mbps split roughly 60/40.  The tcpdump
trace of the two conversations has some holes in it (tcpdump
can't quite hack a packet/millisecond, steady state) but the
trace doesn't show anything weird happening.

Quite a bit of the speedup comes from an algorithm that we (`we'
refers to collaborator Mike Karels and myself) are calling
"header prediction".  The idea is that if you're in the middle
of a bulk data transfer and have just seen a packet, you know
what the next packet is going to look like:  It will look just
like the current packet with either the sequence number or ack
number updated (depending on whether you're the sender or
receiver).  Combining this with the "Use hints" epigram from
Butler Lampson's classic "Hints for Computer System Design", you
start to think of the tcp state (rcv.nxt, snd.una, etc.) as
"hints" about what the next packet should look like.

If you arrange those "hints" so they match the layout of a tcp
packet header, it takes a single 14-byte compare to see if your
prediction is correct (3 longword compares to pick up the send &
ack sequence numbers, header length, flags and window, plus a
short compare on the length).  If the prediction is correct,
there's a single test on the length to see if you're the sender
or receiver followed by the appropriate processing.  E.g., if
the length is non-zero (you're the receiver), checksum and
append the data to the socket buffer then wake any process
that's sleeping on the buffer.  Update rcv.nxt by the length of
this packet (this updates your "prediction" of the next packet).
Check if you can handle another packet the same size as the
current one.  If not, set one of the unused flag bits in your
header prediction to guarantee that the prediction will fail on
the next packet and force you to go through full protocol
processing.  Otherwise, you're done with this packet.  So, the
*total* tcp protocol processing, exclusive of checksumming, is
on the order of 6 compares and an add.  The checksumming goes
at whatever the memory bandwidth is so, as long as the effective
memory bandwidth is at least 4 times the ethernet bandwidth, the
cpu isn't a bottleneck.  (Let me make that clearer: we got 8Mbps
with checksumming on).
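
The receive-side fast path described above might be sketched in C as
follows. This is an illustration under stated assumptions, not the code
from the kernel in question: the struct layouts, the names, the
HP_POISON bit and the slow-path counter are all hypothetical, and
checksumming, the socket-buffer append and the wakeup are reduced to a
comment. The sender side (zero-length segments) simply falls through to
the slow path here.

```c
#include <assert.h>
#include <stdint.h>

/* Header fields covered by the prediction: three longwords and a short.
 * Illustrative layout, not the BSD structures. */
struct seg_key {
    uint32_t seq;            /* sequence number */
    uint32_t ack;            /* acknowledgment number */
    uint32_t off_flags_win;  /* header length, flags and window, packed */
    uint16_t len;            /* data length */
};

/* Hypothetical choice of an unused header bit used to spoil the match. */
#define HP_POISON 0x00400000u

struct conn {
    struct seg_key predict;   /* tcp state kept as a "hint" */
    uint32_t rcv_nxt;         /* next sequence number expected */
    uint32_t buf_space;       /* free bytes in the socket buffer */
    unsigned slow_path_calls; /* stand-in for full protocol processing */
};

/* Three longword compares plus one short compare. */
static int hp_match(const struct seg_key *p, const struct seg_key *s)
{
    return p->seq == s->seq && p->ack == s->ack &&
           p->off_flags_win == s->off_flags_win && p->len == s->len;
}

/* Returns 1 if the fast path consumed the segment, 0 if it was handed
 * to (a stand-in for) full processing, which in a real stack would
 * also rebuild the prediction. */
static int tcp_input_sketch(struct conn *c, const struct seg_key *s)
{
    if (!hp_match(&c->predict, s) || s->len == 0) {
        c->slow_path_calls++;
        return 0;
    }
    /* Receiver fast path. Checksum, append to the socket buffer and
     * wake any sleeping reader would go here. */
    assert(c->buf_space >= s->len); /* holds if the hint was set up sanely */
    c->buf_space -= s->len;
    c->rcv_nxt += s->len;
    c->predict.seq = c->rcv_nxt;    /* update the prediction */

    /* No room for another segment this size: force the next one
     * through full processing by making the prediction unmatchable. */
    if (c->buf_space < s->len)
        c->predict.off_flags_win |= HP_POISON;
    return 1;
}
```

Feeding this a run of equal-sized, in-order segments keeps it on the
fast path until the buffer can no longer hold another segment of that
size, at which point the next segment takes the slow path.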

You can apply the same idea to outgoing tcp packets and most
everywhere else in the protocol stack.  I.e., if you're going
fast, it's real likely this packet comes from the same place
the last packet came from so 1-behind caches of pcb's and arp
entries are a big win if you're right and a negligible loss if
you're wrong.
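
A 1-behind cache in front of a pcb lookup could look roughly like the
C sketch below. The structures, field names and hit/miss counters are
assumptions for illustration and do not reproduce the BSD definitions;
the point is only that a hit costs one compare and a miss costs one
compare more than the plain list walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified protocol control block; hypothetical layout. */
struct pcb {
    uint32_t faddr, laddr;  /* foreign and local address */
    uint16_t fport, lport;  /* foreign and local port */
    struct pcb *next;
};

struct pcb_table {
    struct pcb *head;       /* full list, searched on a miss */
    struct pcb *last;       /* 1-behind cache: result of the last lookup */
    unsigned hits, misses;  /* counters for illustration only */
};

static int pcb_eq(const struct pcb *p, uint32_t fa, uint16_t fp,
                  uint32_t la, uint16_t lp)
{
    return p->faddr == fa && p->fport == fp &&
           p->laddr == la && p->lport == lp;
}

/* Try the cached pcb first and walk the list only when that fails. */
static struct pcb *pcb_lookup(struct pcb_table *t, uint32_t fa, uint16_t fp,
                              uint32_t la, uint16_t lp)
{
    struct pcb *p;

    assert(t != NULL);
    if (t->last != NULL && pcb_eq(t->last, fa, fp, la, lp)) {
        t->hits++;
        return t->last;
    }
    t->misses++;
    for (p = t->head; p != NULL; p = p->next) {
        if (pcb_eq(p, fa, fp, la, lp)) {
            t->last = p;    /* remember it for the next packet */
            return p;
        }
    }
    return NULL;
}
```

The same shape would apply to an arp table: keep a pointer to the entry
used for the previous packet and check it before searching.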

In addition to the header prediction, I put some horrible kluges
in the mbuf handling to get the allocation/deallocations down to
1 per packet.  Mike Karels has been working in the same area and
his clean code is faster than my kluges.  As soon as this
semester is over, I plan to merge Mike's and my versions then
the two of us will probably make a pass at knocking off the
biggest of the rough edges.  Sometime in late spring or early
summer we should be passing this code out to hardy souls for
beta-testing (but ask yourself: do you really want workstations
that routinely use 100% of the ethernet bandwidth?  I'm pretty
sure we don't and we're not running this tcp on any of our
workstations.)

Some of the impetus for this work came from Greg Chesson's
statement at the Phoenix USENIX that `the way TCP and IP are
designed, it's impossible to make them go fast'.  On hearing
this, I leapt to my feet to protest but decided that saying "I
think you're wrong" wasn't going to change anybody's mind about
anything.  Now I can say that I'm pretty sure he was wrong about
TCP (but that's *not* a comment on the excellent work he's doing
and has done on the XTP and the Protocol Engine).

The header prediction algorithm evolved during attempts to make
my 2400-baud SLIP dial-up send 4 bytes when I typed a character
rather than 44.  After staring at packet streams for a while, it
was pretty obvious that the receiver could predict everything
about the next packet on a TCP data stream except for the data
bytes.  Thus all the sender had to ship in the usual case was
one bit that said "yes, your prediction is right" plus the data.
I mention this because funding agents looking for high speed,
next-generation networks may forget that research to make slow
things go fast sometimes makes fast things go faster.

 - Van