Date: 10 Mar 88 03:45:47 GMT
From: van@LBL-CSAM.ARPA (Van Jacobson)
Subject: Re: maximum Ethernet throughput
Article: 167 of comp.protocols.tcp-ip
Newsgroups: comp.protocols.tcp-ip
Sender: usenet@ucbvax.BERKELEY.EDU

I don't know what would constitute an "official" confirmation but maybe I can put some rumors to bed. We have done a TCP that gets 8Mbps between Sun 3/50s (the lowest we've seen is 7Mbps, the highest 9Mbps -- when using 100% of the wire bandwidth, the ethernet exponential backoff makes throughput very sensitive to the competing traffic distribution.) The throughput limit seemed to be the Lance chip on the Sun -- the CPU was showing 10-15% idle time. I don't believe the idle time number (I want to really measure the idle time with a uprocessor analyzer) but the interactive response on the machines was pretty good even while they were shoving 1MB/s at each other so I know there was some CPU left over.

Yes, I did crash most of our VMS vaxen while running throughput tests and no, this has nothing to do with Sun violating protocols -- the problem was that the DECNET designers failed to use common sense. I fired off a 1GB transfer to see if it would really finish in 20 minutes (it took 18 minutes) and halfway through I noticed that our VMS 780 was rebooting. When I later looked at the crash dump I found that it had run out of non-paged pool because the DEUNA queue was full of packets. It seems that whoever did the protocols used a *linear* backoff on the retransmit timer. With 20 DECNET routers trying to babble the state of the universe every couple of minutes, and my Suns keeping the wire warm in the interim, any attempt to access the ether was going to put a host into serious exponential backoff. Under these circumstances, a linear transport timer just doesn't cut it. So I found 25 retransmissions in the outbound queue for every active DECNET connection.
I know as little about VMS as possible so I didn't investigate why the machine had crashed rather than terminating the connections gracefully. I should also note that NFS on our other Sun workstations wasn't all that happy about waiting for the wire: As I walked around the building, every Sun screen was filled with "server not responding" messages. (But no Sun crashed -- I later shut most of them down to keep ND traffic off the wire while I was looking for the upper bound on xfer rate.)

I did run two simultaneous 100MB transfers between 4 3/50s and verified that they were gracious about sharing the wire. The total throughput was 7Mbps split roughly 60/40. The tcpdump trace of the two conversations has some holes in it (tcpdump can't quite hack a packet/millisecond, steady state) but the trace doesn't show anything weird happening.

Quite a bit of the speedup comes from an algorithm that we (`we' refers to collaborator Mike Karels and myself) are calling "header prediction". The idea is that if you're in the middle of a bulk data transfer and have just seen a packet, you know what the next packet is going to look like: It will look just like the current packet with either the sequence number or ack number updated (depending on whether you're the sender or receiver).

Combining this with the "Use hints" epigram from Butler Lampson's classic "Epigrams for System Designers", you start to think of the tcp state (rcv.nxt, snd.una, etc.) as "hints" about what the next packet should look like. If you arrange those "hints" so they match the layout of a tcp packet header, it takes a single 14-byte compare to see if your prediction is correct (3 longword compares to pick up the send & ack sequence numbers, header length, flags and window, plus a short compare on the length). If the prediction is correct, there's a single test on the length to see if you're the sender or receiver followed by the appropriate processing.
E.g., if the length is non-zero (you're the receiver), checksum and append the data to the socket buffer then wake any process that's sleeping on the buffer. Update rcv.nxt by the length of this packet (this updates your "prediction" of the next packet). Check if you can handle another packet the same size as the current one. If not, set one of the unused flag bits in your header prediction to guarantee that the prediction will fail on the next packet and force you to go through full protocol processing. Otherwise, you're done with this packet.

So, the *total* tcp protocol processing, exclusive of checksumming, is on the order of 6 compares and an add. The checksumming goes at whatever the memory bandwidth is so, as long as the effective memory bandwidth is at least 4 times the ethernet bandwidth, the cpu isn't a bottleneck. (Let me make that clearer: we got 8Mbps with checksumming on).

You can apply the same idea to outgoing tcp packets and most everywhere else in the protocol stack. I.e., if you're going fast, it's real likely this packet comes from the same place the last packet came from so 1-behind caches of pcb's and arp entries are a big win if you're right and a negligible loss if you're wrong.

In addition to the header prediction, I put some horrible kluges in the mbuf handling to get the allocation/deallocations down to 1 per packet. Mike Karels has been working in the same area and his clean code is faster than my kluges. As soon as this semester is over, I plan to merge Mike's and my versions then the two of us will probably make a pass at knocking off the biggest of the rough edges. Sometime in late spring or early summer we should be passing this code out to hardy souls for beta-testing (but ask yourself: do you really want workstations that routinely use 100% of the ethernet bandwidth? I'm pretty sure we don't and we're not running this tcp on any of our workstations.)
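
The receive-side fast path of header prediction described above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the code from the TCP discussed in this post: the names tcp_pred, pred_init, pred_receive, put32 and TH_POISON are hypothetical, the hint is kept as the 12 TCP header bytes from the sequence number through the window plus a separate 16-bit length, and checksumming, the socket-buffer append and the wakeup are left as a comment. Only the receiver case (non-zero length) is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PRED_OFF  4     /* TCP header offset of the sequence number */
#define PRED_LEN  12    /* seq (4) + ack (4) + offset/flags (2) + window (2) */
#define TH_ACK    0x10
/* Assumption: 0x80 in the flags byte is a bit peers never set. It was
 * reserved when the post was written (RFC 3168 later assigned it to CWR). */
#define TH_POISON 0x80

/* Hypothetical hint block laid out like the header bytes it predicts. */
struct tcp_pred {
    uint8_t  hdr[PRED_LEN]; /* expected header bytes, network byte order */
    uint16_t len;           /* expected data length of the next packet */
    uint32_t rcv_nxt;       /* host-order copy of the next expected seq */
};

/* Store a 32-bit value in network (big-endian) byte order. */
static void put32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
}

/* Seed the prediction from the packet just seen on a bulk transfer. */
static void pred_init(struct tcp_pred *p, uint32_t seq, uint32_t ack,
                      uint16_t win, uint16_t len)
{
    assert(p != NULL);
    put32(p->hdr, seq);
    put32(p->hdr + 4, ack);
    p->hdr[8]  = 0x50;                /* 5-word header, no options */
    p->hdr[9]  = TH_ACK;
    p->hdr[10] = (uint8_t)(win >> 8);
    p->hdr[11] = (uint8_t)win;
    p->len = len;
    p->rcv_nxt = seq;
}

/* Returns 1 if the fast path handled the packet, 0 if the caller must
 * fall back to full protocol processing. `space` is the free room in
 * the receive buffer before this packet is appended. */
static int pred_receive(struct tcp_pred *p, const uint8_t *tcp_hdr,
                        uint16_t len, size_t space)
{
    /* The 14-byte check: 12 header bytes plus the 2-byte length. */
    if (memcmp(tcp_hdr + PRED_OFF, p->hdr, PRED_LEN) != 0 || len != p->len)
        return 0;
    if (len == 0 || len > space)      /* sender side is not sketched here */
        return 0;

    /* Real code would checksum the data, append it to the socket
     * buffer and wake any sleeping reader at this point. */

    p->rcv_nxt += len;                /* update the prediction */
    put32(p->hdr, p->rcv_nxt);

    /* No room for another packet this size: poison the prediction so
     * the next packet is forced through the slow path. */
    if (space - len < len)
        p->hdr[9] |= TH_POISON;
    return 1;
}
```

A caller would invoke pred_receive first on every incoming segment of an established connection and run the ordinary input routine only when it returns 0; the slow path is then responsible for re-seeding the hint with pred_init.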

Some of the impetus for this work came from Greg Chesson's statement at the Phoenix USENIX that `the way TCP and IP are designed, it's impossible to make them go fast'. On hearing this, I leapt to my feet to protest but decided that saying "I think you're wrong" wasn't going to change anybody's mind about anything. Now I can say that I'm pretty sure he was wrong about TCP (but that's *not* a comment on the excellent work he's doing and has done on the XTP and the Protocol Engine).

The header prediction algorithm evolved during attempts to make my 2400-baud SLIP dial-up send 4 bytes when I typed a character rather than 44. After staring at packet streams for a while, it was pretty obvious that the receiver could predict everything about the next packet on a TCP data stream except for the data bytes. Thus all the sender had to ship in the usual case was one bit that said "yes, your prediction is right" plus the data.

I mention this because funding agents looking for high speed, next-generation networks may forget that research to make slow things go fast sometimes makes fast things go faster.

 - Van