From: van@HELIOS.EE.LBL.GOV (Van Jacobson)
Newsgroups: comp.protocols.tcp-ip
Subject: some interim notes on the bsd network speedups
Message-ID: <8807200426.AA01221@helios.ee.lbl.gov>
Date: 20 Jul 88 04:26:17 GMT

I told the potential beta-testers for our new 4bsd network code that
I hoped to have a version of the code out by the end of July.  (BTW,
we've got all the beta testers we can handle -- please don't apply.)
It looks like that's going to turn into the end of August, in part
because of SIGCOMM and in part because Sun puts sending source to
academic customers at the bottom of its priority list.  I thought I'd
flame about the latter and give a bit of a status report on the new
code.

I've been trying to put together a beta distribution for the "header
prediction" bsd network code.  This effort has been stymied because
it's impossible to get current source from Sun.  The code involves
major changes to the 4.3bsd kernel.  The only local machines I can
use to develop new kernels are Suns -- everything else is either
multi-user or has pathetic ethernet hardware.  But the only Sun
kernel source I've got is the doubly obsolete Sun OS 3.5/4.2 BSD.  It
would be a massive waste of time to upgrade this system to 4.3 BSD
just so I can develop the next BSD -- Bill Nowicki did the upgrade a
year ago and binaries of the new system (totally worthless to me) are
shipping as Sun OS 4.0.

[I'm not the only one suffering this problem -- I understand Craig
Partridge's multicast work is suffering because he can't get
4.3-compatible Sun source.  I think he gave up & decided to do all
his development on 4.3bsd Vaxen.  And I think I heard Chuck Hedrick
say 4.0 has all the rlogin, URG and nameserver bugs that we fondly
remember fixing in 3.x.  And he has to get source before the academic
year starts or they won't be able to switch until a semester break.
And Mike Karels is saying "I told you so" and suggesting I buy some
CCIs.  Pity that Sun can't figure out that it's in their best
interest to do TIMELY source distribution to the academic and
research community -- their software development base gets expanded a
few hundred-fold for the cost of making tapes.]

Anyway, now that I've vented my spleen, there are some interim
results to talk about.  While waiting for either useful source or a
better hardware platform to show up, I've been cleaning up my
original mods and backing out changes one and two at a time to gauge
their individual effect.  Because network performance seems to rest
on getting a lot of things happening in parallel, this leave-one-out
testing doesn't give simple good/bad answers.  (I had one test case
that went like:

	Basic system:		600 KB/s
	add feature A:		520 KB/s
	drop A, add B:		530 KB/s
	add both A & B:		700 KB/s

Obviously, any statement of the form "feature A/B is good/bad" is
bogus.)  But, in spite of the ambiguity, some of the network design
folklore I've heard seems to be clearly wrong.

In the process of cleaning things up, they slowed down.  Task-to-task
data throughput using TCP between two Sun 3/60s dropped from 1 MB/s
(about 8.6 Mb/s on the wire) to 890 KB/s (about 7.6 Mb/s on the
wire).  I know where the 11% was lost (an interaction between
"sosend" and the fact that an AMD LANCE chip requires at least 100
bytes in the first chunk of data if you ask it to chain -- massive
braindamage on AMD's part) and how to get it back (re-do the way user
data gets handed to the transport protocol) but need to talk with
Mike Karels about the "right" way to do this.
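(For the curious, the LANCE constraint is roughly this.  The fragment
below is an illustrative sketch, not Sun's if_le.c -- the function
name and the simplified mbuf are invented, and it leans on the
m_pullup routine described later in this note:)

    #define MIN_FIRST_CHUNK 100     /* LANCE minimum for a chained xmit */

    struct mbuf { int m_len; struct mbuf *m_next; char *m_data; };
    extern struct mbuf *m_pullup(struct mbuf *m, int len);

    /*
     * Sketch: before queueing a transmit chain to the LANCE, make
     * sure the first chunk is at least 100 bytes.  m_pullup copies
     * data into a fresh mbuf, so satisfying the chip this way costs
     * a copy on every packet -- one plausible shape of the sosend
     * interaction mentioned above.
     */
    struct mbuf *
    lance_fix_first_chunk(struct mbuf *m)
    {
            if (m->m_next != 0 && m->m_len < MIN_FIRST_CHUNK)
                    m = m_pullup(m, MIN_FIRST_CHUNK);
            return (m);
    }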
Still, 890 KB/s represents a non-trivial improvement over the stock
Sun/4bsd system:  Under identical test conditions (same socket & user
buffer sizes, same MSS and MTU, same machines), the best tcp
throughput I could get with an out-of-the-box Sun OS 3.5 was 380
KB/s.  I wish I could say "make these two simple changes and your
throughput will double" but I can't.  There were lots and lots of
fiddly little changes, none made a huge difference and they all
seemed to be necessary.

The biggest single effect was a change to sosend (the routine between
the user "write" syscall and tcp_output).  Its loop looked something
like:

	while there is user data & space in the socket buffer
		copy from user space to socket
		call the protocol "send" routine

After hooking a scope to our ethernet cable & looking at the packet
spacings, I changed this to:

	while there is user data & space in the socket buffer
		copy up to 1K (one cluster's worth) from user
		    space to socket
		call the protocol "send" routine

and the throughput jumped from 380 to 456 KB/s (+20%).  (A sketch of
what this amounts to in C appears below.)

There's one school of thought that says the first loop was better
because it minimized the "boundary crossings", the fixed costs of
routine calls and context changes.  This same school is always
lobbying for "bigger":  bigger packets, bigger windows, bigger
buffers, for essentially the same reason:  the bigger the chunks are,
the fewer boundary crossings you pay for.  The correct school, mine
:-), says there's always a fixed cost and a variable cost (e.g., the
cost of maintaining tcp state and tacking a tcp packet header on the
front of some data is independent of the amount of data; the cost of
filling in the checksum field in that header scales linearly with the
amount of data).  If the size is large enough to make the fixed cost
small compared to the variable cost, making things bigger LOWERS
throughput because you throw away opportunities for parallelism.

Looking at the ethernet, I saw a burst of packets, a long dead time,
another burst of packets, ... .  It was clear that while we were
copying 4 KB from the user, the processor in the LANCE chip and
tcp_input on the destination machine were mostly sitting idle.

To get good network performance, where there are guaranteed to be
many processors that could be doing things in parallel, you want the
"units of work" (loop sizes, packet sizes, etc.) to be the SMALLEST
values that amortise the fixed cost.  In Berkeley Unix, the fixed
costs of protocol processing are pretty low and sizes of 1 - 2 KB on
a 68020 are as large as you'd want to get.  (This is easy to
determine.  Just do a throughput vs. size test and look for the knee
in the graph.  Best performance is just to the right of the knee.)
And, obviously, on a faster machine you'd probably want to do things
in even smaller units (if the overhead stays the same -- Crays are
fast but hardware strangeness drives the fixed costs way up.  Just
remember that if it takes large packets and large buffers to get good
performance on a fast machine, that's because it's broken, not
because it's fast.)
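(Here's roughly what the second loop amounts to in C.  This is an
illustrative sketch, not the real 4.3bsd sosend -- the helper names
sb_space, copyin_chunk and proto_send are made up, and all the mbuf
and uio bookkeeping is elided:)

    #define CLBYTES 1024            /* one cluster's worth per iteration */

    extern int sb_space(void);      /* room left in the socket buffer */
    extern int copyin_chunk(char *dst, const char *src, int len);
    extern int proto_send(const char *buf, int len);  /* e.g. tcp_output */

    int
    send_loop(const char *udata, int resid)
    {
            char chunk[CLBYTES];
            int len, error;

            while (resid > 0 && sb_space() > 0) {
                    /*
                     * Copy at most one cluster, then immediately hand
                     * it to the protocol.  The LANCE and the remote
                     * tcp_input can now run in parallel with the next
                     * copy, instead of idling while we move 4 KB.
                     */
                    len = resid > CLBYTES ? CLBYTES : resid;
                    if ((error = copyin_chunk(chunk, udata, len)) != 0)
                            return (error);
                    if ((error = proto_send(chunk, len)) != 0)
                            return (error);
                    udata += len;
                    resid -= len;
            }
            return (0);
    }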
Another big effect (difficult to quantify because several changes had
to be made at once) was to remove lots of more-or-less hidden data
copies from the protocol processing.  It's a truism that you never
copy data in network code.  Just lay the data down in a buffer & pass
around pointers to appropriate places in that buffer.

But there are lots of places where you need to get from a pointer
into the buffer back to a pointer to the buffer itself (e.g., you've
got a pointer to the packet's IP header, there's a header checksum
error, and you want to free the buffer holding the packet).  The
routine "dtom" converts a data pointer back to a buffer pointer but
it only works for small things that fit in a single mbuf (the basic
storage allocation unit in the bsd network code).  Incoming packets
are never in an mbuf; they're in a "cluster" which the mbuf points
to.  There's no way to go from a pointer into a cluster to a pointer
to the mbuf.  So, anywhere you might need to do a dtom (anywhere
there's a detectable error), there had to be a call to "m_pullup" to
make sure the dtom would work.  (M_pullup works by allocating a fresh
mbuf, copying a bunch of data from the cluster to this mbuf, then
chaining the original cluster behind the new mbuf.)

So, we were doing a bunch of copying to expedite error handling.  But
errors usually don't happen (if you're worried about performance,
first you make sure there are very, very few errors), so we were
doing a bunch of copying for nothing.  But, if you're sufficiently
insomniac, in about a week you can track down all the pullups
associated with all the dtoms and change things to avoid both.  This
requires massive recoding of both the TCP and IP re-assembly code.
But it was worth it:  TCP throughput climbed from around 600 KB/s to
750 KB/s and IP forwarding just screamed:  A 3/60 forwarding packets
at the 9 Mb/s effective ethernet bandwidth used less than 50% of the
CPU.

[BTW, in general I didn't find anything wrong with the BSD
mbuf/cluster model.  In fact, I tried some other models (e.g., do
everything in packet sized chunks) and they were slower.  There are
many cases where knowing that you can grab an mbuf and chain it onto
a chunk of data really simplifies the protocol code (simplicity ==
speed).  And the level of indirection and fast, reference counted
allocation of clusters can really be a win on data transfers (a la
kudp_fastsend in Sun OS).  The biggest problem I saw, other than the
m_pullups, was that clusters are too small:  They need to be at least
big enough for an ethernet packet (2K) and making them page sized (8K
on a Sun) doesn't hurt and would let you do some quick page swap
tricks in the user-system data copies (I didn't do any of these
tricks in the fast TCP.  I did use 2KB clusters to optimize things
for the ethernet drivers).]
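(The flavor of the change, sketched in C.  This isn't the actual
4.3bsd code -- the structures are simplified and the function names
invented -- but it shows why the pullup was there and how carrying
the mbuf pointer along makes both the pullup and the dtom
unnecessary:)

    struct mbuf { int m_len; struct mbuf *m_next; char *m_data; };

    extern struct mbuf *m_pullup(struct mbuf *m, int len);
    extern struct mbuf *dtom(char *p);  /* only works for data IN an mbuf */
    extern void m_freem(struct mbuf *m);
    extern int cksum_bad(char *iphdr);

    /*
     * Old way: copy the header out of the cluster into a plain mbuf
     * (m_pullup) on EVERY packet, just so dtom() works in the rare
     * error path.
     */
    void
    ip_input_old(struct mbuf *m)
    {
            char *ip;

            if ((m = m_pullup(m, 20)) == 0)
                    return;
            ip = m->m_data;
            if (cksum_bad(ip)) {
                    m_freem(dtom(ip));      /* dtom needs the pullup */
                    return;
            }
            /* ... normal processing ... */
    }

    /*
     * New way: keep the mbuf pointer alongside the data pointer so
     * the error path can free the buffer directly.  No pullup, no
     * dtom, no copy in the common case.
     */
    void
    ip_input_new(struct mbuf *m)
    {
            char *ip = m->m_data;           /* may point into a cluster */

            if (cksum_bad(ip)) {
                    m_freem(m);
                    return;
            }
            /* ... normal processing ... */
    }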
An interesting result of the m_pullup removals was the death of
another piece of folklore.  I'd always heard that the limiting CPU
was the receiver's.  Wrong.  After the pullup changes, the sender
would be maxed out at 100% CPU utilization while the receiver loafed
along at 65-70% utilization (utilizations measured with a
microprocessor analyzer; I don't trust the system's stats).  In
hindsight, this seems reasonable.  At the receiver, a packet comes
in, wanders up to the tcp layer, gets stuck in the socket buffer and
an ack is generated (i.e., the processing terminates with tcp_input
at the socket buffer).  At the sender, the ack comes in, wanders up
to the tcp layer, frees some space, then the higher level socket
process has to be woken up to fill that space (i.e., the processing
terminates with sosend, at the user socket layer).  The way Unix
works, this forces a boundary crossing between the bottom half
(interrupt service) and top half (process context) of the kernel.  On
a Sun, and most of the other Unix boxes I know of, this is an
expensive crossing.

[Of course, the user process on the receiver side has to eventually
wake up and empty the socket buffer but it gets to do that
asynchronously and the dynamics tend to arrange themselves so it
processes several packets on each wakeup, minimizing the boundary
crossings.]

Talking about the bottom half of the kernel reminds me of another
major effect.  There seemed to be a "sound barrier" at 550 KB/s.  No
matter what I did, the performance stuck at 550 KB/s.  Finally, I
noticed that Sun's LANCE ethernet driver, if_le.c, would only queue
one packet to the LANCE at a time.  Picture what this means:

	(1) driver hands packet to LANCE
	(2) LANCE puts packet on wire
	(3) end of packet, LANCE interrupts processor
	(4) interrupt dispatched to driver
	(5) go back to (1)

The time involved in (4) is non-trivial, more than 150us, and
represents a lot of idle time for the LANCE and the wire.  So, I
rewrote the driver to queue an arbitrary number of packets to the
LANCE.  The sound barrier disappeared and other changes started
making the throughput climb (it's not clear whether this change had
any effect on throughput itself or just allowed other changes to have
an effect).

[Shortly after making the if_le change, I learned why Sun might have
written the driver the silly way they did:  Before the change, the 6
back-to-back IP fragments of an NFS write would each be separated by
the 150us interrupt service time.  After the change, they were really
back-to-back, separated by only the 9.6us minimum ethernet spacing.
(And, no, Sun does NOT violate the ethernet spec in any way, shape or
form.  After my last message on this stuff, Apollo & DEC people kept
telling me Sun was out of spec.  I've been watching packets on our
ethernet for so long, I'm starting to learn the middle name of every
bit.  Sun bits look like DEC bits and Sun, or rather, the LANCE in
the 3/50 & 3/60, completely complies with the timings in the blue
book.)  Anyway, the brain-dead Intel 82586 ethernet chip Sun puts in
all its 3/180, 3/280 and Sun-4 file servers can't hack back-to-back,
minimum-spacing packets.  Every now and again it drops some of the
frags and wanders off to never-never land ("iebark reset").  Diskless
workstations don't work well when they can't write to their file
server and, until I hit on the idea of inserting "DELAY" macros in
kudp_fastsend, it looked like I could have a fast TCP or a functional
workstation but not both.]

Probably 30% of the performance improvements came from fixing things
in the Sun kernel.  I mean, like, really, guys:  If the current task
doesn't change, and it doesn't 80% of the time swtch is called,
there's no need to do a full context save and restore.  Adding the
two lines

	cmpl	_masterprocp,a0
	jeq	6f			| restore of current proc is easy

just before the call to "resume" in sun3/vax.s:swtch got me a quick
70 KB/s performance increase but felt more like a bug fix than
progress.

And a kernel hacker that does 16-bit "movw"s and "addw"s on a 68020,
or writes 2-instruction dbra loops designed to put a 68010 in loop
mode, should be shot.  The alu takes the same 3 clocks for a 2-byte
add or a 4-byte add so things will finish a lot quicker if you give
it 4 bytes at a time.  And every branch stalls the pipe, so unrolling
that loop to cut down on branches is a BIG win.  Anyway, I recoded
the checksum routine, ocsum.s (now oc_cksum.s because I found the old
calling sequence wasn't the best way to do things) and its caller,
in_cksum.c, and got the checksumming time down from 490us/KB to
130us/KB.
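(The same idea in C, for the curious.  This is a sketch of the
technique -- 4-byte adds and an unrolled loop -- not a transcription
of oc_cksum.s; the name and interface are invented, len is in bytes
and, to keep the sketch short, assumed to be a multiple of 16:)

    #include <stdint.h>

    uint16_t
    oc_sum_sketch(const uint32_t *p, int len)
    {
            uint64_t sum = 0;

            /*
             * 4-byte adds, 16 bytes per branch: the alu takes the
             * same time for a 2-byte or a 4-byte add, and every
             * branch stalls the pipe, so unrolling wins twice.
             */
            while (len >= 16) {
                    sum += p[0];
                    sum += p[1];
                    sum += p[2];
                    sum += p[3];
                    p += 4;
                    len -= 16;
            }

            /* fold the carries back in to get the 16-bit
             * ones-complement sum */
            while (sum >> 16)
                    sum = (sum & 0xffff) + (sum >> 16);
            return ((uint16_t)sum);
    }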
Unrolling the move loop in copyin/copyout (the routines that move
data user to kernel and kernel to user) got them down from 200us/KB
to 140us/KB.  (BTW, if you combine the move with the checksum, which
I did in the kludged-up, fast code that ran 1 MB/s on a 15MHz 3/50,
it costs 200us/KB, not the 300us/KB you'd expect from adding the move
and sum times.  Pipelined processors are strange and wonderful
beasts.)

From these times, you can work out most of the budgets and processing
details:  I was using packets with 1408 data bytes (don't ask why
1408).  It takes 193us to copy a packet's worth of data and 184us to
checksum the packet and its 40-byte header.  From the logic analyzer,
the LANCE uses 300us of bus and memory bandwidth reading or writing
the packet (I don't understand why; it should only take half this).
So, the variable costs are around 700us per packet.  When you add the
18-byte ethernet header and 12-byte interpacket gap, to run at 10
Mb/s I'd have to supply a new packet every 1200us.  I.e., there's a
budget of 500us for all the fixed costs (protocol processing,
interrupt dispatch, device setup, etc.).  This is do-able (I've done
it, but not very cleanly) but what I'm getting today is a packet
every 1500us, i.e., 800us per packet of fixed costs.  When I look
with our analyzer, 30% of this is TCP, IP, ARP and ethernet protocol
processing (this was 45% before the "header prediction" tcp mods),
15% is stalls (idle time that I don't currently understand but should
eventually be able to get rid of) and 55% is device setup, interrupt
service and task handling.  I.e., protocol processing is negligible
(240us per packet on this processor, and this isn't a fast processor
in today's terms).

To make the network go faster, it seems we just need to fix the
operating system parts we've always needed to fix:  I/O service,
interrupts, task switching and scheduling.  Gee, what a surprise.

 - Van