From van@lbl-csam.arpa  Wed Jan 27 01:01:57 1988
Received-Date: Wed, 27 Jan 88 01:01:57 PST
Received: from vaxa.isi.edu by venera.isi.edu (5.54/5.51)
	id AA01117; Wed, 27 Jan 88 01:01:57 PST
Posted-Date: Tue, 26 Jan 88 22:13:39 PST
Received: from [26.1.0.34] by vaxa.isi.edu (5.54/5.51)
	id AA01155; Tue, 26 Jan 88 22:16:29 PST
Received: by lbl-csam.arpa (5.58/1.18)
	id AA21820; Tue, 26 Jan 88 22:13:41 PST
Message-Id: <8801270613.AA21820@lbl-csam.arpa>
To: dab%oliver.CRAY.COM@uc.msc.umn.edu (Dave Borman)
Cc: karels@okeeffe.berkeley.edu, kilian%oliver.CRAY.COM@uc.msc.umn.edu,
        santa%oliver.CRAY.COM@uc.msc.umn.edu, end2end-interest@venera.isi.edu
Subject: Re: Large TCP windows 
In-Reply-To: Your message of Mon, 25 Jan 88 13:49:03 CST.
Date: Tue, 26 Jan 88 22:13:39 PST
From: Van Jacobson <van@lbl-csam.arpa>
Status: R

Dave -

Bob Braden and I are still writing an RFC that will talk about
the large window option & some other options needed for paths
with a large bandwidth-delay product.  I can tell you what the
current thinking is but it won't be official until the RFC comes
out and Jon Postel assigns a number to the option. 

The window scaling proposal is actually due to Mike Karels in an
"it doesn't have to be that complicated" response to several
proposals I made.  The problem is that you have to make sure that
the other side implements scaling, that it has reliably received
your scale factor and that you have reliably received its scale
factor.  Since options are associated with a packet and tcp
delivers data reliably but not packets, it turns out to be
complicated to communicate a scale factor in the packet stream. 

The one packet that is sent reliably is a SYN packet.  That
suggests that each side should communicate a scaling value as an
initial option.  This has the possible disadvantage that you must
communicate your scale factor when the conversation is initiated,
before you know whether or not you'll need to scale the window.
(This is an issue because scaling reduces the control resolution
you can exercise via your window updates.  E.g., if you announce
a scale factor of 8 bits, you can advertise a 256 byte window or
a 0 byte window but nothing in between.  But, after considering what
our silly window code goes through, we decided this limited
resolution was a feature, not a bug). 
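
To make the resolution point concrete, here is a minimal sketch in C of
what the other side can actually see once a shift count is in effect.
The function names are illustrative only and are not taken from any
implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest non-zero window that can be expressed with a given shift count. */
uint32_t window_step(unsigned shift)
{
    assert(shift < 32);
    return (uint32_t)1 << shift;
}

/* The window the peer reconstructs after the sender shifts right and the
 * receiver shifts left: the low `shift` bits are lost, so the result is
 * truncated down to a multiple of window_step(shift). */
uint32_t visible_window(uint32_t true_window, unsigned shift)
{
    assert(shift < 32);
    return (true_window >> shift) << shift;
}
```

With a shift of 8, visible_window(255, 8) is 0 and visible_window(256, 8)
is 256, which is the "nothing in between" case described above.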

So, a model implementation is the following:  The "other side's
window" field in the per-connection state must be expanded to a
32-bit field (if it isn't that big already).  [The "my advertised
window" field should probably also be expanded to 32 bits but doesn't
need to be if you never plan to offer a non-zero scale factor.]
There are two window scale factors added to the state, sndscale
and rcvscale.  These are *shift counts* to be applied to outgoing
and incoming windows, respectively.  (With a footnote that the
window in SYN packets is never scaled.)  Sndscale and rcvscale are
initialized to zero.  All your outgoing SYN packets contain a
<Scale> option announcing the scaling you wish applied to your
windows.  (<Scale> is a 3 byte option: 1 byte of "kind" (TBD),
one byte of length (=3) and one byte of shift count.)
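
The 3-byte layout can be sketched in C as below.  This is only an
illustration: the function names are made up, and the option "kind" is
passed in as a parameter because its value is still TBD at the time of
this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCALE_OPT_LEN 3   /* kind + length + shift count */

/* Write a <Scale> option into out[0..2]; returns the number of bytes written.
 * `kind` is supplied by the caller since the option number is not assigned. */
size_t scale_opt_encode(uint8_t kind, uint8_t shift, uint8_t out[SCALE_OPT_LEN])
{
    assert(out != NULL);
    out[0] = kind;
    out[1] = SCALE_OPT_LEN;
    out[2] = shift;
    return SCALE_OPT_LEN;
}

/* Parse one option starting at opt[0], with `avail` bytes available.
 * Returns 1 and stores the shift count if it is a well-formed <Scale>
 * option of the given kind; returns 0 otherwise. */
int scale_opt_parse(uint8_t kind, const uint8_t *opt, size_t avail, uint8_t *shift)
{
    assert(opt != NULL && shift != NULL);
    if (avail < SCALE_OPT_LEN || opt[0] != kind || opt[1] != SCALE_OPT_LEN)
        return 0;
    *shift = opt[2];
    return 1;
}
```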

Sndscale and rcvscale can be changed only when you are processing
the other side's SYN packet.  If that packet contains a <Scale n>
option, set rcvscale to n and set sndscale to the scale factor you
announced (or will announce) in your SYN packet.  Otherwise
*both* rcvscale and sndscale are left zero.  I.e., both sides have
to implement scaling for windows to be scaled. 

All incoming packets (with the exception of SYN packets) have
their window left shifted by rcvscale.  Using rfc793 names and C:
	snd.wnd = seg.wnd << rcv.scale

All outgoing packets (with the exception of SYN packets) have
their window right shifted by sndscale:
	seg.wnd = rcv.wnd >> snd.scale
Note that the offered window is (deliberately) truncated, not
rounded.
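
The two shifts, including the SYN exception and the truncation, can be
sketched as a pair of C helpers.  The names are illustrative, and the
clamp to 0xFFFF when the shifted value does not fit the 16-bit header
field is an assumption of this sketch, not something the message
specifies.

```c
#include <assert.h>
#include <stdint.h>

/* Window learned from an incoming segment's 16-bit window field.
 * SYN segments are never scaled. */
uint32_t window_from_segment(uint16_t seg_wnd, unsigned rcvscale, int is_syn)
{
    assert(rcvscale < 16);
    return is_syn ? (uint32_t)seg_wnd : (uint32_t)seg_wnd << rcvscale;
}

/* 16-bit window field to place in an outgoing segment.
 * SYN segments are never scaled. */
uint16_t window_for_segment(uint32_t wnd, unsigned sndscale, int is_syn)
{
    uint32_t field;

    assert(sndscale < 16);
    field = is_syn ? wnd : wnd >> sndscale;  /* right shift truncates, never rounds */
    if (field > 0xFFFFu)                     /* assumption: clamp if it will not fit */
        field = 0xFFFFu;
    return (uint16_t)field;
}
```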

The value of the <Scale> option may be zero (this would let the
other side offer you scaled windows but leave your windows
unscaled).  Because there are problems with sequence space
wraparound when the window size exceeds 2^29, the largest
possible scale factor is 13 (29 - 16).  (This allows for up to
500MB windows.  This number will undoubtedly be inadequate
shortly but to raise it we'll have to figure out a way to expand
the sequence space [scaling sequence numbers immediately springs
to mind].  We probably have a couple of months to test things
before we run into the 500MB limit so we should defer this
problem.)  We haven't discussed what to do if a scale factor
larger than 13 is received.  Alternatives I can think of are:

  a) log the error and set rcvscale to 13.
  b) treat it as an unrecoverable protocol error and RST the other side.
  c) log the error and set rcvscale to zero.

(The choices are given in order of my preference.  (c) is last
because I think it could lead to very silly windows if the other
side expected you to be scaling and because I expect scaling will
usually be implemented in tcp's that do some sort of dynamic
window algorithm and, therefore, won't destroy the network just
because the other side announced a 500MB window.  (a) is first
because it lets old and new implementations interoperate in the
event we figure a way around the 500MB limit.)

We would probably suggest that <Scale> be "as small as possible"
(5 or 6 in your case of 1MB packets) unless you know exactly what
you're doing (as an aside, at LBL we believe anyone claiming to
know exactly what they're doing is in serious need of psychiatric
help) and have the utmost confidence in your congestion control
algorithms (same aside as before). 

Under 4bsd, we will probably add a route entry field that enables
scaling and can be used to determine an appropriate scale factor
(e.g., a "pipesize" field that any protocol would be free to
interpret as a best-guess of the bandwidth-delay product for this
route -- we've already added fields for MTU, RTT, RTT variance
and a few other "path characteristics").  This would give system
administrators per-network and/or per-host control over the
window size used.  The default would be "no scaling" which means
you would *not* offer a <Scale> option to the other side.  I.e.,
both ends have to agree that scaling should be used over a
network before it is used.  This default is debatable but I think
it's conservative (I don't want cretins that decide to use 10MB
windows over the Arpanet to unilaterally be able to force my
machine to participate in the heinous crime). 
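
One way a "pipesize" route field might be turned into a shift count is
sketched below in C.  This is an assumption-laden illustration, not the
4bsd code: the function name is invented, a pipesize of zero is taken to
mean "field unset, do not offer <Scale>", and "as small as possible" is
read as the smallest shift that lets the pipesize fit in the 16-bit
window field.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SHIFT 13   /* largest shift count allowed by the 2^29 limit */

/* Returns -1 if pipesize is 0 (do not offer the option at all).
 * Otherwise returns the smallest shift s in 0..13 such that
 * pipesize >> s fits in 16 bits, or 13 if no such shift exists. */
int choose_shift(uint32_t pipesize)
{
    int s = 0;

    if (pipesize == 0)
        return -1;
    while (s < MAX_SHIFT && (pipesize >> s) > 0xFFFFu)
        s++;
    assert(s >= 0 && s <= MAX_SHIFT);
    return s;
}
```

For a pipesize of exactly 1MB (2^20 bytes) this picks 5, since 2^20 >> 4
is 65536 and just misses the 16-bit field.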

 - Van

ps:  I'm very interested in hearing more about the performance
  you're getting.  I don't really understand your statement that
  "Obviously, the limiting factor here is the overhead of the protocol."
  Which protocol?  In a not-yet-released bsd implementation, the
  tcp & ip protocol overhead is damn near zero (a checksum, a few
  compares then the data copy).  With a little care in the checksum
  and copy, even a Sun-3/50 driving an ethernet at full bandwidth
  (~ 8Mb/s) loafs along at 20-30% cpu utilization (since the net
  bandwidth is 10% of the machine's memory bandwidth, this is
  within a factor of two of the absolute minimum utilization,
  assuming that you don't violate the protocol by disabling checksums
  and that the network forces packets to be smaller than the page size).

  You said 30KB packets gave 60Mb/s and turning off checksumming
  increased that to 70Mb/s.  If I make a wild guess that you sum
  8 bytes at a time and don't use the vector pipeline, then removing
  4096 memory reference instructions made only a 15% difference.
  If I guess that there's a copy involved in the packet processing
  (a load and a store), that accounts for another 30%.  That
  leaves 55% or about 15,000 memory-reference-instruction-equivalents
  as the per-packet processing overhead (i.e., the overhead that's
  independent of packet length).
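
The back-of-envelope estimate above can be reproduced with a few lines
of integer arithmetic in C.  The struct and function names are invented
for this sketch; the inputs (4096 checksum references, a 15% checksum
share, a copy costing twice the checksum) are the guesses stated in the
text, not measurements.

```c
#include <assert.h>
#include <stdint.h>

struct overhead_estimate {
    uint32_t total_refs;     /* whole per-packet cost, in memory-reference units */
    uint32_t checksum_refs;  /* the guessed checksum cost */
    uint32_t copy_refs;      /* one load plus one store per word summed */
    uint32_t fixed_refs;     /* what is left: length-independent overhead */
};

/* checksum_refs: memory references spent checksumming one packet.
 * checksum_pct:  share of per-packet time attributed to the checksum.
 * The copy is assumed to cost twice the checksum (a load and a store). */
struct overhead_estimate estimate_overhead(uint32_t checksum_refs,
                                           uint32_t checksum_pct)
{
    struct overhead_estimate e;
    uint32_t copy_pct = 2 * checksum_pct;

    assert(checksum_pct > 0 && checksum_pct + copy_pct < 100);
    e.checksum_refs = checksum_refs;
    e.total_refs = checksum_refs * 100 / checksum_pct;
    e.copy_refs = 2 * checksum_refs;
    e.fixed_refs = e.total_refs * (100 - checksum_pct - copy_pct) / 100;
    return e;
}
```

estimate_overhead(4096, 15) gives a total of 27306 units, of which 15018
are left over as fixed per-packet overhead, i.e. the "about 15,000"
figure.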

  Particularly given our recent experience, I can't believe that the
  tcp & ip protocol processing is taking all that time.  In my younger
  days I was forced to do some hacking on the Hyperchannel attached
  to our CDC-7600.  It's been a long time but I vaguely recall that
  the thing was, shall we say, intractable.  Is that where the time
  goes or is it into context switching, a long media-to-processor
  pipeline, or something else?  (Or did I botch my arithmetic?)