From van@lbl-csam.arpa  Sun Dec  6 01:52:51 1987
Posted-Date: Sun, 06 Dec 87 01:54:34 PST
Received-Date: Sun, 6 Dec 87 01:52:51 PST
Received: from LBL-CSAM.ARPA by venera.isi.edu (5.54/5.51)
        id AA04300; Sun, 6 Dec 87 01:52:51 PST
Received: by lbl-csam.arpa (5.58/1.18)
        id AA11270; Sun, 6 Dec 87 01:54:36 PST
Message-Id: <8712060954.AA11270@lbl-csam.arpa>
To: end2end-interest@venera.isi.edu
Cc: karels@okeeffe.berkeley.edu
Subject: a tcp wide-window option
Date: Sun, 06 Dec 87 01:54:34 PST
From: Van Jacobson <van@lbl-csam.arpa>

I'm ready to write an rfc on a tcp option to allow windows
bigger than 64kb but I'd like some opinions on alternative
implementations first.  In all schemes, I envision an initial
option that tells the other side you can handle scaled windows in
incoming packets.  Then I see three alternative implementations
(given in order of my preference):

1. If both sides say they can handle scaled windows, the 
   window field is interpreted as an exponent & magnitude
   (described below).  Otherwise the window would be
   interpreted as it is now.

2. If the other side has said it can handle scaled windows,
   at any time you can send a non-initial option that says
   "multiply the window in this packet and all future packets
   by this 16 bit quantity to get my real window".

3. If the other side has said it can handle scaled windows,
   at any time you can send a non-initial option that says
   "multiply the window in this packet by this 16 bit quantity
   to get my real window".
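
The difference between alternatives 2 and 3 is only whether the
multiplier is remembered.  A minimal sketch in C of how the side
receiving the option might compute the real window under each;
the struct, function and parameter names are hypothetical, the
multiplier is assumed to default to 1 before any option arrives,
and nothing here clamps the result to what the sequence space
allows:

```c
#include <stdint.h>

/* Hypothetical per-connection state for alternative 2. */
struct peer_window_state {
    uint16_t sticky_mult;   /* last multiplier received; assumed to start at 1 */
};

/*
 * Alternative 2: a multiplier carried in an option applies to this
 * packet and to every later packet until another option replaces it.
 * has_opt is nonzero when the packet carried the option.
 */
uint32_t real_window_alt2(struct peer_window_state *st, uint16_t field,
                          int has_opt, uint16_t opt_mult)
{
    if (has_opt)
        st->sticky_mult = opt_mult;
    return (uint32_t)field * st->sticky_mult;
}

/*
 * Alternative 3: a multiplier applies only to the packet that
 * carries it; a packet without the option is taken unscaled.
 */
uint32_t real_window_alt3(uint16_t field, int has_opt, uint16_t opt_mult)
{
    return (uint32_t)field * (has_opt ? opt_mult : 1u);
}
```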


Option 1 in more detail:
------------------------
Although it's hard to envision the need for gigabyte windows, we
should probably anticipate the hardware types by designing a
mechanism to specify windows as large as the sequence space
permits.  With a 32 bit sequence number, things start to break
when

        2*max.win >= 2^31

so the largest window we could permit without expanding the
sequence space is 2^30 - 1 (1 gigabyte).  It would be nice if 
no scaling were necessary for the "usual" case, which probably
lies somewhere between 2kb and 16kb.  Let's say we want 16kb
(2^14) to be the unscaled limit.  That leaves us 2 bits for
an exponent (values 0 through 3) and the base N of that exponent
has to cover a range of 2^30/2^14 = 2^16.  Thus, N^3 = 2^16 or
N ~= 2^5.

So, if we interpret the top 2 bits of the window field as an
exponent e and the bottom 14 bits as an unsigned mantissa m
(with the binary point to the right of the lsb), we calculate
the window as

        32^e * m

(i.e., we shift the bottom 14 bits left by 0, 5, 10 or 15 if
the top two bits are 0, 1, 2 or 3 respectively).  That lets
the receiver specify windows up to 2^29 - 2^15 bytes (about 512mb).
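
As a sketch of the interpretation just described (the function and
macro names are made up for illustration, and this assumes both
sides have already agreed to scale), decoding the 16 bit window
field might look like:

```c
#include <stdint.h>

#define WIN_MANT_BITS  14        /* low 14 bits: unsigned mantissa m */
#define WIN_MANT_MASK  0x3FFFu
#define WIN_BASE_SHIFT 5         /* base 32 = 2^5 */

/* Return 32^e * m, where e is the top 2 bits and m the bottom 14. */
uint32_t decode_scaled_window(uint16_t field)
{
    unsigned e = field >> WIN_MANT_BITS;      /* exponent, 0..3 */
    uint32_t m = field & WIN_MANT_MASK;       /* mantissa, 0..16383 */

    return m << (WIN_BASE_SHIFT * e);         /* shift by 0, 5, 10 or 15 */
}
```

With all 16 bits set this returns (2^14 - 1) * 2^15 = 2^29 - 2^15,
the maximum quoted above; with the top two bits clear it returns the
field unchanged.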


Random Comments
---------------
It's a shame to throw away a factor of two in potential window
size (512mb instead of 1gb) but making the base 64 instead of
32 increases the boundary losses (the truncation discussed below)
and making the exponent 3 bits instead of 2 puts the threshold
for "costly" windows (the per-packet cost discussed below) at 8kb
rather than 16kb.  (Or maybe the notion of a gigabyte in transit
scares me.)

If the receiver picks a random size for its buffer, some
precision will be lost when the mantissa is truncated to 14 bits.
This could reduce throughput (the sender would see a smaller
buffer than is actually available) but the effect could be at
most 2^5/2^14 = 0.2% (e.g., 31 bytes lost in a 16kb+31b buffer
truncated to a 16kb window) and couldn't happen unless the
available buffer were more than 16kb.  Presumably we'd try to
make our packet and buffer sizes be a power of 2 (or at least a
multiple of 32) if we're doing this large a bulk data transfer,
so in practice we wouldn't expect any throughput loss.
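
The 0.2% bound holds for every exponent, not just the first.  A
rough check in C, assuming the sender always uses the smallest
exponent whose mantissa fits in 14 bits (the function names are
invented for this sketch):

```c
#include <stdint.h>

/* Most bytes that can be dropped when the low 5*e bits are truncated. */
uint32_t max_truncation_loss(unsigned e)
{
    return ((uint32_t)1 << (5 * e)) - 1;
}

/*
 * Smallest window that needs exponent e (for e = 1, 2 or 3): the
 * first value whose mantissa no longer fits in 14 bits at e - 1.
 */
uint32_t min_window_for_exponent(unsigned e)
{
    return (uint32_t)1 << (14 + 5 * (e - 1));
}

/* Bytes lost when a particular window is advertised with exponent e. */
uint32_t truncation_loss(uint32_t win, unsigned e)
{
    return win & max_truncation_loss(e);
}
```

For e = 1, 2 and 3 the worst loss is 31, 1023 and 32767 bytes
against windows of at least 2^14, 2^19 and 2^24 bytes, so the ratio
stays just under 2^5/2^14 in each case.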

Since option 1 doesn't let a receiver turn scaling on and off,
I imagine a tcp that knew how to scale would always send an
<I Can Scale> option with its SYN packets.  Thus the code that
put windows into and took windows out of packets would have
something like the following added:

        if (win >= 16384)
                if (other_side_can_scale)
                        (scale the window)

By picking the mantissa as large as possible, this new mechanism
adds only one compare per side per packet in the "usual case"
(a window < 16kb).
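
One way the sending half of that check could be filled in, as a
sketch only: the function name is hypothetical, and the two clamps
(to the largest encodable window when scaling, and to 65535 when
the other side can't scale) are assumptions that the text above
does not specify.

```c
#include <stdint.h>

#define UNSCALED_LIMIT    16384u                   /* 2^14 */
#define MAX_SCALED_WINDOW ((uint32_t)0x3FFF << 15) /* 2^29 - 2^15 */

/* Build the 16 bit window field for an outgoing packet. */
uint16_t window_to_field(uint32_t win, int other_side_can_scale)
{
    if (win >= UNSCALED_LIMIT) {        /* the only test in the usual case */
        if (other_side_can_scale) {
            unsigned e = 0;

            if (win > MAX_SCALED_WINDOW)    /* assumed clamp */
                win = MAX_SCALED_WINDOW;
            /* smallest exponent whose mantissa fits in 14 bits */
            while ((win >> (5 * e)) > 0x3FFF)
                e++;
            return (uint16_t)((e << 14) | (win >> (5 * e)));
        }
        if (win > 0xFFFF)               /* assumed clamp for an unscaled peer */
            win = 0xFFFF;
    }
    return (uint16_t)win;               /* sent as it is now */
}
```

A window below 16kb falls straight through the first compare and is
sent unchanged, which is the "usual case" cost mentioned above.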

This proposal changes 4 fields in the 4bsd per-connection state
from shorts to ints (the receive window, send window, congestion
window and slow-start-to-dynamic-window threshold).  That means
adding 8 bytes to a structure that has room for 2 more bytes.
We might have to get (god forbid) "clever" to make things fit
in a way that's compatible with 4.2 & 4.3bsd (the current bsd
can use kernel malloc rather than mbufs for the per-connection
state so it doesn't have a problem).


Comments, questions or alternate proposals would be welcome.

 - Van