From van@lbl-csam.arpa  Wed Dec 16 00:49:31 1987
Posted-Date: Wed, 16 Dec 87 00:51:24 PST
Received-Date: Wed, 16 Dec 87 00:49:31 PST
Received: from LBL-CSAM.ARPA by venera.isi.edu (5.54/5.51)
        id AA17690; Wed, 16 Dec 87 00:49:31 PST
Received: by lbl-csam.arpa (5.58/1.18)
        id AA04676; Wed, 16 Dec 87 00:51:25 PST
Message-Id: <8712160851.AA04676@lbl-csam.arpa>
To: braden@venera.isi.edu
Cc: end2end-interest@venera.isi.edu, karels@okeeffe.berkeley.edu
Subject: Re: a tcp wide-window option 
In-Reply-To: Your message of Mon, 07 Dec 87 11:52:15 PST.
Date: Wed, 16 Dec 87 00:51:24 PST
From: Van Jacobson <van@lbl-csam.arpa>

Bob,

I love the idea of combining large windows & selective ack into
one rfc.  While we're at it, can we add one more option: it would
be nice if a receiver could echo back the value of some sender
option (the tcp equivalent of an icmp echo request/reply).  With
this it's pretty easy to do round-trip-timing on a large window
(the sender puts this option in each outgoing packet, making the
value be the current clock time).  Without this you either
maintain a LOT of timers (512 of them for a Wideband-sized 256KB
window) or use the current "measure-1-packet-per-window" scheme
and suffer horrible aliasing when the network utilization goes
up.  This echo option would only be used if the receiver set the
bit that said "I can echo" in the flags of the SYN option.  (I
can expand on how I think this option should work at the receiver
if there's interest -- if it's used for timing you have to specify
which option gets sent back in a delayed ack and what gets sent back
for the ack of a retransmit that fills a hole in the sequence space). 
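
As a rough illustration of the sender side of such an echo option, here is
a minimal sketch in C.  It assumes a hypothetical 32-bit option value that
holds the sender's clock in ticks; the names echo_stamp and echo_rtt are
invented for this example and come from no real implementation.  It does
not address the receiver-side questions raised above (delayed acks,
retransmits that fill a hole).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sender-side helpers for the echo option.  The option value
 * is assumed to be a 32-bit clock reading in ticks. */

/* Value the sender places in the option of each outgoing packet. */
static uint32_t echo_stamp(uint32_t now_ticks)
{
        return now_ticks;
}

/* Round-trip time, in ticks, when an ack arrives carrying the echoed
 * value.  The option is only meaningful if the peer said "I can echo"
 * in its SYN, so that is asserted here.  Unsigned subtraction keeps the
 * result correct across a single wrap of the clock. */
static uint32_t echo_rtt(int peer_can_echo, uint32_t now_ticks,
                         uint32_t echoed)
{
        assert(peer_can_echo);
        return now_ticks - echoed;
}
```

Because the timestamp travels with the packet, the sender needs no
per-packet timer state: every ack that carries an echoed value yields one
round-trip sample.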


I think wide-window proposal 2 was Dave Clark's counter to my
proposal 3 at the task force meeting.  I tried implementing proposal
2 when I got back from the meeting and ran into some problems:

  (a) some ack packets are different than others (they have
      this option).  Deciding when to tack on the option adds
      some new logic to the tcp_output mainline.

  (b) tcp delivers sequence space reliably, not packets.  Thus
      the packet containing the window scaling option could be
      dropped and it would need to be retransmitted.  But this
      is a retransmit from the RECEIVER side (the receiver is
      trying to tell the sender about a bigger window).  Figuring
      out when and how to do this retransmit seemed non-trivial.
      (Again, remember that the scale option is going rcvr to
      sender in ack packets.  Since an ack is cumulative, the
      first ack in the sequence
                ack 9, scale 1024
                ack 10
                ...
      could be dropped and the receiver wouldn't be able to tell.
      Since the "ack 10" implicitly acks 9, the sender behaves
      as if the "ack 9" had arrived).

  (c) I couldn't find anything in rfc793 that said "when" non-
      initial options are processed.  To preserve the semantics
      of the window, one wants the sequence
                ack 10, scale 10, win 5
                ack 11, scale 1, win 50
                ack 12, win 50
      processed in that order.  But at least one implementation,
      4bsd, processes options before sequencing the packet.  Thus
      if the net re-orders the 1st and 2nd acks, the 3rd ack will
      set the window to 500, not 50.
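
The reordering case in (c) can be reproduced with a toy model.  The sketch
below is not 4bsd code; the structures, field names and the opts_first
switch are all hypothetical, and sequence comparisons ignore wraparound.
It only shows how applying the scale option before the ack is sequenced
turns the final window into 500 instead of 50.

```c
#include <assert.h>
#include <stdint.h>

/* One received ack under proposal 2 (hypothetical layout). */
struct ack {
        uint32_t ackno;         /* cumulative ack number */
        uint32_t scale;         /* scale option value, 0 if option absent */
        uint32_t win;           /* raw window field */
};

/* Sender-side state (hypothetical names). */
struct conn {
        uint32_t snd_una;       /* highest ack seen so far */
        uint32_t scale;         /* current window scale factor */
        uint32_t snd_wnd;       /* effective send window */
};

/* Process one ack.  With opts_first set, the scale option is applied
 * before the ack is checked against snd_una (options before sequencing).
 * Otherwise a stale ack is discarded before its option is looked at. */
static void ack_input(struct conn *c, const struct ack *a, int opts_first)
{
        assert(c != 0 && a != 0);
        if (opts_first && a->scale != 0)
                c->scale = a->scale;
        if (a->ackno <= c->snd_una)
                return;                 /* old ack: no window update */
        if (!opts_first && a->scale != 0)
                c->scale = a->scale;
        c->snd_una = a->ackno;
        c->snd_wnd = a->win * c->scale;
}

/* Feed the three acks from the example with the first two swapped by
 * the network, and return the resulting send window. */
static uint32_t reordered_window(int opts_first)
{
        static const struct ack seq[3] = {
                { 11, 1, 50 },          /* 2nd ack arrives first */
                { 10, 10, 5 },          /* 1st ack arrives late */
                { 12, 0, 50 },          /* 3rd ack, no scale option */
        };
        struct conn c = { 9, 1, 0 };
        int i;

        for (i = 0; i < 3; i++)
                ack_input(&c, &seq[i], opts_first);
        return c.snd_wnd;
}
```

Sequencing before option processing fixes this particular reordering in
the model, but it does nothing for the dropped-option case in (b).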

The only easy way I could see around these problems was to put
the scaling information in every packet.  That meant proposal 3
or proposal 1.  I was partial to proposal 1 because it didn't
change the header length (a factor when we try to handle faster
data rates).

The complication in proposal 1 may have been more in my explanation
than in the implementation.  Two lines are added to the window
handling code in tcp_input:

        if (win & mask)
                win = (win & ~mask) << wshift[(win & mask) >> 14];

where "mask" is set to 0xc000 if the other side said "I scale" and
zero otherwise, and wshift is a static array containing the 4 shift
counts: 0, 5, 10 and 15 (i.e., scale factors of 1, 32, 1024 and 32768).

Four lines get added to the window handling code in tcp_output:

        if (mask && win >= (1 << 14))
                if (win < (1 << 19))
                        win = (win >> 5) | 0x4000;
                else if (win < (1 << 24))
                        win = (win >> 10) | 0x8000;
                else
                        win = (win >> 15) | 0xc000;

where "mask" is the same per-connection state variable used in
tcp_input.  (I'm sure others can come up with better code but
this gives you an idea of what's involved.) Proposals 2 & 3 are
simpler, of course, since the code is just "win *= scale;" at
input and "win /= scale;" at output.  But I think the performance
difference will be unmeasurable and the delta complication of 7
lines vs. 2 lines doesn't seem that large to me. 
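
For readers who want to experiment with the proposal 1 encoding, here is a
self-contained sketch of both directions.  It assumes the top two bits of
the 16-bit window field select one of the four shifts and the low 14 bits
carry the value, which is one reading of the fragments above rather than a
specification.  The function names win_encode and win_decode and the macro
WSCALE_MASK are invented for the example, and the peer is assumed to have
said "I scale".

```c
#include <assert.h>
#include <stdint.h>

#define WSCALE_MASK 0xc000u     /* top two bits select the shift */

static const int wshift[4] = { 0, 5, 10, 15 };

/* Encode a true window, in bytes, into the 16-bit header field. */
static uint16_t win_encode(uint32_t win)
{
        assert(win < (1u << 29));       /* most that 14 bits << 15 can hold */
        if (win < (1u << 14))
                return (uint16_t)win;   /* fits unscaled, selector 00 */
        if (win < (1u << 19))
                return (uint16_t)((win >> 5) | 0x4000u);
        if (win < (1u << 24))
                return (uint16_t)((win >> 10) | 0x8000u);
        return (uint16_t)((win >> 15) | 0xc000u);
}

/* Decode the 16-bit header field back into a window in bytes. */
static uint32_t win_decode(uint16_t field)
{
        uint32_t value = field & 0x3fffu;               /* low 14 bits */
        return value << wshift[(field & WSCALE_MASK) >> 14];
}
```

A 256KB window (262144 bytes) encodes as 0x6000 and decodes exactly.
Windows that are not a multiple of the scale factor are rounded down, so
100001 bytes comes back as 100000.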

 - Van

