Network Working Group R. Braden Request for Comments: 1379 ISI November 1992 Extending TCP for Transactions -- Concepts Status of This Memo This memo provides information for the Internet community. It does not specify an Internet standard. Distribution of this memo is unlimited. Abstract This memo discusses extension of TCP to provide transaction-oriented service, without altering its virtual-circuit operation. This extension would fill the large gap between connection-oriented TCP and datagram-based UDP, allowing TCP to efficiently perform many applications for which UDP is currently used. A separate memo contains a detailed functional specification for this proposed extension. This work was supported in part by the National Science Foundation under Grant Number NCR-8922231. TABLE OF CONTENTS 1. INTRODUCTION .................................................. 2 2. TRANSACTIONS USING STANDARD TCP ............................... 3 3. BYPASSING THE 3-WAY HANDSHAKE ................................. 6 3.1 Concept of TAO ........................................... 6 3.2 Cache Initialization ..................................... 10 3.3 Accepting Segments ............................. 11 4. SHORTENING TIME-WAIT STATE .................................... 13 5. CHOOSING A MONOTONIC SEQUENCE ................................. 15 5.1 Cached Timestamps ........................................ 16 5.2 Current TCP Sequence Numbers ............................. 18 5.3 64-bit Sequence Numbers .................................. 20 5.4 Connection Counts ........................................ 20 5.5 Conclusions .............................................. 21 6. CONNECTION STATES ............................................. 24 7. CONCLUSIONS AND ACKNOWLEDGMENTS ............................... 32 APPENDIX A: TIME-WAIT STATE AND THE 2-PACKET EXCHANGE ............ 34 REFERENCES ....................................................... 37 Security Considerations .......................................... 38 Author's Address ................................................. 38 Braden [Page 1] RFC 1379 Transaction TCP -- Concepts November 1992 1. INTRODUCTION The TCP protocol [STD-007] implements a virtual-circuit transport service that provides reliable and ordered data delivery over a full-duplex connection. Under the virtual circuit model, the life of a connection is divided into three distinct phases: (1) opening the connection to create a full-duplex byte stream; (2) transferring data in one or both directions over this stream; and (3) closing the connection. Remote login and file transfer are examples of applications that are well suited to virtual-circuit service. Distributed applications, which are becoming increasingly numerous and sophisticated in the Internet, tend to use a transaction-oriented rather than a virtual circuit style of communication. Currently, a transaction-oriented Internet application must choose to suffer the overhead of opening and closing TCP connections or else build an application-specific transport mechanism on top of the connectionless transport protocol UDP. Greater convenience, uniformity, and efficiency would result from widely-available kernel implementations of a transport protocol supporting a transaction service model [RFC- 955]. The transaction service model has the following features: * The fundamental interaction is a request followed by a response. * An explicit open or close phase would impose excessive overhead. * At-most-once semantics is required; that is, a transaction must not be "replayed" by a duplicate request packet. * In favorable circumstances, a reliable request/response handshake can be performed with exactly one packet in each direction. * The minimum transaction latency for a client is RTT + SPT, where RTT is the round-trip time and SPT is the server processing time. We use the term "transaction transport protocol" for a transport- layer protocol that follows this model [RFC-955]. The Internet architecture allows an arbitrary collection of transport protocols to be defined on top of the minimal end-to-end datagram service provided by IP [Clark88]. In practice, however, production systems implement only TCP and UDP at the transport layer. It has proven difficult to leverage a new transport protocol into place, to be widely enough available to be useful for application builders. Braden [Page 2] RFC 1379 Transaction TCP -- Concepts November 1992 This memo explores an alternative approach to providing a transaction transport protocol: extending TCP to implement the transaction service model, while continuing to support the virtual circuit model. Each transaction will then be a single instance of a TCP connection. The proposed transaction extension is effectively implementable within current TCPs and operating systems, and it should also scale to the much faster networks, interfaces, and CPUs of the future. The present memo explains the theory behind the extension, in somewhat exquisite detail. Despite the length and complexity of this memo, the TCP extensions required for transactions are in fact quite limited and simple. Another memo [TTCP-FS] provides a self-contained functional specification of the extensions. Section 2 of this memo describes the limitations of standard TCP for transaction processing, to motivate the extensions. Sections 3, 4, and 5 explore the fundamental extensions that are required for transactions. Section 6 discusses the changes required in the TCP connection state diagram. Finally, Section 7 presents conclusions and acknowledgments. Familiarity with the standard TCP protocol [STD-007] is assumed. 2. TRANSACTIONS USING STANDARD TCP Reliable transfer of data depends upon sequence numbers. Before data transfer can begin, both parties must "synchronize" the connection, i.e, agree on common sequence numbers. The synchronization procedure must preserve at-most-once semantics, i.e., be free from replay hazards due to duplicate packets. The TCP developers adopted a synchronization mechanism known as the 3-way handshake. Consider a simple transaction in which client host A sends a single- segment request to server host B, and B returns a single-segment response. Many current TCP implementations use at least ten segments (i.e., packets) for this sequence: three for the 3-way handshake opening the connection, four to send and acknowledge the request and response data, and three for TCP's full-duplex data-conserving close sequence. These ten segments represent a high relative overhead for two data-bearing segments. However, a more important consideration is the transaction latency seen by the client: 2*RTT + SPT, larger than the minimum by one RTT. As CPU and network speeds increase, the relative significance of this extra transaction latency also increases. Proposed transaction transport protocols have typically used a "timer-based" approach to connection synchronization [Birrell84]. In this approach, once end-to-end connection state is established in the client and server hosts, a subset of this state is maintained for Braden [Page 3] RFC 1379 Transaction TCP -- Concepts November 1992 some period of time. A new request before the expiration of this timeout period can then reestablish the full state without an explicit handshake. Watson pointed out that the timer-based approach of his Delta-T protocol [Watson81] would encompass both virtual circuits and transactions. However, the TCP group adopted the 3-way handshake (because of uncertainty about the robustness of enforcing the packet lifetime bounds required by Delta-T, within a general Internet environment). More recently, Liskov, Shrira, and Wroclawski [Liskov90] have proposed a different timer-based approach to connection synchronization, requiring loosely-synchronized clocks in the hosts. The technique proposed in this memo, suggested by Clark [Clark89], depends upon cacheing of connection state but not upon clocks or timers; it is described in Section 3 below. Garlick, Rom, and Postel also proposed a connection synchronization mechanism using cached state [Garlick77]. Their scheme required each host to maintain connection records containing the highest sequence number on each connection. The technique suggested here retains only per-host state, not per-connection state. During TCP development, it was suggested that TCP could support transactions with data segments containing both SYN and FIN bits. (These "Kamikaze" segments were not supported as a service; they were used mainly to crash other experimental TCPs!) To illustrate this idea, Figure 1 shows a plausible application of the current TCP rules to create a minimal transaction. (In fact, some minor adjustments in the standard TCP spec would be required to make Figure 1 fully legal [STD-007]). Figure 1, like many of the examples shown in this memo, uses an abbreviated form to illustrate segment sequences. For clarity and brevity, it omits explicit sequence and acknowledgment numbers, assuming that these will follow the well-known TCP rules. The notation "ACK(x)" implies a cumulative acknowledgment for the control bit or data "x" and everything preceding "x" in the sequence space. The referent of "x" should be clear from the context. Also, host A will always be the client and host B will be the server in these diagrams. The first three segments in Figure 1 implement the standard TCP three-way handshake. If segment #1 had been an old duplicate, the client side would have sent an RST (Reset) bit in segment #3, terminating the sequence. The request data included on the initial SYN segment cannot be delivered to user B until segment #3 completes the 3-way handshake. Loading control bits onto the segments has reduced the total number of segments to 5, but the client still observes a transaction latency of 2*RTT + SPT. The 3-way handshake Braden [Page 4] RFC 1379 Transaction TCP -- Concepts November 1992 thus precludes high-performance transaction processing. TCP A (Client) TCP B (Server) _______________ ______________ CLOSED LISTEN (Client sends request) 1. SYN-SENT --> --> SYN-RCVD (data1 queued) 2. ESTABLISHED <-- <-- SYN-RCVD 3. FIN-WAIT-1 --> --> CLOSE-WAIT (data1 to server) (Server sends reply) 4. TIME-WAIT <-- <-- LAST-ACK (data2 to client) 5. TIME-WAIT --> --> CLOSED (timeout) CLOSED Figure 1: Transaction Sequence: RFC-793 TCP The TCP close sequence also poses a performance problem for transactions: one or both end(s) of a closed connection must remain in "TIME-WAIT" state until a 4 minute timeout has expired [STD-007]. The same connection (defined by the host and port numbers at both ends) cannot be reopened until this delay has expired. Because of TIME-WAIT state, a client program should choose a new local port number (i.e., a different connection) for each successive transaction. However, the TCP port field of 16 bits (less the "well-known" port space) provides only 64512 available user ports. This limits the total rate of transactions between any pair of hosts to a maximum of 64512/240 = 268 per second. This is much too low a rate for low-delay paths, e.g., high-speed LANs. A high rate of short connections (i.e., transactions) could also lead to excessive consumption of kernel memory by connection control blocks in TIME- WAIT state. In summary, to perform efficient transaction processing in TCP, we need to suppress the 3-way handshake and to shorten TIME-WAIT state. Braden [Page 5] RFC 1379 Transaction TCP -- Concepts November 1992 Protocol mechanisms to accomplish these two goals are discussed in Sections 3 and 4, respectively. Both require the choice of a monotonic sequence-like space; Section 5 analyzes the choices and makes a selection for this space. Finally, the TCP connection state machine must be extended as described in Section 6. Transaction processing in TCP raises some other protocol issues, which are discussed in the functional specification memo [TTCP-FS]. These include: (1) augmenting the user interface for transactions, (2) delaying acknowledgment segments to allow maximum piggy-backing of control bits with data, (3) measuring the retransmission timeout time (RTO) on very short connections, and (4) providing an initial server window. A recently proposed set of enhancements [RFC-1323] defines a TCP Timestamps option that carries two 32-bit timestamp values. The Timestamps option is used to accurately measure round-trip time (RTT). The same option is also used in a procedure known as "PAWS" (Protect Againsts Wrapped Sequence) to prevent erroneous data delivery due to a combination of old duplicate segments and sequence number reuse at very high bandwidths. The particular approach to transactions chosen in this memo does not require the RFC-1323 enhancements; however, they are important and should be implemented in every TCP, with or without the transaction extensions described here. 3. BYPASSING THE 3-WAY HANDSHAKE To avoid 3-way handshakes for transactions, we introduce a new mechanism for validating initial SYN segments, i.e., for enforcing at-most-once semantics without a 3-way handshake. We refer to this as the TCP Accelerated Open, or TAO, mechanism. 3.1 Concept of TAO The basis of TAO is this: a TCP uses cached per-host information to immediately validate new SYNs [Clark89]. If this validation fails, e.g., because there is no current cached state or the segment is an old duplicate, the procedure falls back to a normal 3-way handshake to validate the SYN. Thus, bypassing a 3-way handshake is considered to be an optional optimization. Braden [Page 6] RFC 1379 Transaction TCP -- Concepts November 1992 The proposed TAO mechanism uses a finite sequence-like space of values that increase monotonically with successive transactions (connections) between a given (client, server) host pair. Call this monotonic space M, and let each initial SYN segment carry an M value SEG.M. If M is not the existing sequence (SEG.SEQ) field, SEG.M may be carried in a TCP option. When host B receives from host A an initial SYN segment containing a new value SEG.M, host B compares this against cache.M[A], the latest M value that B has cached for host A. This comparison is the "TAO test". Because the M values are monotonically increasing, SEG.M > cache.M[A] implies that the SYN must be new and can be accepted immediately. If not, a normal 3-way handshake is performed to validate the initial SYN segment. Figure 2 illustrates the TAO mechanism; cached M values are shown enclosed in square brackets. The M values generated by host A satisfy x0 < x1, and the M values generated by host B satisfy y0 < y1. An appropriate choice for the M value space is discussed in Section 5. M values are drawn from a finite number space, so inequalities must be defined in the usual way for sequence numbers [STD-007]. The M space must not wrap so quickly that an old duplicate SYN will be erroneously accepted. We assume that some maximum segment lifetime (MSL) is enforced by the IP layer. ____T_C_P__A_____ ____T_C_P__B_____ cache.M[B] cache.M[A] V V [ y0 ] [ x0 ] 1. --> --> ( (x1 > x0) => data1 -> user_B; cache.M[A]= x1) [ y0 ] [ x1 ] 2. <-- <-- (data2 -> user_A, cache.M[B]= y1) [ y1 ] [ x1 ] ... (etc.) ... Figure 2. TAO: Three-Way Handshake is Bypassed Braden [Page 7] RFC 1379 Transaction TCP -- Concepts November 1992 Figure 2 shows the simplest case: each side has cached the latest M value of the other, and the SEG.M value in the client's SYN segment is greater than the value in the cache at the server host. As a result, B can accept the client A's request data1 immediately and pass it to the server application. B's reply data2 is shown piggybacked on the segment. As a result of this 2-way exchange, the cached M values are updated at both sites; the client side becomes relevant only if the client/server roles reverse. Validation of the segment at host A is discussed later. Figure 3 shows the TAO test failing but the consequent 3-way handshake succeeding. B updates its cache with the value x2 >= x1 when the initial SYN is known to be valid. _T_C_P__A _T_C_P__B cache.M[B] cache.M[A] V V [ y0 ] [ x0 ] 1. --> --> ( (x1 <= x0) => data1 queued; 3-way handshake) [ y0 ] [ x0 ] 2. <-- <-- (cache.M[B]= y1) [ y1 ] [ x0 ] 3. --> --> (Handshake OK => data1->user_B, cache.M[A]= x2) [ y1 ] [ x2 ] ... (etc.) ... Figure 3. TAO Test Fails but 3-Way Handshake Succeeds. There are several possible causes for a TAO test failure on a legitimate new SYN segment (not an old duplicate). (1) There may be no cached M value for this particular client host. (2) The SYN may be the one of a set of nearly-simultaneous SYNs for different connections but from the same host, which Braden [Page 8] RFC 1379 Transaction TCP -- Concepts November 1992 arrived out of order. (3) The finite M space may have wrapped around between successive transactions from the same client. (4) The M values may advance too slowly for closely-spaced transactions. None of these TAO failures will cause a lockout, because the resulting 3-way handshake will succeed. Note that the first transaction between a given host pair will always require a 3-way handshake; subsequent transactions can take advantage of TAO. The per-host cache required by TAO is highly desirable for other reasons, e.g., to retain the measured round trip time and MTU for a given remote host. Furthermore, a host should already have a per-host routing cache [HR-COMM] that should be easily extensible for this purpose. Figure 4 illustrates a complete TCP transaction sequence using the TAO mechanism. Bypassing the 3-way handshake leads to new connection states; Figure 4 shows three of them, "SYN-SENT*", "CLOSE-WAIT*", and "LAST-ACK*". Explanation of these states is deferred to Section 6. TCP A (Client) TCP B (Server) _______________ ______________ CLOSED LISTEN 1. SYN-SENT* --> --> CLOSE-WAIT* (TAO test OK=> data1->user_B) <-- <-- LAST-ACK* 2. TIME-WAIT (data2->user_A) 3. TIME-WAIT --> --> CLOSED (timeout) CLOSED Figure 4: Minimal Transaction Sequence Using TAO Braden [Page 9] RFC 1379 Transaction TCP -- Concepts November 1992 3.2 Cache Initialization The first connection between hosts A and B will find no cached state at one or both ends, so both M caches must be initialized. This requires that the first transaction carry a specially marked SEG.M value, which we call SEG.M.NEW. Receiving a SEG.M.NEW value in an initial SYN segment, B will cache this value and send its own M back to initialize A's cache. When a host crashes and restarts, all its cached M values cache.M[*] must be invalidated in order to force a re-synchronization of the caches at both ends. This cache synchronization procedure is illustrated in Figure 5, where client host A has crashed and restarted with its cache entries undefined, as indicated by "??". Since cache.TS[B] is undefined, A sends a SEG.M.NEW value instead of SEG.M in the segment of its first transaction request to B. Receiving this SEG.M.NEW, the server host B invalidates cache.TS[A] and performs a 3-way handshake. SEG.M in segment #2 updates A's cache, and when the handshake completes successfully, B updates its cached M value to x2 >= x1. _T_C_P__A _T_C_P__B cache.M[B] cache.M[A] V V [ ?? ] [ x0 ] 1. --> --> (invalidate cache; queue data1; [ ?? ] 3-way handshake) [ ?? ] 2. <-- <-- (cache.M[B]= y1) [ y1 ] [ ?? ] 3. --> --> data1->user_B, cache.M[A]= x2) [ y1 ] [ x2 ] ... (etc.) ... Figure 5. Client Host Crashed Suppose that the 3-way handshake failed, presumably because Braden [Page 10] RFC 1379 Transaction TCP -- Concepts November 1992 segment #1 was an old duplicate. Then segment #3 from host A would be an RST segment, with the result that both side's caches would be left undefined. Figure 6 shows the procedure when the server crashes and restarts. Upon receiving a segment from a host for which it has no cached M value, B initiates a 3-way handshake to validate the request and sends its own M value to A. Again the result is to update cached M values on both sides. _T_C_P__A _T_C_P__B cache.M[B] cache.M[A] V V [ y0 ] [ ?? ] 1. --> --> (data1 queued; 3-way handshake) [ y0 ] [ ?? ] 2. <-- <-- (cache.M[B]= y1) [ y1 ] [ ?? ] 3. --> --> (data1->user_B, cache.M[A]= x2) [ y1 ] [ x2 ] ... (etc.) ... Figure 6. Server Host Crashed 3.3 Accepting Segments Transactions introduce a new hazard of erroneously accepting an old duplicate segment. To be acceptable, a segment must arrive in SYN-SENT state, and its ACK field must acknowledge something that was sent. In current TCPs the effective send window in SYN-SENT state is exactly one octet, and an acceptable must exactly ACK this one octet. The clock-driven selection of Initial Sequence Number (ISN) makes an erroneous acceptance exceedingly unlikely. An old duplicate SYN could be accepted erroneously only if successive connection attempts occurred more often than once every 4 microseconds, or if the segment lifetime exceeded the 4 hour wraparound time for ISN Braden [Page 11] RFC 1379 Transaction TCP -- Concepts November 1992 selection. However, when TCP is used for transactions, data sent with the initial SYN increases the range of sequence numbers that have been sent. This increases the danger of accepting an old duplicate segment, and the consequences are more serious. In the example in Figure 7, segments 1-3 form a normal transaction sequence, and segment 4 begins a new transaction (incarnation) for the same connection. Segment #5 is a duplicate of segment #2 from the preceding transaction. Although the new transaction has a larger ISN, the previous ACK value 402 falls into the new range [200,700) of sequence numbers that have been sent, so segment #5 could be erroneously accepted and passed to the client as the response to the new request. _T_C_P__A _T_C_P__B CLOSED LISTEN 1. --> --> (TAO test OK) 2. <-- <-- 3. TIME-WAIT --> --> CLOSED (short timeout) CLOSED (New Request) 4. --> --> ... (Duplicate of segment #2) 5. <-- <--... (Acceptable!!) Figure 7: Old Duplicate Causing Error Unfortunately, we cannot simply use TAO on the client side to detect and reject old duplicate segments. A TAO test at the client might fail for a valid segment, due to out- of-order delivery, and this could result in permanent non-delivery of a valid transaction reply. Instead, we include a second M value, an echo of the client's M value from the initial segment, in the segment. A Braden [Page 12] RFC 1379 Transaction TCP -- Concepts November 1992 specially-marked M value, SEG.M.ECHO, is used for this purpose. The client knows the value it sent in the initial and can therefore positively validate the using the echoed value. This is illustrated in Figure 12, which is the same as Figure 4 with the addition of the echoed value on the segment #2. It should be noted that TCP allows a simultaneous open sequence in which both sides send and receive an initial (see Figure 8 of [STD-007]. In this case, the TAO test must be performed on both sides to preserve the symmetry. See [TTCP-FS] for an example. 4. SHORTENING TIME-WAIT STATE Once a transaction has been initiated for a particular connection (pair of ports) between a given host pair, a new transaction for the same connection cannot take place for a time that is at least: RTT + SPT + TIME-WAIT_delay Since the client host can cycle among the 64512 available port numbers, an upper bound on the transaction rate between a particular host pair is: [1] TRmax = 64512 /(RTT + TIME-WAIT_Delay) in transactions per second (Tps), where we assumed SPT is negligible. We must reduce TIME-WAIT_Delay to support high-rate TCP transaction processing. TIME-WAIT state performs two functions: (1) supporting the full- duplex reliable close of TCP, and (2) allowing old duplicate segments from an earlier connection incarnation to expire before they can cause an error (see Appendix to [RFC-1185]). The first function impacts the application model of a TCP connection, which we would not want to change. The second is part of the fundamental machinery of TCP reliable delivery; to safely truncate TIME-WAIT state, we must provide another means to exclude duplicate packets from earlier incarnations of the connection. To minimize the delay in TIME-WAIT state while performing both functions, we propose to set the TIME-WAIT delay to: [2] TIME-WAIT_Delay = max( K*RTO, U ) where U and K are constants and RTO is the dynamically-determined retransmission timeout, the measured RTT plus an allowance for the Braden [Page 13] RFC 1379 Transaction TCP -- Concepts November 1992 RTT variance [Jacobson88]. We choose K large enough so that there is high probability of the close completing successfully if at all possible; K = 8 seems reasonable. This takes care of the first function of TIME-WAIT state. In a real implementation, there may be a minimum RTO value Tr, corresponding to the precision of RTO calculation. For example, in the popular BSD implementation of TCP, the minimum RTO is Tr = 0.5 second. Assuming K = 8 and U = 0, Eqns [1] and [2] impose an upper limit of TRmax = 16K Tps on the transaction rate of these implementations. It is possible to have many short connections only if RTO is very small, in which case the TIME-WAIT delay [2] reduces to U. To accelerate the close sequence, we need to reduce U below the MSL enforced by the IP layer, without introducing a hazard from old duplicate segments. For this purpose, we introduce another monotonic number sequence; call it X. X values are required to be monotonic between successive connection incarnations; depending upon the choice of the X space (see Section 5), X values may also increase during a connection. A value from the X space is to be carried in every segment, and a segment is rejected if it is received with an X value smaller than the largest X value received. This mechanism does not use a cache; the largest X value is maintained in the TCP connection control block (TCB) for each connection. The value of U depends upon the choice for the X space, discussed in the next section. If X is time-like, U can be set to twice the time granularity (i.e, twice the minimum "tick" time) of X. The TIME-WAIT delay will then ensure that current X values do not overlap the X values of earlier incarnations of the same connection. Another consequence of time-like X values is the possibility that an open but idle connection might allow the X value to wrap its sign bit, resulting in a lockup of the connection. To prevent this, a 24-day idle timer on each open connection could bypass the X check on the first segment following the idle period, for example. In practice, many implementations have keep-alive mechanisms that prevent such long idle periods [RFC-1323]. Referring back to Figure 4, our proposed transaction extension results in a minimum exchange of 3 packets. Segment #3, the final ACK segment, does not increase transaction latency, but in combination with the TIME-WAIT delay of K*RTO it ensures that the server side of the connection will be closed before a new transaction is issued for this same pair of ports. It also provides an RTT measurement for the server. We may ask whether it would be possible to further reduce the TIME- Braden [Page 14] RFC 1379 Transaction TCP -- Concepts November 1992 WAIT delay. We might set K to zero; alternatively, we might allow the client TCP to start a new transaction request while the connection was still in TIME-WAIT state, with the new initial SYN acting as an implied acknowledgment of the previous FIN. Appendix A summarizes the issues raised by these alternatives, which we call "truncating" TIME-WAIT state, and suggests some possible solutions. Further study would be required, but these solutions appear to bend the theory and/or implementations of the TCP protocol farther than we wish to bend them. We therefore propose using formula [2] with K=8 and retaining the final ACK(FIN) transmission. To raise the transaction rate, therefore, we require small values of RTO and U. 5. CHOOSING A MONOTONIC SEQUENCE For simplicity, we want the monotonic sequence X used for shortening TIME-WAIT state to be identical to the monotonic sequence M for bypassing the 3-way handshake. Calling the common space M, we will send an M value SEG.M in each TCP segment. Upon receipt of an initial SYN segment, SEG.M will be compared with a per-host cached value to authenticate the SYN without a 3-way handshake; this is the TAO mechanism. Upon receipt of a non-SYN segment, SEG.M will be compared with the current value in the connection control block and used to discard old duplicates. Note that the situation with TIME-WAIT state differs from that of bypassing 3-way handshakes in two ways: (a) TIME-WAIT requires duplicate detection on every segment vs. only on SYN segments, and (b) TIME-WAIT applies to a single connection vs. being global across all connections. This section discusses possible choices for the common monotonic sequence. The SEG.M values must satisfy the following requirements. * The values must be monotonic; this requirement is defined more precisely below. * Their granularity must be fine-grained enough to support a high rate of transaction processing; the M clock must "tick" at least once between successive transactions. * Their range (wrap-around time) must be great enough to allow a realistic MSL to be enforced by the network. The TCP spec calls for an MSL of 120 secs. Since much of the Internet does not carefully enforce this limit, it would be safer to have an MSL at least an order of magnitude larger. We set as an Braden [Page 15] RFC 1379 Transaction TCP -- Concepts November 1992 objective an MSL of at least 2000 seconds. If there were no TIME- WAIT delay, the ultimate limit on transaction rate would be set by speed-of-light delays in the network and by the latency of host operating systems. As the bottleneck problems with interfacing CPUs to gigabit LANs are solved, we can imagine transaction durations as short as 1 microsecond. Therefore, we set an ultimate performance goal of TRmax at least 10**6 Tps. A particular connection between hosts A and B is identified by the local and remote TCP "sockets", i.e., by the quadruplet: {A, B, Port.A, Port.B}. Imagine that each host keeps a count CC of the number of TCP connections it has initiated. We can use this CC number to distinguish different incarnations of the same connection. Then a particular SEG.M value may be labeled implicitly by 6 quantities: {A, B, Port.A, Port.B, CC, n}, where n is the byte offset of that segment within the connection incarnation. To bypass the 3-way handshake, we require thgt SEG.M values on successive SYN segments from a host A to a host B be monotone increasing. If CC' > CC, then we require that: SEG.M(A,B,Port.A,Port.B,CC',0) > SEG.M(A,B,Port.A,Port.B,CC,0) for any legal values of Port.A and Port.B. To delete old duplicates (allowing TIME-WAIT state to be shortened), we require that SEG.M values be disjoint across different incarnations of the same connection. If CC' > CC then SEG.M(A,B,Port.A,Port.B,CC',n') > SEG.M(A,B,Port.A,Port.B,CC,n), for any non-negative integers n and n'. We now consider four different choices for the common monotonic space: RFC-1323 timestamps, TCP sequence numbers, the connection count, and 64-bit TCP sequence numbers. The results are summarized in Table I. 5.1 Cached Timestamps The PAWS mechanism [RFC-1323] uses TCP "timestamps" as monotonically increasing integers in order to throw out old duplicate segments within the same incarnation. Jacobson suggested the cacheing of these timestamps for bypassing 3-way handshakes [Jacobson90], i.e., that TCP timestamps be used for our common monotonic space M. This idea is attractive since it would allow the same timestamp options to be used for RTTM, PAWS, and transactions. Braden [Page 16] RFC 1379 Transaction TCP -- Concepts November 1992 To obtain at-most-once service, the criterion for immediate acceptance of a SYN must be that SEG.M is strictly greater than the cached M value. That is, to be useful for bypassing 3-way handshakes, the timestamp clock must tick at least once between any two successive transactions between the same pair of hosts (even if different ports are used). Hence, the timestamp clock rate would determine TRmax, the maximum possible transaction rate. Unfortunately, the timestamp clock frequency called for by RFC- 1323, in the range 1 sec to 1 ms, is much too slow for transactions. The TCP timestamp period was chosen to be comparable to the fundamental interval for computing and scheduling retransmission timeouts; this is generally in the range of 1 sec. to 1 ms., and in many operating systems, much closer to 1 second. Although it would be possible to increase the timestamp clock frequency by several orders of magnitude, to do so would make implementation more difficult, and on some systems excessively expensive. The wraparound time for TCP timestamps, at least 24 days, causes no problem for transactions. The PAWS mechanism uses TCP timestamps to protect against old duplicate non-SYN segments from the same incarnation [RFC-1323]. It can also be used to protect against old duplicate data segments from earlier incarnations (and therefore allow shortening of TIME-WAIT state) if we can ensure that the timestamp clock ticks at least once between the end of one incarnation and the beginning of the next. This can be achieved by setting U = 2 seconds, i.e., to twice the maximum timestamp clock period. This value in formula [2] leads to an upper bound TRmax = 32K Tps between a host pair. However, as pointed out above, old duplicate SYN detection using timestamps leads to a smaller transaction rate bound, 1 Tps, which is unacceptable. In addition, the timestamp approach is imperfect; it allows old ACK segments to enter the new connection where they can cause a disconnect. This happens because old duplicate ACKs that arrive during TIME-WAIT state generate new ACKs with the current timestamp [RFC-1337]. We therefore conclude that timestamps are not adequate as the monotonic space M; see Table I. However, they may still be useful to effectively extend some other monotonic number space, just as they are used in PAWS to extend the TCP sequence number space. This is discussed below. Braden [Page 17] RFC 1379 Transaction TCP -- Concepts November 1992 5.2 Current TCP Sequence Numbers It is useful to understand why the existing 32-bit TCP sequence numbers do not form an appropriate monotonic space for transactions. The sequence number sent in an initial SYN is called the Initial Sequence Number or ISN. According to the TCP specification, an ISN is to be selected using: [3] ISN = (R*T) mod 2**32 where T is the real time in seconds (from an arbitrary origin, fixed when the system is started) and R is a constant, currently 250 KBps. These ISN values form a monotonic time sequence that wraps in 4.55 hours = 16380 seconds and has a granularity of 4 usecs. For transaction rates up to roughly 250K Tps, the ISN value calculated by formula [3] will be monotonic and could be used for bypassing the 3-way handshake. However, TCP sequence numbers (alone) could not be used to shorten TIME-WAIT state, because there are several ways that overlap of the sequence space of successive incarnations can occur (as described in Appendix to [RFC-1185]). One way is a "fast connection", with a transfer rate greater than R; another is a "long" connection, with a duration of approximately 4.55 hours. TIME-WAIT delay is necessary to protect against these cases. With the official delay of 240 seconds, formula [1] implies a upper bound (as RTT -> 0) of TRmax = 268 Tps; with our target MSL of 2000 sec, TRmax = 32 Tps. These values are unacceptably low. To improve this transaction rate, we could use TCP timestamps to effectively extend the range of the TCP sequence numbers. Timestamps would guard against sequence number wrap-around and thereby allow us to increase R in [3] to exceed the maximum possible transfer rate. Then sequence numbers for successive incarnations could not overlap. Timestamps would also provide safety with an MSL as large as 24 days. We could then set U = 0 in the TIME-WAIT delay calculation [2]. For example, R = 10**9 Bps leads to TRmax <= 10**9 Tps. See 2(b) in Table I. These values would more than satisfy our objectives. We should make clear how this proposal, sequence numbers plus timestamps, differs from the timestamps alone discussed (and rejected) in the previous section. The difference lies in what is cached and tested for TAO; the proposal here is to cache and test BOTH the latest TCP sequence number and the latest TCP timestamp. In effect, we are proposing to use timestamps to logically extend Braden [Page 18] RFC 1379 Transaction TCP -- Concepts November 1992 the sequence space to 64 bits. Another alternative, presented in the next section, is to directly expand the TCP sequence space to 64 bits. Unfortunately, the proposed solution (TCP sequence numbers plus timestamps) based on equation [3] would be difficult or impossible to implement on many systems, which base their TCP implementation upon a very low granularity software clock, typically O(1 sec). To adapt the procedure to a system with a low granularity software clock, suppose that we calculate the ISN as: [4] ISN = ( R*Ts*floor(T/Ts) + q*CC) mod 2**32 where Ts is the time per tick of the software clock, CC is the connection count, and q is a constant. That is, the ISN is incremented by the constant R*Ts once every clock tick and by the constant q for every new connection. We need to choose q to obtain the required monotonicity. For monotonicity of the ISN's themselves, q=1 suffices. However, monotonicity during the entire connection requires q = R*Ts. This value of q can be deduced as follows. Let S(T, CC, n) be the sequence number for byte offset n in a connection with number CC at time T: S(T, CC, n) = (R*Ts*floor(T/Ts) + q*CC + n) mod 2**32. For any T1 > T2, we require that: S(T2, CC+1, 0) - S(T1, CC, n) > 0 for all n. Since R is assumed to be an upper bound on the transfer rate, we can write down: R > n/(T2 - T1), or T2/Ts - T1/Ts > n/(R*Ts) Using the relationship: floor(x)-floor(y) > x-y-1 and a little algebra leads to the conclusion that using q = R*Ts creates the required monotonic number sequence. Therefore, we consider: [5] ISN = R*Ts*(floor(T/Ts) + CC) mod 2**32 (which is the algorithm used for ISN selection by BSD TCP). For error-free operation, the sequence numbers generated by [5] must not wrap the sign bit in less than MSL seconds. Since CC cannot increase faster than TRmax, the safe condition is: R* (1 + Ts*TRmax) * MSL < 2**31. We are interested in the case: Ts*TRmax >> 1, so this relationship Braden [Page 19] RFC 1379 Transaction TCP -- Concepts November 1992 reduces to: [6] R * Ts * TRmax * MSL < 2**31. This shows a direct trade-off among the maximum effective bandwidth R, the maximum transaction rate TRmax, and the maximum segment lifetime MSL. For reasonable limiting values of R, Ts, and MSL, formula [6] leads to a very low value of TRmax. For example, with MSL= 2000 secs, R=10**9 Bps, and Ts = 0.5 sec, TRmax < 2*10**-3 Tps. To ease the situation, we could supplement sequence numbers with timestamps. This would allow an effective MSL of 2 seconds in [6], since longer times would be protected by differing timestamps. Then TRmax < 2**30/(R*Ts). The actual enforced MSL would be increased to 24 days. Unfortunately, TRmax would still be too small, since we want to support transfer rates up to R ~ 10**9 Bps. Ts = 0.5 sec would imply TRmax ~ 2 Tps. On many systems, it appears infeasible to decrease Ts enough to obtain an acceptable TRmax using this approach. 5.3 64-bit TCP Sequence Numbers Another possibility would be to simply increase the TCP sequence space to 64 bits as suggested in [RFC-1263]. We wou