This is a lightweight, user-space implementation of RFC 793 (TCP), without any
reliance on an IP layer.  It can be used to provide multiple in-order, reliable
streams on top of any datagram layer.

UTCP does not rely on a specific event system. Instead, the application feeds
it incoming packets using utcp_recv(), and outgoing data for the streams
using utcp_send(). Most of the rest is handled by callbacks. The application
must, however, call utcp_timeout() regularly so that UTCP can handle packet loss.

The application should run utcp_init() for every peer it wants to communicate
with.

DIFFERENCES FROM RFC 793:

* No checksum. UTCP requires the application to handle packet integrity.
* 32-bit window size. Big window sizes are the default.
* No ECN, PSH, URG

TODO v1.0:

* Implement send buffer
* Window scaling
* Handle retransmission
  - Proper timeout handling

TODO v2.0:

* Nagle (add PSH back to signal receiver that now we want an immediate ACK?)
* NAK and SACK
* Congestion window scaling
* Timestamps?

Future ideas:

Fast open:
        SYN + data?

Receive-only open:
        SYN|FIN

Fast transaction:
        SYN|FIN + request data ->
        <- SYN|ACK|FIN + response data
        ACK ->

Does this need special care or can we rely on higher level MACs?

RFCs
----

793  Transmission Control Protocol (Functional Specification)
2581 TCP Congestion Control
2988 Computing TCP's Retransmission Timer

INVARIANTS
----------

- snd.una: the sequence number of the first byte we have not yet received an ACK for
- snd.nxt: the sequence number of the first byte after the last packet we sent (due to retransmission, this may go backwards)
- snd.wnd: the number of bytes we have left in our (UTCP/application?) input buffer
- snd.last: the sequence number of the last byte that was enqueued in the TCP stream (never decreases)

- rcv.nxt: the sequence number of the first byte after the last one we passed up to the application
- rcv.wnd: the number of bytes the receiver has left in its input buffer (may be more/less than our send buffer size)

- The only packets that do not have ACK set must have either SYN or RST set.
- Only packets received with rcv.nxt <= hdr.seq <= rcv.nxt + rcv.wnd are valid; drop all others.
- If a packet has ACK set and hdr.ack is higher than snd.una, update snd.una,
  but do not advance it past snd.nxt. (RST in that case?)

- SYN and FIN each count as one byte for the sequence numbering, but no actual byte is transferred in the payload.

CONNECTION TIMEOUT
------------------

This timer is intended to catch the case where we have been waiting a long time for a response but nothing happens.
The timeout is on the order of minutes.

- The conn timeout is set whenever there is unacknowledged data, or when we are in the TIME_WAIT state.
- If snd.una is advanced while the timeout is set, we restart the timeout.
- If the conn timeout expires, close the connection immediately.

RETRANSMIT TIMEOUT
------------------

(See RFC 2988, 3366)

This timer is intended to catch the case where we didn't get an ACK from the peer.
In principle, the timeout should be slightly longer than the maximum round-trip latency along the path.

- The rtrx timeout is set whenever snd.nxt is advanced.
- If the rtrx timeout expires, retransmit at least one packet, and restart the timeout.

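RFC 2988 describes how to derive the timeout from measured round-trip times.
The sketch below follows that RFC (initial RTO of 3 s, alpha = 1/8, beta = 1/4,
a 1 s lower bound and exponential back-off). The struct, the function names,
the microsecond units and the 1 ms clock granularity are assumptions for this
example; whether UTCP uses these exact constants is not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RTO_INIT_US 3000000u    /* RFC 2988 (2.1): 3 s before any measurement */
#define RTO_MIN_US  1000000u    /* RFC 2988 (2.4): round up to 1 s */
#define RTO_MAX_US  60000000u   /* RFC 2988 (2.5): a cap of at least 60 s is allowed */
#define CLOCK_G_US  1000u       /* assumed clock granularity: 1 ms */

struct rto_est {
    bool have_sample;
    uint32_t srtt_us, rttvar_us, rto_us;
};

static void rto_init(struct rto_est *e) {
    e->have_sample = false;
    e->srtt_us = 0;
    e->rttvar_us = 0;
    e->rto_us = RTO_INIT_US;
}

/* Feed one round-trip measurement r_us (taken from a non-retransmitted packet). */
static void rto_sample(struct rto_est *e, uint32_t r_us) {
    assert(r_us <= RTO_MAX_US);   /* keeps the 32-bit arithmetic below in range */
    if (!e->have_sample) {
        e->srtt_us = r_us;
        e->rttvar_us = r_us / 2;
        e->have_sample = true;
    } else {
        uint32_t err = e->srtt_us > r_us ? e->srtt_us - r_us : r_us - e->srtt_us;
        e->rttvar_us = (3 * e->rttvar_us + err) / 4;   /* beta = 1/4, uses the old SRTT */
        e->srtt_us = (7 * e->srtt_us + r_us) / 8;      /* alpha = 1/8 */
    }
    uint32_t k_var = 4 * e->rttvar_us;
    uint32_t rto = e->srtt_us + (k_var > CLOCK_G_US ? k_var : CLOCK_G_US);
    if (rto < RTO_MIN_US) rto = RTO_MIN_US;
    if (rto > RTO_MAX_US) rto = RTO_MAX_US;
    e->rto_us = rto;
}

/* The rtrx timer expired: double the timeout before restarting it. */
static void rto_backoff(struct rto_est *e) {
    e->rto_us = e->rto_us > RTO_MAX_US / 2 ? RTO_MAX_US : e->rto_us * 2;
}
```
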
STATES
------

For each state: RX = a packet is received, TX = the application wants to send data,
RT = the retransmit timer expires, RST = a RST packet is received.

CLOSED: this connection is closed, all packets received will result in RST.
  RX: RST
  TX: return error
  RT: clear timers
  RST: ignore

LISTEN: (= no connection yet): only allow SYN packets; if the application does not accept, return RST|ACK, else SYN|ACK.
  RX: on accept, send SYN|ACK, go to SYN_RECEIVED
  TX: cannot happen
  RT: cannot happen
  RST: ignore

SYN_SENT: we sent a SYN, now expecting SYN|ACK
  RX: must be valid SYN|ACK, send ACK, go to ESTABLISHED
  TX: put in send buffer (TODO: send SYN again with data?)
  RT: send SYN again

SYN_RECEIVED: we received a SYN, sent back a SYN|ACK, now expecting an ACK
  RX: must be valid ACK, go to ESTABLISHED
  TX: put in send buffer (TODO: send SYN|ACK again with data?)
  RT: send SYN|ACK again

ESTABLISHED: SYN is ACKed, we can now send/receive normal data.
  RX: process data, return ACK. If FIN set, go to CLOSE_WAIT
  TX: put in send buffer, segmentize and send
  RT: send unACKed data again

FIN_WAIT_1: we want to close the connection, and just sent a FIN, waiting for it to be ACKed.
  RX: process data, return ACK. If our FIN is ACKed, go to FIN_WAIT_2; if a FIN was also received, go to CLOSING
  TX: return error
  RT: send unACKed data or else FIN again

FIN_WAIT_2: our FIN is ACKed, just waiting for more data or FIN from the peer.
  RX: process data, return ACK. If a FIN is received, set conn timeout, go to TIME_WAIT (as in RFC 793)
  TX: return error
  RT: should not happen, clear timeouts

CLOSE_WAIT: we received a FIN, we sent back an ACK
  RX: only return an ACK.
  TX: put in send buffer, segmentize and send
  RT: send unACKed data again

CLOSING: we had already sent a FIN, and we received a FIN back, now waiting for it to be ACKed.
  RX: if it's ACKed, set conn timeout, go to TIME_WAIT
  TX: return an error
  RT: send unACKed data or else FIN again

LAST_ACK: we are waiting for the last ACK before we can CLOSE
  RX: if it's ACKed, go to CLOSED
  TX: return an error
  RT: send FIN again

TIME_WAIT: connection is in principle closed, but our last ACK might not have been received, so just wait a while to see if a FIN gets retransmitted so we can resend the ACK.
  RX: if we receive anything, reset conn timeout.
  TX: return an error
  RT: should not happen, clear rtrx timeout

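The RX rows of the state list can be condensed into a single transition
function. The sketch below covers only the state changes, not the packets sent
in response or the timers. The enum, struct rx_event and the function name are
hypothetical. Two transitions follow RFC 793 rather than an explicit RX row:
FIN_WAIT_2 moves to TIME_WAIT on a FIN, and FIN_WAIT_1 moves straight to
TIME_WAIT when one packet both ACKs our FIN and carries the peer's FIN. LISTEN
assumes the application accepts the connection.

```c
#include <assert.h>
#include <stdbool.h>

enum state {
    CLOSED, LISTEN, SYN_SENT, SYN_RECEIVED, ESTABLISHED,
    FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK, TIME_WAIT
};

/* Summary of an already validated incoming packet (hypothetical struct). */
struct rx_event {
    bool syn;          /* SYN flag is set */
    bool ack_of_syn;   /* the packet acknowledges our SYN or SYN|ACK */
    bool ack_of_fin;   /* the packet acknowledges our FIN */
    bool fin;          /* FIN flag is set */
};

static enum state next_state_on_rx(enum state s, struct rx_event e) {
    assert(s >= CLOSED && s <= TIME_WAIT);
    switch (s) {
    case LISTEN:
        return e.syn ? SYN_RECEIVED : LISTEN;
    case SYN_SENT:
        return (e.syn && e.ack_of_syn) ? ESTABLISHED : SYN_SENT;
    case SYN_RECEIVED:
        return e.ack_of_syn ? ESTABLISHED : SYN_RECEIVED;
    case ESTABLISHED:
        return e.fin ? CLOSE_WAIT : ESTABLISHED;
    case FIN_WAIT_1:
        if (e.fin) {
            return e.ack_of_fin ? TIME_WAIT : CLOSING;
        }
        return e.ack_of_fin ? FIN_WAIT_2 : FIN_WAIT_1;
    case FIN_WAIT_2:
        return e.fin ? TIME_WAIT : FIN_WAIT_2;
    case CLOSING:
        return e.ack_of_fin ? TIME_WAIT : CLOSING;
    case LAST_ACK:
        return e.ack_of_fin ? CLOSED : LAST_ACK;
    case CLOSED:       /* answered with RST, state unchanged */
    case CLOSE_WAIT:   /* only an ACK is returned */
    case TIME_WAIT:    /* only the conn timeout is reset */
    default:
        return s;
    }
}
```
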
SEND PACKET
-----------

- Put the packet in the send buffer.
- Decide how much to send:
  - Not more than receive window allows
  - Not more than congestion window allows
- Segmentize and send the packets
- At the end, snd.nxt is advanced by the number of bytes sent
- Set the rtrx and conn timers if they have not been set

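A rough sketch of the sizing and segmenting steps follows. It assumes a
hypothetical struct send_state in which peer_wnd is the receive window
advertised by the peer, unsent counts buffered bytes that were never
transmitted, and mss is the largest payload per packet; none of these names
come from the UTCP sources. Arming the rtrx and conn timers is left out.

```c
#include <assert.h>
#include <stdint.h>

struct send_state {
    uint32_t snd_una, snd_nxt;
    uint32_t unsent;     /* buffered bytes not yet transmitted (hypothetical) */
    uint32_t peer_wnd;   /* receive window advertised by the peer (assumed meaning) */
    uint32_t snd_cwnd;   /* congestion window in bytes */
    uint32_t mss;        /* maximum payload per packet (hypothetical) */
};

/* Callback that puts one packet on the wire. */
typedef void (*emit_fn)(uint32_t seq, uint32_t len, void *ctx);

static uint32_t min_u32(uint32_t a, uint32_t b) {
    return a < b ? a : b;
}

/* Bytes of new data that both windows allow us to send right now. */
static uint32_t sendable_bytes(const struct send_state *c) {
    uint32_t in_flight = c->snd_nxt - c->snd_una;
    uint32_t limit = min_u32(c->peer_wnd, c->snd_cwnd);
    uint32_t room = limit > in_flight ? limit - in_flight : 0;
    return min_u32(c->unsent, room);
}

/* Segmentize and send; snd.nxt advances by the number of bytes sent.
 * Returns the number of packets handed to emit(). */
static unsigned send_new_data(struct send_state *c, emit_fn emit, void *ctx) {
    assert(c->mss > 0);
    uint32_t left = sendable_bytes(c);
    unsigned packets = 0;
    while (left > 0) {
        uint32_t len = min_u32(left, c->mss);
        emit(c->snd_nxt, len, ctx);
        c->snd_nxt += len;
        c->unsent -= len;
        left -= len;
        packets++;
    }
    return packets;
}

/* Example callback: only totals the payload bytes it is given. */
static void count_bytes(uint32_t seq, uint32_t len, void *ctx) {
    (void)seq;
    *(uint32_t *)ctx += len;
}
```
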
RETRANSMIT
----------

- Decide how much to send:
  - Not more than we have in the send buffer
  - Not more than receive window allows
  - Not more than congestion window allows
- Segmentize and send packets
- Sequence numbers are not advanced
- Reset the rtrx timer

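Unlike a normal send, a retransmission starts at snd.una and leaves snd.nxt
alone. The sketch below only picks the byte range to resend. It assumes that
"what we have in the send buffer" means the bytes that were already sent but
not yet ACKed, and it uses made-up names (struct rtx_state, peer_wnd,
retransmit_range).

```c
#include <assert.h>
#include <stdint.h>

struct rtx_state {
    uint32_t snd_una, snd_nxt;
    uint32_t peer_wnd;   /* receive window advertised by the peer (assumed meaning) */
    uint32_t snd_cwnd;   /* congestion window in bytes */
};

struct rtx_range {
    uint32_t seq;   /* first sequence number to resend */
    uint32_t len;   /* number of bytes to resend */
};

/* Choose what to retransmit. The state is const: no sequence number moves. */
static struct rtx_range retransmit_range(const struct rtx_state *c) {
    assert((int32_t)(c->snd_nxt - c->snd_una) >= 0);   /* snd.una never passes snd.nxt */
    uint32_t unacked = c->snd_nxt - c->snd_una;        /* wraparound-safe */
    uint32_t limit = c->peer_wnd < c->snd_cwnd ? c->peer_wnd : c->snd_cwnd;
    struct rtx_range r = { c->snd_una, unacked < limit ? unacked : limit };
    return r;
}
```
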
RECEIVE PACKET
--------------

1 Drop invalid packets:
  a Invalid flags or state
  b ACK not set (it must always be set, except on SYN or RST packets)
  c hdr.seq not within our receive window
  d hdr.ack ahead of snd.nxt or behind snd.una
2 Handle RST packets
3 Advance snd.una?
  a reset conn timer if so
  b check if our SYN or FIN has been ACKed
  c check if any data has been ACKed
    - remove ACKed data from send buffer
    - increase cwnd
  d no advance? NewReno
4 If snd.una == snd.nxt, clear rtrx and conn timer
5 Process state changes due to SYN
6 Send new data to application
7 Process state changes due to FIN

CONGESTION AVOIDANCE
--------------------

We want to send as many packets as possible without causing any packets to be
dropped.  So we should not send more than the available bandwidth, and not more
in one go than buffers along the path can handle.

To start, we use "self-clocking". We send one packet, and wait for an ACK
before sending another packet. On a network with a finite bandwidth but zero
delay (latency), this will send packets as efficiently as possible. We don't
need any timers to control the outgoing packet rate; that's why we call this
self-clocked. However, latency is non-zero, and this means a number of packets
are always on the way between the sender and receiver. The number of packets
"in between" is in principle the bandwidth times the delay (bandwidth-delay
product, or BDP).

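As a worked example of the bandwidth-delay product: a path with 10 Mbit/s of
bandwidth and 50 ms of one-way delay holds 10,000,000 * 0.05 / 8 = 62,500
bytes, or 50 packets of 1250 bytes. The helper functions below are
illustrative only; their names and units are chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product in bytes: (bits per second) * (seconds) / 8,
 * with the delay given in microseconds. */
static uint64_t bdp_bytes(uint64_t bandwidth_bps, uint64_t delay_us) {
    /* Guard against overflow of the 64-bit product. */
    assert(delay_us == 0 || bandwidth_bps <= UINT64_MAX / delay_us);
    return bandwidth_bps * delay_us / 8000000u;
}

/* The same quantity expressed in packets of packet_bytes each, rounded up. */
static uint64_t bdp_packets(uint64_t bandwidth_bps, uint64_t delay_us,
                            uint32_t packet_bytes) {
    assert(packet_bytes > 0);
    return (bdp_bytes(bandwidth_bps, delay_us) + packet_bytes - 1) / packet_bytes;
}
```
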
Delay is fairly easy to measure (equal to half the round-trip time of a packet,
which in TCP is easily obtained from the SYN and SYN|ACK pair, or the ACK in
response to a segment); bandwidth, however, is more difficult to measure and
might change more rapidly than the latency.

Back to the "in between" packets: ideally we would like to fill the available
in-between space completely. It should be easy to see that in that case,
self-clocking will still work as intended. Our estimate of the number of
packets in the in-between space is called the congestion window (CWND).  If we
know the BDP, we can set the CWND to it; if we don't know it, we can
start with a small CWND and gradually increase it (for example, every time we
receive an ACK, send the next 2 segments). At some point, we will start sending
at a higher rate than the available bandwidth, in which case packets will
inevitably be lost. We detect that because we do not receive an ACK for our
data, and then we have to reduce the CWND (for example, by half).

The trick is to choose an algorithm that best keeps the CWND at the effective
BDP.

A nice introduction is RFC 2001.

snd.cwnd: size of the congestion window.
snd.nxt - snd.una: number of unacknowledged bytes, = number of bytes in flight.
snd.cwnd - (snd.nxt - snd.una): unused size of congestion window
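
The grow-on-ACK and halve-on-loss idea, together with the three quantities
just listed, can be sketched as follows. This is deliberately the simplest
possible scheme (one extra segment per ACK, halving on loss) and not
necessarily the algorithm UTCP ends up using. struct cc, mss, max_cwnd and
the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cc {
    uint32_t snd_una, snd_nxt;
    uint32_t snd_cwnd;   /* congestion window in bytes */
    uint32_t mss;        /* segment size in bytes (hypothetical) */
};

/* snd.cwnd - (snd.nxt - snd.una), clamped at zero. */
static uint32_t cwnd_unused(const struct cc *c) {
    uint32_t in_flight = c->snd_nxt - c->snd_una;
    return c->snd_cwnd > in_flight ? c->snd_cwnd - in_flight : 0;
}

/* New data was ACKed: grow by one segment. Together with the segment that
 * just left the network, this lets each ACK release two new segments. */
static void cwnd_on_ack(struct cc *c, uint32_t max_cwnd) {
    assert(max_cwnd >= c->mss);
    if (c->snd_cwnd > max_cwnd - c->mss) {
        c->snd_cwnd = max_cwnd;
    } else {
        c->snd_cwnd += c->mss;
    }
}

/* Loss detected (no ACK arrived in time): halve the window, but never go
 * below one segment so the connection can keep making progress. */
static void cwnd_on_loss(struct cc *c) {
    c->snd_cwnd /= 2;
    if (c->snd_cwnd < c->mss) {
        c->snd_cwnd = c->mss;
    }
}
```
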