This is a light-weight, user-space implementation of RFC 793 (TCP), without any reliance on an IP layer. It can be used to provide multiple in-order, reliable streams on top of any datagram layer. UTCP does not rely on a specific event system. Instead, the application feeds it with incoming packets using utcp_recv(), and outgoing data for the streams using utcp_send(). Most of the rest is handled by callbacks. The application must however call utcp_timeout() regularly to have UTCP handle packet loss. The application should run utcp_init() for every peer it wants to communicate with. DIFFERENCES FROM RFC 793: * No checksum. UTCP requires the application to handle packet integrity. * 32-bit window size. Big window sizes are the default. * No ECN, PSH, URG TODO v1.0: * Implement send buffer * Window scaling * Handle retransmission - Proper timeout handling TODO v2.0: * Nagle (add PSH back to signal receiver that now we want an immediate ACK?) * NAK and SACK * Congestion window scaling * Timestamps? Future ideas: Fast open: SYN + data? Receive-only open: SYN|FIN Fast transaction: SYN|FIN + request data -> <- SYN|ACK|FIN + response data ACK -> Does this need special care or can we rely on higher level MACs? RFCs ---- 793 Transmission Control Protocol (Functional Specification) 2581 TCP Congestion Control 2988 Computing TCP's Retransmission Timer INVARIANTS ---------- - snd.una: the sequence number of the first byte we did not receive an ACK for - snd.nxt: the sequence number of the first byte after the last packet we sent (due to retransmission, this may go backwards) - snd.wnd: the number of bytes we have left in our (UTCP/application?) input buffer - snd.last: the sequence number of the last byte that was enqueued in the TCP stream (increases only monotonically) - rcv.nxt: the sequence number of the first byte after the last one we passed up to the application - rcv.wnd: the number of bytes the receiver has left in its input buffer (may be more/less than our send buffer size) - The only packets that do not have ACK set must either have SYN or RST set - Only packets received with rcv.nxt <= hdr.seq <= rcv.nxt + rcv.wnd are valid, drop others. - If it has ACK set, and it's higher than snd.una, update snd.una. But don't update it past c->snd.next. (RST in that case?) - SYN and FIN each count as one byte for the sequence numbering, but no actual byte is transferred in the payload. CONNECTION TIMEOUT ------------------ This timer is intended to catch the case when we are waiting very long for a response but nothing happens. The timeout is in the order of minutes. - The conn timeout is set whenever there is unacknowledged data, or when we are in the TIME_WAIT status. - If snd.una is advanced while the timeout is set, we re-set the timeout. - If the conn timeout expires, close the connection immediately. RETRANSMIT TIMEOUT ------------------ (See RFC 6298.) This timer is intended to catch the case where we didn't get an ACK from the peer. In principle, the timeout should be slightly longer than the maximum latency along the path. - The rtrx timer is set whenever we send a packet that must be ACKed by the peer: - when it contains data - when SYN or FIN is set - The rtrx timer is reset when we receive a packet that advances snd.una. - it is cleared when snd.una == snd.last - otherwise the timeout is set to the value of utcp->rto - If the rtrx timer expires, retransmit at least one packet, multiply the timeout by two, and rearm the timeout. The value of RTO is calculated according to the RFC. At the moment, no timestamps are added to packets. When the RTT timer is not set, start it when sending a packet. When the ACK arrives, stop the timer and use the time difference as a measured RTT value. Use the algorithm from RFC 6298 to update RTO. STATES ------ CLOSED: this connection is closed, all packets received will result in RST. RX: RST TX: return error RT: clear timers RST: ignore LISTEN: (= no connection yet): only allow SYN packets, it application does not accept, return RST|ACK, else SYN|ACK. RX: on accept, send SYNACK, go to SYN_RECEIVED TX: cannot happen RT: cannot happen RST: ignore SYN_SENT: we sent a SYN, now expecting SYN|ACK RX: must be valid SYNACK, send ACK, go to ESTABLISHED TX: put in send buffer (TODO: send SYN again with data?) RT: send SYN again SYN_RECEIVED: we received a SYN, sent back a SYN|ACK, now expecting an ACK RX: must be valid ACK, go to ESTABLISHED TX: put in send buffer (TODO: send SYNACK again with data?) RT: send SYNACK again ESTABLISHED: SYN is acked, we can now send/receive normal data. RX: process data, return ACK. If FIN set, go to CLOSE_WAIT TX: put in send buffer, segmentize and send RT: send unACKed data again FIN_WAIT_1: we want to close the connection, and just sent a FIN, waiting for it to be ACKed. RX: process data, return ACK. If our FIN is acked, go to FIN_WAIT_2, if a FIN was also received, go to CLOSING TX: return error RT: send unACKed data or else FIN again FIN_WAIT_2: our FIN is ACKed, just waiting for more data or FIN from the peer. RX: process data, return ACK. If a FIN was also received, go to CLOSING TX: return error RT: should not happen, clear timeouts CLOSE_WAIT: we received a FIN, we sent back an ACK RX: only return an ACK. TX: put in send buffer, segmentize and send RT: send unACKed data again CLOSING: we had already sent a FIN, and we received a FIN back, now waiting for it to be ACKed. RX: if it's ACKed, set conn timeout, go to TIME_WAIT TX: return an error RT: send unACKed data or else FIN again LAST_ACK: we are waiting for the last ACK before we can CLOSE RX: if it's ACKed, go to CLOSED TX: return an error RT: send FIN again TIME_WAIT: connection is in princple closed, but our last ACK might not have been received, so just wait a while to see if a FIN gets retransmitted so we can resend the ACK. RX: if we receive anything, reset conn timeout. TX: return an error RT: should not happen, clear rtrx timeout SEND PACKET ----------- - Put the packet in the send buffer. - Decide how much to send: - Not more than receive window allows - Not more that congestion window allows - Segmentize and send the packets - At the end, snd.nxt is advanced with the number of bytes sent - Set the rtrx and conn timers if they have not been set RETRANSMIT ---------- - Decide how much to send: - Not more than we have in the send buffer - Not more than receive window allows - Not more that congestion window allows - Segmentize and send packets - No advancement of sequence numbers happen - Reset the rtrx timers RECEIVE PACKET -------------- 1 Drop invalid packets: a Invalid flags or state b ACK always set c hdr.seq not within our receive window d hdr.ack ahead of snd.nxt or behind snd.una 2 Handle RST packets 3 Advance snd.una? a reset conn timer if so b check if our SYN or FIN has been acked c check if any data been acked - remove ACKed data from send buffer - increase cwnd d no advance? NewReno 4 If snd.una == snd.nxt, clear rtrx and conn timer 5 Process state changes due to SYN 6 Send new data to application 7 Process state changes due to FIN CONGESTION AVOIDANCE -------------------- We want to send as much packets as possible that won't cause any packets to be dropped. So we should not send more than the available bandwidth, and not more in one go than buffers along the path can handle. To start, we use "self-clocking". We send one packet, and wait for an ACK before sending another packet. On a network with a finite bandwidth but zero delay (latency), this will send packets as efficiently as possible. We don't need any timers to control the outgoing packet rate, that's why we call this self-clocked. However, latency is non-zero, and this means a number of packets is always on the way between the sender and receiver. The amount of packets "inbetween" is in principle the bandwidth times the delay (bandwidth-delay product, or BDP). Delay is fairly easy to measure (equal to half the round-trip time of a packet, which in TCP is easily obtained from the SYN and SYNACK pair, or the ACK in response of a segment), however bandwidth is more difficult and might change more rapidly than the latency. Back to the "inbetween" packets: ideally we would like to fill the available inbetween space completely. It should be easy to see that in that case, self-clocking will still work as intended. Our estimate of the amount of packets in the inbetween space is called the congestion window (CWND). If we know the BDP, we can set the CWND to it, however if we don't know it, we can start with a small CWND and gradually increase it (for example, every time we receive an ACK, send the next 2 segments). At some point, we will start sending at a higher rate than the available bandwidth, in which case packets will inevitably be lost. We detect that because we do not receive an ACK for our data, and then we have to reduce the CWND (for example, by half). The trick is to choose an algorithm that best keeps the CWND to the effective BDP. A nice introduction is RFC 2001. snd.cwnd: size of the congestion window. snd.nxt - snd.una: number of unacknowledged bytes, = number of bytes in flight. snd.cwnd - (snd.nxt - snd.una): unused size of congestion window