Guus Sliepen [Thu, 29 Oct 2020 22:38:22 +0000 (23:38 +0100)]
Also send the blacklist notification when we already have a connection.
Instead of just closing the connection, and having to wait for the
reconnection to happen to send the blacklist notification, we do it
immediately when meshlink_blacklist() is called.
Guus Sliepen [Sun, 25 Oct 2020 21:17:29 +0000 (22:17 +0100)]
Check blacklist status before committing an invitation.
Although we delete invitation files when blacklisting a node, there is a
race condition where an invitation connection is created right before the
invitee is blacklisted. So check whether the node is blacklisted right before
committing the node config file to disk.
Guus Sliepen [Sun, 11 Oct 2020 14:16:31 +0000 (16:16 +0200)]
When a new connection is activated, terminate any pending connections to the same peer.
This prevents issues mainly in the test suite where peers try to connect to
each other simultaneously, and have to terminate one of the connections.
Before, both connections would succeed and both would then be terminated, leading
to a loop of reconnections until enough randomness got in to break the tie.
Guus Sliepen [Sun, 11 Oct 2020 13:40:34 +0000 (15:40 +0200)]
Don't reset the UDP SPTPS session when a node becomes reachable.
Only do this when it becomes unreachable. This fixes an issue where right
after a meta-connection is established, the initiator sends a proactive
REQ_KEY, before the peer really becomes reachable according to the graph.
When the latter happened, the session established so far would be reset, causing a new
REQ_KEY to be sent, which could cross the ANS_KEY from the peer. This would
resolve itself after a few seconds, but causes an unnecessary delay that is
easy to trigger.
Closing a channel while there was data in the receive buffer would cause a
RST to be sent instead of a FIN. We now always send a FIN, and leave data
in the receive buffer to be dealt with by later data handling (which would
then send a RST if necessary).
The RST could be dropped if the ACK seqno was not in the correct range.
We now always accept RSTs for established connections.
Finally, when receiving more data after closing the channel, we would just
accept the data but discard it, instead of sending a RST back. Now we do
send a RST back.
Send RST packets when receiving data after we closed a UDP channel.
If the application closed a channel, we keep the UTCP connection alive for
a bit longer to handle resends of FIN packets. However, if this is missed
for some reason, either because the FIN got lost or the peer ignored the
receive callback, and the peer is sending new data, we need to inform it
that we are no longer listening. To do this, send a RST back.
Don't use fast timeouts for fully established connections.
During the fast retry period, we want to have a fast ping timeout until we have
a fully working connection. However, the code still used fast timeouts during
the fast retry window even if the connection was fully established.
Allow sptps_force_kex() while a key exchange is in progress
We should not do anything if we are already exchanging a new key, and
just return true. This change prevents higher layers in MeshLink from
terminating a connection between two nodes if both peers call
sptps_force_kex() at nearly the same time.
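
The behaviour described above can be sketched as a small state check. This is a
minimal illustration, not the actual SPTPS code: session_t, kex_state_t and
session_force_kex() are hypothetical names, and returning false for a session
that is not yet established is an assumption made for the example.

```c
#include <stdbool.h>

/* Hypothetical, simplified session state; not the real sptps_t. */
typedef enum { KEX_NONE, KEX_IN_PROGRESS } kex_state_t;

typedef struct {
    bool established;   /* the initial handshake has completed */
    kex_state_t state;  /* whether a key exchange is currently running */
    int kex_started;    /* number of key exchanges actually initiated */
} session_t;

/* Request a new key exchange. Calling this while an exchange is already
 * running is treated as success and does nothing. */
bool session_force_kex(session_t *s) {
    if(!s->established) {
        return false;   /* assumption: nothing to renew yet */
    }

    if(s->state == KEX_IN_PROGRESS) {
        return true;    /* already exchanging a key: not an error */
    }

    s->state = KEX_IN_PROGRESS;
    s->kex_started++;
    return true;
}
```

If both peers call the function at nearly the same time, the second call on
each side becomes a no-op that still reports success, so the caller has no
reason to tear down the connection.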
Use the canonical address exclusively for making outgoing meta-connections.
If we have a node's canonical address, we now always use that as a source
for addresses for outgoing meta-connection attempts. This commit also adds
the function meshlink_clear_canonical_address() to ensure the canonical
address can be removed if it is no longer valid.
Guus Sliepen [Tue, 4 Aug 2020 13:24:07 +0000 (15:24 +0200)]
Remove temporary files at startup.
When something happens while a host config file is being written, a temporary
file might be left over. Clean these up when we find them when starting
MeshLink.
The accept callback is called when the peer has already fully established a
connection. The listen callback is called earlier, when there is no
fully established channel yet. However, the listen callback itself does not
get a channel handle; it can only decide, based on the peer node and port
number, whether to accept the channel. If it does, the accept callback
will be called later.
Always let the initiator send a REQ_KEY once a connection is activated.
Before, the logic was to do this when the graph reported a bidirectional
edge. However, if two nodes connected to each other simultaneously, a second
connection could be activated while the first was still active, which
caused the REQ_KEY not to be sent.
This function is similar to meshlink_set_node_status_cb(), except that
this callback will only be called when a meta-connection to a node is
activated or terminated. This is mainly useful for the test suite.
Fix invitation URL generation when running in a network namespace.
MeshLink could call getifaddrs() in the namespace of the caller instead of
the MeshLink thread, causing the wrong addresses to be put in the invitation
URL.
Don't use assert() to check the results of pthread_*() calls.
This was done to debug the code, but it fails when MeshLink is compiled with
-DNDEBUG. Remove all assert()s from calls to pthread functions, and instead
add explicit checks to only those functions that can fail.
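
The underlying C behaviour is that assert() expands to nothing when NDEBUG is
defined, so a call placed inside it is never executed. A minimal sketch of the
difference follows; fake_lock() is a hypothetical stand-in for a pthread call
such as pthread_mutex_lock(), and NDEBUG is defined inside the file only to
simulate a -DNDEBUG build.

```c
#define NDEBUG              /* simulate compiling with -DNDEBUG */
#include <assert.h>
#include <stdbool.h>

int calls = 0;              /* how often fake_lock() really ran */

/* Hypothetical stand-in for a pthread function returning 0 on success. */
int fake_lock(void) {
    calls++;
    return 0;
}

/* With NDEBUG defined, the whole expression disappears, including the call. */
void lock_with_assert(void) {
    assert(fake_lock() == 0);
}

/* An explicit check always performs the call and reports failure. */
bool lock_with_check(void) {
    return fake_lock() == 0;
}

/* Restore the normal assert() for any code that follows this sketch. */
#undef NDEBUG
#include <assert.h>
```

Calling lock_with_assert() leaves calls at 0, while lock_with_check() raises it
to 1 and returns true.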
Guus Sliepen [Sun, 14 Jun 2020 12:45:18 +0000 (14:45 +0200)]
React faster to network changes, including point-to-point links.
Tell Catta to also include point-to-point links, and when we get an
update from the Catta thread, wake up the main MeshLink thread so we
react to it immediately.
Guus Sliepen [Thu, 11 Jun 2020 19:52:00 +0000 (21:52 +0200)]
Use atomic operations to check whether to write to the signal pipe.
We need to do an atomic test-and-set operation to check whether we can
avoid writing to the signal pipe. Use C11 atomics to do this in a portable
way (hopefully).
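
A minimal sketch of the test-and-set idea, using the C11 atomic_flag type.
The wakeup_t structure and the wakeup_*() functions are hypothetical and only
illustrate the technique; they are not MeshLink's signal handling code.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical wakeup channel built on a pipe. */
typedef struct {
    int pipefd[2];        /* [0] is the read end, [1] is the write end */
    atomic_flag pending;  /* set while a wakeup has been requested */
} wakeup_t;

bool wakeup_init(wakeup_t *w) {
    atomic_flag_clear(&w->pending);
    return pipe(w->pipefd) == 0;
}

/* May be called from any thread. Returns true if a byte was written. */
bool wakeup_notify(wakeup_t *w) {
    /* test-and-set returns the previous value: if the flag was already
     * set, a wakeup is pending and the write can be skipped. */
    if(atomic_flag_test_and_set(&w->pending)) {
        return false;
    }

    const char byte = 0;
    return write(w->pipefd[1], &byte, 1) == 1;
}

/* Called by the event loop when the read end is readable. */
bool wakeup_consume(wakeup_t *w) {
    /* Clear before reading: a notification that races with us then writes
     * an extra byte (a spurious wakeup) instead of being lost. */
    atomic_flag_clear(&w->pending);

    char byte;
    return read(w->pipefd[0], &byte, 1) == 1;
}
```

Two back-to-back wakeup_notify() calls write only one byte; after
wakeup_consume() the next notification writes again.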
Guus Sliepen [Thu, 11 Jun 2020 20:17:23 +0000 (22:17 +0200)]
Add assert()s to all pthread-related function calls.
We normally expect all pthread-related functions to succeed, so in all
places where we didn't already explicitly check the return value, assert()
that the functions return 0.
Guus Sliepen [Wed, 10 Jun 2020 20:25:12 +0000 (22:25 +0200)]
Properly initialize mutexes and condition variables.
On Linux, zeroing a pthread_mutex_t or pthread_cond_t variable ensures
the mutex/cond is properly initialized; however, this is not the case on
some other platforms. Ensure we always call pthread_mutex/cond_init().
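
A minimal sketch of explicit initialization, assuming a hypothetical
sync_state_t structure that holds one mutex and one condition variable:

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical structure, for illustration only. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool ready;
} sync_state_t;

/* Initialize both objects explicitly instead of relying on zeroed memory. */
bool sync_state_init(sync_state_t *s) {
    if(pthread_mutex_init(&s->mutex, NULL) != 0) {
        return false;
    }

    if(pthread_cond_init(&s->cond, NULL) != 0) {
        pthread_mutex_destroy(&s->mutex);  /* undo the first step */
        return false;
    }

    s->ready = false;
    return true;
}

/* Release both objects in reverse order of initialization. */
void sync_state_exit(sync_state_t *s) {
    pthread_cond_destroy(&s->cond);
    pthread_mutex_destroy(&s->mutex);
}
```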
Guus Sliepen [Fri, 5 Jun 2020 16:09:38 +0000 (18:09 +0200)]
Fix meshlink_join() failing on Android.
The adns_blocking_request() function did not pass a hint to
getaddrinfo(). With glibc, the resulting struct addrinfo sets socktype
and protocol to SOCK_STREAM and IPPROTO_TCP, and the call to connect()
copied these values. However, bionic doesn't set the socktype and
protocol to those values if no hint was specified.
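
A minimal sketch of passing an explicit hint to getaddrinfo(), so the result
does not depend on what a particular libc fills in by default. The helper
names make_tcp_hints() and resolve_tcp() are hypothetical; this is not the
code of adns_blocking_request().

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Build a hint that asks for TCP stream sockets of any address family. */
struct addrinfo make_tcp_hints(void) {
    struct addrinfo hints;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_protocol = IPPROTO_TCP;
    return hints;
}

/* Resolve host and service; returns NULL on failure.
 * The caller frees the result with freeaddrinfo(). */
struct addrinfo *resolve_tcp(const char *host, const char *service) {
    struct addrinfo hints = make_tcp_hints();
    struct addrinfo *result = NULL;

    if(getaddrinfo(host, service, &hints, &result) != 0) {
        return NULL;
    }

    return result;
}
```

With the hint in place, the ai_socktype and ai_protocol fields of each result
can be passed on to socket() without relying on libc-specific defaults.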
Guus Sliepen [Thu, 21 May 2020 12:48:02 +0000 (14:48 +0200)]
Explicitly set the stack size for the MeshLink thread.
Different libcs have different default stack sizes for newly created threads. In
particular, Musl defaults to 80 kB, which is too small for MeshLink. We now
request 1 MB, which should be more than enough to handle the deepest call
stacks.
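
A minimal sketch of requesting a 1 MB stack through a pthread attribute
object. The names THREAD_STACK_SIZE, start_thread_with_stack() and
example_thread() are hypothetical and chosen for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define THREAD_STACK_SIZE (1024 * 1024)  /* 1 MB, as in the commit message */

/* Create a thread with an explicit stack size instead of the libc default. */
bool start_thread_with_stack(pthread_t *thread, void *(*fn)(void *), void *arg) {
    pthread_attr_t attr;

    if(pthread_attr_init(&attr) != 0) {
        return false;
    }

    bool ok = pthread_attr_setstacksize(&attr, THREAD_STACK_SIZE) == 0
              && pthread_create(thread, &attr, fn, arg) == 0;

    /* The attribute object is no longer needed once the thread exists. */
    pthread_attr_destroy(&attr);
    return ok;
}

/* Trivial thread body for demonstration: sets the int that arg points to. */
void *example_thread(void *arg) {
    *(int *)arg = 1;
    return arg;
}
```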
Guus Sliepen [Fri, 15 May 2020 21:12:34 +0000 (23:12 +0200)]
Include our own key in REQ_PUBKEY requests.
If we don't know a peer's public key, it most likely means the peer
doesn't know our public key, so proactively send it along with the
REQ_PUBKEY request.
Before, we allowed buf->offset to be equal to buf->size. This caused an
issue where buffer_call() would call the callback twice, once for 0
bytes at the end of the buffer, and once for len bytes at the start of
the buffer. This would cause the callback function to think the channel
had encountered an error.
If the data in the ringbuffer wraps around, and we call the receive
callback for the first part of the data, the callback function might
close the channel, so we must not call the callback for the second part
of the data.
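
The two ring buffer fixes above can be sketched together. This is a simplified
illustration under assumptions: ringbuf_t, recv_cb_t and the functions below
are hypothetical, and the callback is assumed to return false when it has
closed the channel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ring buffer; invariant: offset < size. */
typedef struct {
    char *data;
    size_t size;    /* capacity in bytes */
    size_t offset;  /* start of the valid data */
    size_t used;    /* number of valid bytes */
} ringbuf_t;

/* Returns false if the callback closed the channel. */
typedef bool (*recv_cb_t)(const char *data, size_t len, void *ctx);

/* Deliver up to len bytes; returns how many times the callback ran. */
int ringbuf_deliver(ringbuf_t *buf, size_t len, recv_cb_t cb, void *ctx) {
    if(len > buf->used) {
        len = buf->used;
    }

    if(!len) {
        return 0;
    }

    /* Because offset < size, the first part is never 0 bytes long. */
    size_t first = buf->size - buf->offset;

    if(first > len) {
        first = len;
    }

    if(!cb(buf->data + buf->offset, first, ctx) || len == first) {
        return 1;  /* channel closed, or the data did not wrap around */
    }

    cb(buf->data, len - first, ctx);
    return 2;
}

/* Drop len bytes from the front, wrapping offset so it stays below size. */
void ringbuf_discard(ringbuf_t *buf, size_t len) {
    if(len > buf->used) {
        len = buf->used;
    }

    buf->offset += len;

    if(buf->offset >= buf->size) {
        buf->offset -= buf->size;
    }

    buf->used -= len;
}

/* Example callbacks: both add len to a size_t counter passed via ctx. */
bool count_bytes(const char *data, size_t len, void *ctx) {
    (void)data;
    *(size_t *)ctx += len;
    return true;
}

bool close_on_first(const char *data, size_t len, void *ctx) {
    (void)data;
    *(size_t *)ctx += len;
    return false;  /* pretend the channel was closed in the callback */
}
```

With an 8-byte buffer holding 4 bytes starting at offset 6, count_bytes() is
called twice (2 bytes, then 2 bytes), whereas close_on_first() is called only
once.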
Guus Sliepen [Mon, 11 May 2020 17:52:00 +0000 (19:52 +0200)]
Move UTCP into the MeshLink repository.
UTCP is not used outside of MeshLink at the moment, and there is a tight
coupling between the two, so it makes more sense to have it as part of
MeshLink itself.
Guus Sliepen [Fri, 8 May 2020 10:48:44 +0000 (12:48 +0200)]
Handle meshlink_channel_close() being called in callbacks.
When it's called in a callback, we can't free the channel until the
function that called the callback has a chance to safely complete. This
is not a problem for regular receive and poll callbacks, but it is for AIO,
where there can be multiple outstanding AIO buffers that each need their
callback called to signal completion, and each of them could potentially
call meshlink_channel_close().
This also ensures that when the channel is explicitly closed by the
application, it will not receive any further callbacks.
The event loop was assuming that a timespec value of {0, 0} meant that the
timer was not added to the timer tree. However, it was possible for other
parts of the code to set the value to {0, 0}, which could result in a
segmentation fault. Use the splay_node_t data pointer to check whether a
timeout is linked into the tree instead.
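
A minimal sketch of the idea, assuming simplified stand-in types: tree_node_t
takes the place of splay_node_t here, and the timeout_*() helpers are
hypothetical. The point is that membership in the tree is tracked by the
node's data pointer, independently of the timer value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Simplified stand-in for splay_node_t; only the data pointer matters here. */
typedef struct {
    void *data;
} tree_node_t;

/* Hypothetical timer structure. */
typedef struct {
    struct timespec tv;  /* expiry time; {0, 0} is a legal value */
    tree_node_t node;    /* data is non-NULL only while linked in the tree */
} timeout_t;

bool timeout_is_linked(const timeout_t *t) {
    return t->node.data != NULL;
}

void timeout_link(timeout_t *t) {
    t->node.data = t;    /* real code would also insert into the tree */
}

void timeout_unlink(timeout_t *t) {
    t->node.data = NULL; /* real code would also remove from the tree */
}
```

A timer whose tv has been set to {0, 0} by other code is still reported as
linked until timeout_unlink() is called.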
Several fixes for channel AIO send and receive functions.
- Process multiple buffers if possible
- Better handling of error conditions
- fd errors now cancel the AIO buffer
- channel errors cancel all outstanding AIO buffers
- Don't call the poll callback with a length larger than the remaining
UTCP send buffer.
Make UTCP retransmissions trigger PMTU probes immediately.
If there are network problems while data is being transferred over a
channel, we want to react to this as soon as possible. Set the retransmission
callback to trigger the next PMTU probe immediately if there is none in
progress.
Allow meshlink_open() to be called with a NULL name.
This will use the name used last time the MeshLink instance was initialized.
If there is no initialized instance at the given confbase, it will return
an error.
Opening an instance with a different name than the one in the configuration
files will now also result in an error.
When resetting timers that use CLOCK_MONOTONIC, use a negative value.
CLOCK_MONOTONIC might be implemented as the time since the CPU booted, so
if MeshLink starts soon after booting, setting timers to "0" might not
actually be far enough in the past to trigger a timeout.
This has almost no effect in practice, since most timeouts are a minute or
less, but it might affect running tests in virtual machines.
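
A small worked sketch of the problem. The helper names and the value of -3600
seconds are assumptions made for the illustration; the commit message does not
say which negative value is used.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: has more than `timeout` seconds passed since `last`? */
bool timed_out(const struct timespec *last, const struct timespec *now, time_t timeout) {
    return now->tv_sec - last->tv_sec > timeout;
}

/* Resetting to 0 only works if the monotonic clock is already far from 0. */
void reset_to_zero(struct timespec *tv) {
    tv->tv_sec = 0;
    tv->tv_nsec = 0;
}

/* Resetting to a negative value is in the past even right after boot.
 * The value of one hour is an arbitrary choice for this sketch. */
void reset_to_negative(struct timespec *tv) {
    tv->tv_sec = -3600;
    tv->tv_nsec = 0;
}
```

With a monotonic clock reading of 10 seconds (shortly after boot) and a
60-second timeout, a timer reset to 0 has not expired yet (10 - 0 is not
greater than 60), while one reset to -3600 has (10 + 3600 is greater than 60).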