March 2nd outage and upgrades

Details of the outage that occurred during an upgrade in early March 2017.
 

Starting yesterday (Thursday March 2nd, 2017) at around 20:00 EST (UTC-0500), the InIRC network witnessed a major netsplit, as one of the servers (che.indymedia.org) couldn’t link back with the other two. This was triggered by a reboot performed by Koumbit to do security upgrades. When trying to restore the network to proper order, I discovered various faults related to a configuration file refactoring that I had performed two days earlier but that didn’t take effect until that reboot.

The exact cause(s) of the netsplits are still unclear. The ircd (Charybdis) is not very well maintained anymore and the main development branch (“Charybdis 4”) has actually been abandoned upstream. It could also be that the Charybdis upgrades performed on Feb. 28th were the cause of some of the problems.

The following day, starting at around 10:00 EST, many different solutions were tried, but in the end the fix was to revert the configuration files to how they were before the refactoring. This was done using git checkout in /etc, which caused massive permission problems that had to be fixed (with the magic .etckeeper file) to restore proper operation of the SSH server. That allowed two of the three servers (che and chat1) to be relinked, but we still couldn’t link all servers: when chat0 joined, it kicked chat1 out. The “hub” configuration was then changed to make chat1.koumbit.net the hub and the other two nodes only “leaf” servers (hub=no). Eventually, chat0.koumbit.net was promoted to hub=yes as well, without obvious adverse effects; che was left at a hub=no configuration. The network was relinked correctly at around 12:30 EST.

One of the side-effects of the Charybdis upgrade is that TLSv1.0 is deprecated. If you are running on older platforms, you may run into problems negotiating TLS connections. Make sure your client supports at least TLSv1.1. In particular, we have had reports of issues with stunnel on Debian 7 “wheezy” with a hardcoded sslVersion = TLSv1 setting.
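For stunnel users who want to check their own setup, here is a minimal sketch in Python that scans the text of a stunnel configuration for an sslVersion pinned to something older than TLSv1.1. The function name is made up for illustration, and the comment handling (lines starting with ";" or "#") is an assumption about how your file is written; treat it as a quick check, not a full stunnel.conf parser.

```python
import re

# sslVersion values that predate TLSv1.1 (the minimum mentioned above)
OBSOLETE_VERSIONS = {"SSLv2", "SSLv3", "TLSv1"}

# Matches lines such as "sslVersion = TLSv1"; the option name is matched
# case-insensitively, which is an assumption made for convenience.
SSL_VERSION_RE = re.compile(r"^\s*sslVersion\s*=\s*(\S+)", re.IGNORECASE)


def find_obsolete_ssl_versions(conf_text):
    """Return a list of (line_number, value) for sslVersion lines pinned to an old protocol."""
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comment lines (assumed to start with ';' or '#')
        if not stripped or stripped[0] in ";#":
            continue
        match = SSL_VERSION_RE.match(line)
        if match and match.group(1) in OBSOLETE_VERSIONS:
            hits.append((number, match.group(1)))
    return hits
```

To use it, read your stunnel configuration file into a string and pass it in; any line it reports is a candidate for changing to TLSv1.1 or newer, or for removing the pin altogether.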

A positive effect of the upgrades is that CertFP should now work on all servers.

Remaining issues:
1. some stunnel users had problems connecting
2. some users disconnect with the server link error we had before (Read error: Error in the pull function.) or another TLS error (Read error: The TLS connection was non-properly terminated.)
3. we are not sure servers could survive a full relink
4. configurations are not in sync. ircd.conf still varies slightly between servers

Solutions:
1. upgrade your shit. sslVersion = TLSv1 means TLSv1.0 AKA RFC 2246, released in 1999. That’s 18 years ago. v1.1 was published in 2006, and is supported in OpenSSL 1.0.1, which is shipped in Wheezy, so just upgrade your stuff.
2. it seems this is mostly happening with users that already had connectivity issues, mostly due to unreliable connections. it’s just a new error message because upstream bumped up error reporting
3. relink everything with ping_time=20s and be more patient with relinking. also, upstream proposed testing mbedtls (backport available in jessie) or we could try to cherry-pick some patches (build without aaf6039, 65b9b1d, and 8d0153f, or build from 2afd965 with gnutls.c from 3.5.5) to test if there’s a linking regression
4. we’d like to have our ircd.conf more uniform. we could complete that merge one server at a time, progressively.
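For the last point, one way to see how far apart the configurations are before merging is to diff them pairwise. The following is only a sketch using Python's standard difflib; the function name is hypothetical, and it assumes you have already copied each server's ircd.conf contents into strings.

```python
import difflib


def config_drift(name_a, text_a, name_b, text_b):
    """Return the unified-diff lines between two servers' ircd.conf contents.

    An empty list means the two configurations are identical.
    """
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",  # inputs are already split, so no trailing newlines
        )
    )
```

Some differences are expected to remain between servers (the server name and the hub setting, for example), so the goal is to shrink the diff to those per-server lines rather than to get it empty.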

A detailed chatlog of the operations is attached for future reference and details.

 
 

an update on this: I have managed to build charybdis against mbedtls, and in a jessie backport too. instructions are here:

we.riseup.net/ircd/charybdis-maintenance

the package is in my home on chat0 and che.

the two servers link well now, but can’t link against chat1, presumably because of the bug in gnutls.

we need to figure out how to migrate over the new servers now, since we can’t link them anymore… i guess we need to flip the switch?

comments?

 
   

final update on this: the servers were all upgraded to latest charybdis with mbedtls recently, and the network is all cross-linked again, whee!