This archive is retained to ensure existing URLs remain functional. It will not contain any emails sent to this mailing list after July 1, 2024. For all messages, including those sent before and after this date, please visit the new location of the archive at https://mailman.ripe.net/archives/list/routing-wg@ripe.net/
[routing-wg] RFO for RIPE NCC RPKI outage 16 February 2022
Ties de Kock
tdekock at ripe.net
Wed Feb 16 17:01:25 CET 2022
Dear colleagues,

This afternoon, between 13:00 UTC and 14:10 UTC, rrdp.ripe.net was unavailable. During this period, a significant fraction of relying party instances attempting to fall back to rsync://rpki.ripe.net could not retrieve objects due to capacity constraints.

At approximately 13:00 UTC, the RPKI team attempted to move the DNS records for rrdp.ripe.net out of the ripe.net zone file into a separate include file. We made this change to prepare for implementing automated failover between the CDNs. This resulted in an outage of the RRDP service, which was caused by an issue in the ripe.net zone file. The file contains several $ORIGIN directives, but they are not reset properly when a block ends. The consequence is that later relative names in the zone file accidentally get the incorrect origin applied to them, and it is easy to miss this if the $ORIGIN directive appears much earlier in the file.

To prevent such DNS issues in the future, all the blocks will be moved out of the main zone file into separate include files, because $ORIGIN directives in them do not persist beyond the end of the file.

Also, earlier today, our monitoring broke due to a change in the Prometheus configuration file. This reduced our visibility into the outage and meant no alerts were sent until monitoring recovered. A third contributing factor was that a secondary monitoring system watching the RPKI Prometheus infrastructure did not alert, because the web interface returned an HTTP 200 despite the broken configuration.

A final factor was that the capacity of rsync://rpki.ripe.net is limited. Only part of the relying party instances that attempted to fall back could update from rsync. This prevented the remaining instances from retrieving new objects.

Full timeline:
* 07:04 UTC: broken alert configuration committed
* 08:46 UTC: broken alert configuration applied, breaking monitoring
* 13:02 UTC: DNS change (effectively removing rrdp.ripe.net from zone) applied
* 13:44 UTC: alert configuration reverted
* 14:10 UTC: DNS configuration recovered
* 14:25 UTC: rsync connection rate back at baseline level

During the period where rrdp.ripe.net was not available, many relying party instances started falling back to rsync. From the partial data available, we observed a median rsync connection duration of 300 seconds and a 99th percentile of 1660 seconds, with ~55% of rsync connections disconnecting with an error code. Based on this preliminary data, we conclude that this is indicative of an underlying IO limitation in our NFS setup. We will investigate this further.

During the outage, our rsync servers returned 5043 “max connection reached” errors to 2307 unique IP addresses.

We have applied one mitigation (linting of alert configuration). We are also working on improving our external monitoring without a dependency on on-premise infrastructure.

Kind regards,
Ties
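
The $ORIGIN behaviour described in the message above can be illustrated with a small sketch. It is not the ripe.net zone file or the tooling used by the RIPE NCC: the zone contents, the example.net names, the lab.zone file name, and the helper functions `qualify` and `owner_names` are all hypothetical, and the parser handles only $ORIGIN, $INCLUDE, and single-line records that start with an owner name. It relies on the master file rules in RFC 1035: a $ORIGIN directive stays in effect until the next one, and $INCLUDE never changes the origin of the parent file.

```python
def qualify(name, origin):
    """Return the absolute form of a zone-file name relative to `origin`."""
    if name == "@":
        return origin
    if name.endswith("."):
        return name  # already absolute
    return f"{name}.{origin}"


def owner_names(lines, origin, includes=None):
    """Yield absolute owner names for a heavily simplified zone file.

    `includes` maps hypothetical file names to their lines.
    """
    includes = includes or {}
    for raw in lines:
        line = raw.split(";", 1)[0].strip()  # drop zone-file comments
        if not line:
            continue
        fields = line.split()
        if fields[0] == "$ORIGIN":
            # Stays in effect for every later relative name in this file.
            origin = qualify(fields[1], origin)
        elif fields[0] == "$INCLUDE":
            # The included file starts from the current (or an explicit)
            # origin; its own $ORIGIN changes do not leak back to the parent.
            child = qualify(fields[2], origin) if len(fields) > 2 else origin
            yield from owner_names(includes[fields[1]], child, includes)
        else:
            yield qualify(fields[0], origin)


# One file: the second $ORIGIN is never reset, so "rrdp" lands in the wrong place.
MAIN = [
    "$ORIGIN example.net.",
    "www   A     192.0.2.1",
    "$ORIGIN lab.example.net.",
    "host1 A     192.0.2.10",
    "rrdp  CNAME cdn.example.org.",  # intended: rrdp.example.net.
]

# Same records, with the block moved into an include file.
SPLIT = [
    "$ORIGIN example.net.",
    "www   A     192.0.2.1",
    "$INCLUDE lab.zone",
    "rrdp  CNAME cdn.example.org.",
]
INCLUDES = {"lab.zone": ["$ORIGIN lab.example.net.", "host1 A 192.0.2.10"]}

print(list(owner_names(MAIN, "example.net.")))
print(list(owner_names(SPLIT, "example.net.", INCLUDES)))
```

In the single-file case the last record resolves to rrdp.lab.example.net., while in the split case it resolves to rrdp.example.net., because the origin set inside the include file ends with that file.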