[atlas]Probe flapping
Daniel Karrenberg
daniel.karrenberg at ripe.net
Thu Dec 16 14:57:46 CET 2010
Intermediate update to keep those interested informed. I am writing this to keep the engineers free to work the problem. I do not know the nitty-gritty details, so this is a general overview. No conclusions yet.

Architecture:

After registering with the RIPE Atlas network, the probes are connected to "controllers" that handle requests to/from the probes. The architecture allows probes to use any controller in the system. Probes are currently distributed among controllers according to geographic and load-balancing heuristics. We have four controllers at the moment:

- 1 in Germany on a dedicated server: ronin
- 1 in the US on a dedicated server: carson
- 2 in NL on RIPE NCC VMs: caldwell and zelenka

You can see the number of probes associated with each controller and some other details on https://atlas.ripe.net/statistics . This page is updated hourly.

What happened:

This morning zelenka was in standby and ronin started disassociating probes in a massive way. We do not know the root cause of this. The most likely cause so far is a connectivity problem, but we are investigating with an open mind. The system reacted as designed and the probes dropped by ronin started to register with caldwell. Unfortunately caldwell became overloaded by this, both because of its physical limitations and because of an unfortunate database configuration error. Probes associated with carson were not affected.

What we are doing:

We brought up zelenka, but as Murphy dictates the RIPE NCC firewall prevented probes from reaching it. This has been fixed and zelenka is now picking up probes. We are working hard to fix a number of minor problems uncovered by this and to get all probes re-connected and their data backlog processed.

What we have learned so far:

We need a larger safety margin in the capacity of the controllers versus the number of deployed probes. We will start moving caldwell and zelenka onto physical machines outside of firewalls and other complications. We also need to exercise moving probes among controllers and verify that the safety margin exists in reality.

Personally I regard all this as normal teething problems in a distributed computing deployment. So far the architecture is holding up well; just the implementation has some flaws. Please bear with us.

If anyone has suggestions for high-quality hosting of controllers in the RIPE region, please drop me and Robert a private mail.

Daniel
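[Editor's note: the failover behaviour described above (a probe dropped by one controller re-registering with another) can be illustrated with a minimal sketch. This is not the actual RIPE Atlas probe code; the controller host names, port, and the register() helper are illustrative assumptions only.]

# Hypothetical sketch of probe failover between controllers.
# Assumptions: controllers accept a plain TCP control connection on
# port 443; host names below are placeholders, not real endpoints.

import itertools
import socket
import time

CONTROLLERS = [
    "ronin.example.net",     # Germany, dedicated server
    "carson.example.net",    # US, dedicated server
    "caldwell.example.net",  # NL, RIPE NCC VM
    "zelenka.example.net",   # NL, RIPE NCC VM
]

def register(host, port=443, timeout=10):
    """Attempt to open a control connection to one controller."""
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return None  # unreachable or refusing connections

def connect_with_failover():
    """Cycle through the controller list until one accepts the probe."""
    for host in itertools.cycle(CONTROLLERS):
        conn = register(host)
        if conn is not None:
            return host, conn
        time.sleep(30)  # back off before trying the next controller

if __name__ == "__main__":
    host, conn = connect_with_failover()
    print("probe associated with controller", host)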