This archive is retained to ensure existing URLs remain functional. It will not contain any emails sent to this mailing list after July 1, 2024. For all messages, including those sent before and after this date, please visit the new location of the archive at https://mailman.ripe.net/archives/list/ripe-atlas@ripe.net/

[atlas] Friday's events on RIPE Atlas

Previous message (by thread): [atlas] Booking.com IT Services, Singapore has joined RIPE Atlas anchors
Next message (by thread): [atlas] Friday's events on RIPE Atlas

Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Romeo Zwart romeo.zwart at ripe.net
Tue Sep 22 17:34:10 CEST 2015

Dear all,

Here is a description of the events that took place last Friday and some
'lessons learned' that we took away from them.

The high level summary of events is that an Atlas user was authorised to
create an extreme number of measurements involving a large number of
probes. This effectively overloaded the back-end machinery in various
different ways. Even though this event could only happen because of the
exception in resource use limits, we have implemented workarounds and
countermeasures to avoid a repetition in future, and we will be
investigating some of the more fundamental issues in the coming period.

More information is available below for those interested.

Observations:
- Problems started shortly before 11am when one of the Atlas users
created a large number of new measurements each involving all available
probes. The user had been given an exceptional amount of credits as part
of a special experiment. Therefore the normal limitations on the impact
any individual user can have on the system were not active when the
measurements were created and activated.
- The results of the newly created measurements put a lot of strain on
the measurement scheduler, which triggered our interest. After some
investigation the cause of the overload was identified and the related
measurements were ended.
- However, by this time the majority of results, up to that moment, had
already reached our queuing servers and the consumers were already
ingesting the results into our Hadoop storage platform.
- At this stage we discovered a capacity problem with the process that
consumes the Atlas results, so we doubled the capacity of that component
on the fly.
- This exposed the next bottleneck in our platform: the newly created
results accumulated on a very small number of processing nodes. Normally
the incoming measurement results are distributed over several storage
nodes, so this concentration strongly reduced the consumption rate of
new data.
- A third contributing factor was that, in an attempt to curb the
growth of the Atlas data, we migrated the Atlas data sets to a more
efficient compression algorithm earlier in the year. This saved us some
40-50% of storage space for the Atlas data, at the expense of some
compute power. Under normal circumstances, even at high loads, this
compute power is abundantly available on the storage cluster. Under the
specific circumstances of last Friday's events, it turned out that the
change of the compression algorithm had increased processing time for
some Hadoop system tasks by up to a factor of 8, which had a direct
impact on the data consumption speed.
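
As an illustration of the hotspot effect described above, here is a
minimal sketch in Python. It is not the actual Atlas pipeline: the
key layout, the node count and the helper names (node_for, spread)
are assumptions made for the example. It only shows the general
mechanism by which a few very large measurements can end up on a few
nodes when results are placed by measurement ID alone, and how adding
the probe ID to the key spreads the same load.

```python
import hashlib
from collections import Counter

def node_for(key: str, n_nodes: int) -> int:
    """Map a key to a node index with a stable hash (hypothetical placement rule)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes

def spread(keys, n_nodes: int) -> Counter:
    """Count how many results land on each node."""
    return Counter(node_for(k, n_nodes) for k in keys)

N_NODES = 16  # assumed cluster size, for illustration only

# Three large measurements, each reporting from 5000 probes (made-up numbers).
results = [(msm, probe) for msm in range(3) for probe in range(5000)]

# Keyed on measurement ID only: at most three nodes receive all the data.
by_msm = spread((f"{msm}" for msm, _ in results), N_NODES)

# Keyed on measurement ID plus probe ID: the same data is spread widely.
by_msm_probe = spread((f"{msm}:{probe}" for msm, probe in results), N_NODES)

print("nodes used, keyed by measurement:        ", len(by_msm))
print("nodes used, keyed by measurement + probe:", len(by_msm_probe))
print("busiest node, keyed by measurement:        ", max(by_msm.values()))
print("busiest node, keyed by measurement + probe:", max(by_msm_probe.values()))
```

Whether a composite key is appropriate depends on how the data is read
back afterwards, so this is only one possible direction.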

Immediate actions taken:
- Removed special privileges of the end-user in question
- Added capacity to the Atlas consumer processes
- Returned (temporarily) to less efficient compression on the Atlas data
sets.

Lessons learned and further planned action:
- Granting special privileges for some of the Atlas users needs (even)
more attention than it already receives.
- We need to better communicate "best practices" to these power users so
they can use their extra allowances responsibly.
- Improved compression of Atlas data has decreased our storage demands
but also decreased our processing capacity. This needs further
investigation to find the optimum configuration.
- Investigate possibilities to better spread incoming results over more
worker nodes (reduce hotspots).
- Investigate and quantify reasonable boundaries of scalability of the
whole system, to guide the limits for granting credits to end users.
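
For the compression item above, one way to start such an
investigation is to measure size and CPU time side by side on
representative data. The sketch below uses Python's standard zlib and
lzma modules purely as stand-ins; the codecs actually used on the
Atlas storage platform are not named in this message, and the sample
payload is synthetic.

```python
import lzma
import time
import zlib

def measure(name, compress, payload):
    """Return compressed/original size ratio and wall-clock time for one codec."""
    start = time.perf_counter()
    packed = compress(payload)
    elapsed = time.perf_counter() - start
    return {"codec": name, "ratio": len(packed) / len(payload), "seconds": elapsed}

# Synthetic, repetitive JSON-like lines standing in for measurement results.
payload = b"".join(
    b'{"prb_id": %d, "msm_id": 1000, "rtt": %d.5}\n' % (i, i % 97)
    for i in range(20000)
)

# Stand-in codecs: a fast setting, a slower setting, and a heavier algorithm.
codecs = {
    "zlib-1": lambda data: zlib.compress(data, 1),
    "zlib-9": lambda data: zlib.compress(data, 9),
    "lzma": lzma.compress,
}

report = [measure(name, fn, payload) for name, fn in codecs.items()]
for row in report:
    print(f"{row['codec']:8s} ratio={row['ratio']:.3f} time={row['seconds']:.4f}s")
```

The absolute numbers depend on the machine and the data; the point is
to compare storage savings against processing cost before settling on
a configuration.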

Kind regards,
Romeo Zwart
