This archive is retained to ensure existing URLs remain functional. It will not contain any emails sent to this mailing list after July 1, 2024. For all messages, including those sent before and after this date, please visit the new location of the archive at https://mailman.ripe.net/archives/list/ncc-services-wg@ripe.net/

[ncc-services-wg] Draft Cloud Strategy Framework

Mark Scholten mark at mscholten.eu
Wed Sep 1 14:26:15 CEST 2021

Hello,

Just my opinion on some of the points you raised (after reading most of the other reactions to the cloud strategy framework).

1. No one is against adding another datacenter (or moving one datacenter) to a location outside the Netherlands. Looking for a good location in the RIPE service region would solve many of these points. And for some data (mainly the read-only parts) a copy/mirror could be provided on another URL in other parts of the RIPE service region if required. This way the chance that everything goes offline at the same moment will be small. I know RIPE NCC has staff experienced with DNS solutions; for read-only data that is sent to the public, looking at the same basic technical solutions might help.

2. A private cloud could help a lot here, I think. Maybe combine it with looking for cloud/VM providers for some of the mirrors/frontends with the read-only data mentioned above. For expected short peaks it is also possible to rent server capacity for shorter periods.

3. This will probably not be solved by moving to a public cloud. Pipelines can also be built with a private cloud, or even without one. Many of the things you mention don't sound like a technical problem but more like a communication problem between people/departments. Outsourcing some technical parts to run your own private cloud might help. E.g. all hardware/datacenter-related work can be outsourced; enough options are available for this in the RIPE region.

If you want to outsource things, look at the resilience record in that area over the past few years and compare it with your own experience. I expect that RIPE NCC is doing really well compared to many public clouds (and compare only what you would use from them; what you will not use is not important to compare).

Kind regards,

Mark

> -----Original Message-----
> From: ncc-services-wg <ncc-services-wg-bounces at ripe.net> On Behalf Of
> Felipe Victolla Silveira
> Sent: Thursday, August 26, 2021 14:47
> To: ncc-services-wg <ncc-services-wg at ripe.net>
> Subject: Re: [ncc-services-wg] Draft Cloud Strategy Framework
> 
> Dear Gert, Hank,
> 
> First, our apologies again for the delay in our response. A few of us were
> taking our summer break and our colleagues didn't want to respond without
> checking with us first.
> 
> To recap, we’ve outlined our core goals - improve the resilience of our
> services, become more agile and flexible as an organisation, and focus
> engineering expertise on our core business. You correctly point out that we
> haven't really talked about the problems we’re trying to solve.
> 
> Fair point - we're not used to talking about the firefighting that's needed
> behind the scenes. We can go over some of this now. We can start by noting
> that if you take the inverse of the benefits we've listed so far, you find most
> of the problems we're trying to solve.
> 
> 1. Improve resilience and availability
> 
> We currently host our infrastructure in two data centres in Amsterdam.
> While they have provided excellent availability so far, users further afield
> (South America, Oceania, Asia) experience high latency when accessing our
> services. Importantly, an outage affecting both of these data centres would
> render all of our services offline.
> 
> Public cloud providers have many global regions available, allowing us to
> choose the level of resilience that best fits a particular service - protecting us
> against multiple hardware failures or natural disasters (remember that we
> are below sea level here).
> 
> 2. Become more agile and flexible
> 
> We're proud of the stable and highly-available services we provide. Here we
> can credit the expertise and hard work of our engineering staff, but also a
> continuous investment in our infrastructure over time. This has a big
> footprint - we are currently using almost 50 racks across our two data
> centres.
> 
> Each hardware element has its own lifecycle: procurement, shipping,
> installation, configuration, patching, upgrading and retiring. With hundreds of
> servers, network and storage equipment, this is a continuous operation that
> takes a lot of time and effort. Hardware maintenance is not even the biggest
> challenge here: our infrastructure doesn't offer much in the way of flexibility
> and making changes is complex and expensive.
> 
> Our infrastructure also lacks elasticity, meaning that we have to estimate
> demand and over-provision our services to cover any peaks. This makes us
> less agile, by forcing us into long-term commitments and requiring us to pay
> for a lot of unused or idle resources.
> 
> 3. Focus engineering expertise on our core business
> 
> For each new application or change to our infrastructure, there are a lot of
> manual steps that require tickets back and forth between separate
> engineering teams. Getting from idea to reality can take many months, and
> we can see this impacting our ability to innovate. This is inevitable when
> attention turns from service excellence to fixing problems and time-
> consuming, mundane maintenance tasks. We especially don't like this
> because we often need to react quickly as an organisation, while also being
> able to experiment with new services in an efficient way.
> 
> By moving to the cloud, we can build pipelines to deploy code faster, with
> fewer errors and manual steps, and provide sandbox accounts for engineers
> to quickly and safely test new technologies. We can also automate security
> auditing and reporting as much as possible, at all application and
> infrastructure layers.
> 
> There were two good comments on the article recently, from Niall Murphy
> and Bert Hubert. We will respond to these soon, but I would like to reference
> one point Bert makes there, which is essentially "Don't outsource your key
> capabilities." We completely agree with this (many of us have been reading
> Bert's article on this topic recently). This is precisely what we are *not* doing.
> 
> While it is important to have in-house expertise on all technical layers, some
> are more important than others. For example, at the physical layer we are
> already using data centre remote hands to replace failed disks, and we
> generally want to eliminate as much of the repetitive work to unpack, rack,
> and cable equipment in the data centre as we can. The resources we save
> here can be used to double down on the capabilities we want to develop
> further. We will continue to write our own software and control our
> deployment pipelines, and configure routers, firewalls, load balancers, and
> storage devices - whether they are physical or virtual, on-premise or in the
> cloud.
> 
> I see Hank's suggestion that we compile a list of outages. I'm reluctant to ask
> our engineers to spend time on this when I think they'll find we have very
> resilient services. But past results are not always the best indicator of future
> performance. And with RPKI especially, I also expect that what we consider
> acceptable resilience might increase as more and more networks come to
> rely on it.
> 
> > (Also I find "evade the discussion on the list by posting a new lengthy article
> > on labs every few months" not really helpful)
> 
> I do want to respond to this point. We sometimes miss a comment or take
> longer to respond than is acceptable, and this is not something that we take
> lightly as a company. But I would be disappointed if the community thought
> we were trying to evade discussion. We are here, we are listening, and we
> will respond.
> 
> With that, it's over to you again - let me know if you feel I’ve missed anything
> here.
> 
> Regards
> 
> Felipe Victolla Silveira
> Chief Operations Officer
> RIPE NCC
