This archive is retained to ensure existing URLs remain functional. It will not contain any emails sent to this mailing list after July 1, 2024. For all messages, including those sent before and after this date, please visit the new location of the archive at https://mailman.ripe.net/archives/list/[email protected]/
[routing-wg] Improving operations at RIPE NCC TA (Was: Delay in publishing RPKI objects)
Daniel Suchy
danny at danysek.cz
Wed Feb 17 17:22:46 CET 2021
Hello,

I agree with Job that the reliability of the TA needs to be improved, and
I fully support the ideas described by Job below.

- Daniel

On 2/17/21 4:58 PM, Job Snijders via routing-wg wrote:
> Dear RIPE NCC,
>
> On Wed, Feb 17, 2021 at 11:28:32AM +0100, Nathalie Trenaman wrote:
>>> The multitude of RPKI service impacting events as a result from
>>> maloperation of the RIPE NCC trust anchor are starting to give me
>>> cause for concern.
>>
>> I’m sorry to hear this. Transparency is key for us, this means that we
>> report any event. In this case, we were not compliant with our CPS and
>> this non-publishing period had operational impact.
>
> From the previous email there might be a misunderstanding about what
> rpki-client and Routinator do. Both utilities help Relying Parties
> validate X.509 signed CMS objects and transform the validated content
> into authorizations and attestations. Neither utility is a SLA or CPS
> compliance monitor. RIPE NCC - as CA operator - needs different tools.
>
> Neither utility has been designed to interpret the Certification
> Practise Policy (written in a natural language) and subsequently
> programmatically transform the described 'Service Level' into metrics
> suitable for monitoring.
>
> A relying party can never tell the difference between a publication
> pipeline being empty because CAs didn't issue new objects, or a
> publication pipeline being empty because of a malfunction in one of RIPE
> NCC's RPKI subsystems.
>
> More examples of 'out of scope' functionality for Relying Party
> software: validators don't monitor whether lirportal.ripe.net is
> functional, whether RIPE NCC's BPKI API endpoints are operational, or
> whether LIRs paid their invoices, the list is quite long. The validators
> by themselves are the wrong tool for RPKI CPS/SLA monitoring.
>
> You state "transparency is key for us", but I fear ad-hoc low-quality
> a-posteriori reports are not the appropriate mechanism to impress and
> reassure this community regarding 'transparency'.
>
> I have some tangible suggestions to RIPE NCC that will improve the
> reliability of the Trust Anchor and potentially help rebuild trust:
>
> A need for Certificate Transparency
> -----------------------------------
>
> RIPE NCC should set up a Certificate Transparency project which publicly
> shows which certificates (fingerprints) were issued when, and store such
> information in immutable logs, accessible to all.
>
> How can anyone trust a Trust Anchor which does not offer transparency
> about its issuance process?
>
> Lack of transparency to signer software
> ---------------------------------------
>
> The RIPE NCC WHOIS database software is open source, as is most of the
> software for RIPE Atlas, K-ROOT, and other efforts RIPE NCC has
> undertaken over the years.
>
> Why has the signer source code still not been open sourced? Why can't
> members review the code related to scheduled changes? Why is an
> organisation proclaiming 'transparency' being opaque about how the RPKI
> certificates are issued?
>
> Lack of Public status dashboard
> -------------------------------
>
> RIPE NCC should set up a website like https://rpki-status.ripe.net/
> which shows dashboards with graphs and traffic lights related to each
> (best effort) commitment listed in the CPS. RIPE NCC should continuously
> publish & revoke & delete objects and verify whether those activities
> are visible externally, and then automatically report whether any
> potential delays observed are within the Service Levels outlined in the
> CPS.
>
> Metrics that come to mind:
>
> * delta between last certificate issuance & successful publication
> * Object count in the repository, repo size (and graphs)
> * Time-To-Keyroll (and graphs on duration & frequency)
> * Resource utilisation of various RPKI subsystems
> * aggregate bandwidth consumption for RPKI endpoints (including rrdp, API, rsync)
> * Graphs & logs of overlap between INRs listed on EE certificates under
>   the RIPE TA and other commonly used TAs, matched against known
>   transfers. This will help detect compromises as well as understand
>   whether transfers are successful or not.
> * Unique client IP count for RSYNC & RRDP for last hour/day/week
> * Number of CS/hostmaster tickets mentioning RPKI
>
> There are plenty of aspects to monitor, perhaps some notes should be
> copied from how the DNS root is monitored.
>
> Lack of operational experience with BGP ROV at RIPE NCC
> -------------------------------------------------------
>
> I believe the number of potential future incidents related to the RIPE
> NCC Trust Anchor can be prevented (or remediation time reduced) if RIPE
> NCC themselves apply RPKI based BGP Origin Validation 'invalid ==
> reject' policies on the AS 3333 EBGP border routers. RIPE NCC OPS
> themselves having a dependency on the RPKI services will increase
> organization-wide exposure to the (lack of) well-being of the Trust
> Anchor services, and given the short communication channels between the
> OPS team and the RPKI team my expectation is that we'll see problems
> being solved faster and perhaps even problems being prevented.
>
> An analogy: RIPE NCC is a kitchen chef refusing to eat their own food.
> How can we trust RIPE NCC to operate RPKI services, when RIPE NCC
> themselves refuses to apply the cryptographic products to their BGP
> routing decisions? "RPKI for thee but not for me?"
>
> Surely RIPE NCC staff has not disabled DNSSEC sig checking on their
> resolvers, or disabled WebPKI TLS validation in their browsers? I'm not
> joking, it makes zero sense to participate in a PKI and at the same time
> not participate in the same way everyone outside RIPE NCC depends on the
> service.
>
> I am very aware of potential for circular dependencies between BGP and
> RPKI, and I know exactly how catch-22s can be avoided. Unfortunately it
> appears my feedback is ignored, problem reports remain unresolved.
>
> Reporting issues has become a thankless effort, useless because no
> remediation actions are taken, and obviously RIPE NCC staff are growing
> tired of hearing about problems (but if one wishes to stop hearing about
> problems... perhaps solve the issues, instead of a 'head in the sand'
> approach?!)
>
> Conclusion & Call to action
> ---------------------------
>
> There is a fair chunk of work ahead for RIPE NCC, but RIPE NCC has a
> multi-million budget and talented dedicated staff to achieve the above.
> None of the above is impossible or unreasonable to ask from Trust
> Anchors.
>
> I recognize how the above information reflects negatively on the current
> state of the RIPE NCC Trust Anchor. But the reality of the situation is
> that we see an outage every few weeks, there is an apparent lack of
> architectural oversight how to improve. I really hope this is a
> temporary state of being, on which we can look back a year from now as
> "haha remember those RPKI teething pains?". I wish for RIPE NCC to
> be successful in operating the Trust Anchor.
>
> RIPE NCC would do well to allow themselves to be vulnerable to criticism
> by exposing service level metrics and efforts like production of public
> merkle tree logs - reflecting the certificate issuance process. RIPE NCC
> should allow itself to be held accountable - which can only happen if
> there is visibility into where friction exists.
>
> Does RIPE NCC understand the precariousness of the current situation and
> the negative impact on the long term viability of the RPKI if course is
> not corrected?
>
> This email outlines deliverables, will RIPE NCC commit to those? What
> timelines can the community expect? What kind of help is needed to
> achieve this?
>
> Kind regards,
>
> Job
>