[Dnsmon-test] Feedback: data and backend

Robert Kisteleki robert at ripe.net
Thu Feb 13 16:05:00 CET 2014


Hello again,

On 2014.02.13. 14:55, Gilles Massen wrote:
> Hello,
> 
> As promised, some questions / comments on the data and measurements.
> Again, in no particular order.
> 
> Data quality: as an exercise I tried to investigate errors and speckles
> seen on the TTM-DNSMON on the Atlas-DNSMON. And I failed completely. So
> either I'm thick, or there is a fundamental difference between the old
> and new systems that really needs to be spelled out. Just to have an
> example: this graph (
> http://dnsmon.ripe.net/dns-servmon/server/plot?type=drops&server=k.dns.lu&af=ipv4&day=13&month=2&year=2014&hour=6&minutes=0&period=lastval&plot=SHOW
> ) shows many yellow spots. A similar view on Atlas-DNSMON (no link
> (hint,hint)) shows a perfect world (no unanswered queries). Where does

We have had requests for "nicer URLs", and "permalinks" are already on the
being-worked-on list.

> the difference come from? No single element seems to explain it
> (frequency, different probes, geographic location), so what am I missing?

This is most likely because the new implementation is more forgiving: it
uses higher timeouts. In fact, the data shows that many of the replies
actually arrived after 3-5 seconds(!); those show up as green in the new
version, and as red in the old.

However, it's possible to show this event in the new system, using the
relative RTT graphs -- see the attachment.
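
To illustrate the timeout effect described above, here is a minimal sketch.
The two timeout values and the sample RTTs are made up for the example and
are not taken from either system; only the 3-5 second range comes from the
observation above.

```python
# Hypothetical timeouts in seconds, for illustration only; neither value
# is the real TTM-DNSMON or Atlas-DNSMON setting.
OLD_TIMEOUT = 2.0
NEW_TIMEOUT = 5.0


def classify(rtt_seconds, timeout_seconds):
    """Return 'answered' if a reply arrived within the timeout, else 'unanswered'."""
    if rtt_seconds is None:  # no reply was ever received
        return "unanswered"
    return "answered" if rtt_seconds <= timeout_seconds else "unanswered"


# Made-up sample: one fast reply, two slow replies (3-5 s), one lost query.
rtts = [0.04, 3.2, 4.8, None]

old_view = [classify(r, OLD_TIMEOUT) for r in rtts]
new_view = [classify(r, NEW_TIMEOUT) for r in rtts]
```

With these assumed numbers the two slow replies count as unanswered under
the shorter timeout and as answered under the longer one, so the same event
can look very different in the two systems.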

> Server selection: could you please clarify what servers are monitored?
> Those advertised by the parent, or in the zone? How often is it checked?
> And what happens on a change? (is there a delay on removing void servers)?

For an initial server selection we use the list in the zone itself. At the
moment we're not tracking changes on this -- but it's something worth
investigating. Does it happen often? Should it be automatically changed?
Would you expect a certain reaction time from the system?

(Background info: we're wondering if this is a frequent enough event for us
to build processes around it.)
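
If we did track changes, the core of it could be as simple as comparing two
snapshots of the NS set taken from the zone. This is only a sketch of the
idea, not something we run; the function name and the example host names are
hypothetical, and fetching the NS set is left out on purpose.

```python
def diff_nameservers(previous, current):
    """Compare two NS sets, ignoring case and a trailing dot."""

    def normalise(names):
        return {name.rstrip(".").lower() for name in names}

    prev, curr = normalise(previous), normalise(current)
    return {
        "added": sorted(curr - prev),    # servers newly listed in the zone
        "removed": sorted(prev - curr),  # servers no longer listed
    }


# Made-up snapshots of a zone's NS set at two points in time.
changes = diff_nameservers(
    ["ns1.example.", "NS2.example."],
    ["ns2.example", "ns3.example"],
)
```

How often such a comparison should run, and how long a removed server should
stay monitored, are exactly the open questions above.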

> Probe quantity and location: the use of atlas anchors is certainly a
> good idea. However given the quantity and locations of anchors (almost
> nothing beyond Europe) I'd like to suggest to include a few normal
> probes in the underrepresented regions, and phase them out when more
> anchors become available. They could be handpicked based on uptime and
> connection quality. The current view is really too poor compared to the
> TTM view.

We have anchors in the making in Africa, the US, and Australia, amongst
other places, so the bias will change over time and the coverage will
converge towards what TTM used.

In the meantime, we opted not to use smaller/home probes. We tried, but it
really affected the results. It also caused artificial differences in
monitoring across different zones: the ones that used more flaky probes
showed up as less stable, though they were not.

This led us to use anchors only for these measurements.
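
A toy example of the bias, with entirely made-up numbers: two zones whose
servers answer every query, where one zone happens to be measured by a probe
that loses queries on its own uplink. The probe names and loss counts are
hypothetical.

```python
def unanswered_rate(lost_per_probe, queries_per_probe):
    """Fraction of unanswered queries across all probes measuring a zone."""
    total_queries = queries_per_probe * len(lost_per_probe)
    return sum(lost_per_probe.values()) / total_queries


QUERIES_PER_PROBE = 10

# Both zones' servers answer everything; the only losses are caused by
# the flaky probe's own connectivity.
zone_a = unanswered_rate({"anchor-1": 0, "anchor-2": 0, "anchor-3": 0},
                         QUERIES_PER_PROBE)
zone_b = unanswered_rate({"anchor-1": 0, "anchor-2": 0, "home-probe": 4},
                         QUERIES_PER_PROBE)
```

Here zone_b ends up with a non-zero unanswered rate while zone_a stays at
zero, although the servers behaved identically.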

> Access to raw data: it would be useful to have a way to locate the raw
> data efficiently. I suppose the measurement id's are likely to change
> over time, so either a fixed identifier (like Gert Goering suggested)
> could help, or an API to retrieve things like "the msm_id's contributing
> for <tld> dnsmon.". I'd humbly suggest to provide that rather quick: as
> long as anycast instances are not visible via the user interface, it
> would be helpful to retrieve at least the information from the
> hostname.bind measurements without duplicating them.

We intend to keep the measurement IDs as stable as we can, even if we need
to involve new probes in a measurement. However, you have a point that
tagging would be even more useful. We'll implement that once we have proper
support from the Atlas backend.

> Frequency: if frequency of measurement is discussed, I'd prefer rather
> frequent SOA queries (certainly better than 1/5min) over hostname.bind
> queries, for the simple reason that I'd suspect routing to be more
> stable than DNS data (hand waving...). To me the data propagation
> delay certainly is an interesting data point.

This is one question where we will need more input from our testers;
opinions we heard so far (officially or not) differ widely.

> That's it for the time being. Feel free to criticize!

Regards,
Robert, for the team

> Best,
> Gilles
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2014-02-13 at 15.49.35.png
Type: image/png
Size: 109905 bytes
Desc: not available
Url : https://www.ripe.net/mailman/private/dnsmon-test/attachments/20140213/793f783a/attachment-0001.png