[Dnsmon-test] Native Resolution etc

Daniel Karrenberg daniel.karrenberg at ripe.net
Tue Mar 11 18:15:59 CET 2014


On 11.03.2014, at 13:51 , Robert Kisteleki <robert at ripe.net> wrote:

> On 2014.03.11. 12:47, Daniel Karrenberg wrote:
>> 
>> Thank you for making native resolution available. It is very very useful in isolating events in time and in getting a feel for what is happening during short intervals when significant changes happen. For the same reason I would like to see shorter aggregation buckets in other views too when the selected time interval is short, e.g. I want to be able to pinpoint the exact time when multiple servers serving a particular domain degrade. This would enhance the (perception of) usefulness of dnsmon in incident reporting.
> 
> We'll look into this.
> 
>> For the same reason I would like to see a combination view of all measurements using the same transport, e.g. IPv4/UDP.

Please look into this as well. High time resolution is really useful for outage reporting. dnsmon new still loses too much time resolution in comparison to dnsmon classic. I suggest we keep querying at least once per minute per probe and server on average, at least for UDP.

>> 
>> What puzzles me are differences between dnsmon classic and new like this example:
>> 
>> 
>> http://dnsmon.ripe.net/dns-servmon/server/plot?server=b.root-servers.net;type=drops;tstart=1392105600;tstop=1392127199;af=ipv4
>> 
>> https://atlas.ripe.net/dnsmon/index-page?dnsmon.server=192.228.79.201&dnsmon.zone=root&dnsmon.type=probes&dnsmon.startTime=1394515417&dnsmon.endTime=1394536533&dnsmon.selectedRows=&dnsmon.isTcp=true&dnsmon.session.show-filter=pls&dnsmon.session.color_range_pls=0-66-66-99-100&dnsmon.session.exclude-errors=true
>> 
>> I have the impression that we are losing significant signal here. Or is this just an illustration of why silent retries were not a good idea?
>> 
>> Daniel


OK. Let's attribute that to retries then, for lack of a better explanation.



This more recent comparison clearly shows that dnsmon classic gets additional resolution by sending several queries at each query time and recording the number of queries answered.

http://dnsmon.ripe.net/dns-servmon/server/plot?type=drops&server=b.root-servers.net&af=ipv4&day=8&month=3&year=2014&hour=4&minutes=0&period=2h&plot=SHOW

https://atlas.ripe.net/dnsmon/index-page?dnsmon.server=192.228.79.201&dnsmon.zone=root&dnsmon.type=probes&dnsmon.startTime=1394251220&dnsmon.endTime=1394258383&dnsmon.selectedRows=&dnsmon.isTcp=true&dnsmon.session.show-filter=pls&dnsmon.session.color_range_pls=0-66-66-99-100&dnsmon.session.exclude-errors=true

It would be preferable if the new implementation used several queries per query time for UDP as well, because it provides significantly more signal for overloaded servers/networks. If the atlas firmware does not allow that at the present time, please put it on the list.
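
To illustrate why several queries per query time give more signal, here is a minimal sketch. The function names are made up for illustration, and the figure of 5 queries per query time is an assumption; I have not checked how many dnsmon classic actually sends.

```python
from fractions import Fraction


def observable_loss_levels(queries_per_round):
    """Return every loss fraction one probe can report for one query time."""
    n = queries_per_round
    return [Fraction(lost, n) for lost in range(n + 1)]


def loss_fraction(sent, answered):
    """Loss seen by one probe at one query time."""
    return Fraction(sent - answered, sent)


# One query per query time: the probe can only report "all answered" or "all lost".
print([str(level) for level in observable_loss_levels(1)])

# Hypothetical 5 queries per query time: intermediate loss levels become visible.
print([str(level) for level in observable_loss_levels(5)])

# Example: 3 of 5 queries answered by a partially overloaded server.
print(loss_fraction(sent=5, answered=3))
```

With a single query, a server that drops, say, 40% of queries can only show up as 0% or 100% loss for that probe and query time; the intermediate value appears only after aggregating over time or over probes, which is exactly where the time resolution goes.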



If one assumes that dnsmon classic is correct about the timing of this event:

http://dnsmon.ripe.net/dns-servmon/server/plot?type=drops&server=d.root-servers.net&af=ipv4&day=6&month=3&year=2014&hour=18&minutes=0&period=2h&plot=SHOW

dnsmon new differs subtly in the times it records, quite apart from its lower resolution:

https://atlas.ripe.net/dnsmon/index-page?dnsmon.server=199.7.91.13&dnsmon.zone=root&dnsmon.type=probes&dnsmon.startTime=1394130009&dnsmon.endTime=1394131735&dnsmon.selectedRows=&dnsmon.isTcp=false&dnsmon.session.show-filter=pls&dnsmon.session.color_range_pls=0-66-66-99-100&dnsmon.session.exclude-errors=true

I cannot immediately see a reason for this, so I suggest investigating the cause of the discrepancy. Differences in how time is recorded may become very significant when investigating events.
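
As a first step when comparing the two views, it helps to put both time windows on the same scale. The following is a rough sketch that reads the window out of the two URL styles quoted above. It assumes that dnsmon.startTime and dnsmon.endTime are Unix timestamps in seconds and that the dnsmon classic day/hour parameters are in UTC; neither is confirmed here. The URLs are shortened to the time-related parameters, and the function names are made up.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# Shortened copies of the two URLs above (only time-related parameters kept).
CLASSIC_URL = ("http://dnsmon.ripe.net/dns-servmon/server/plot"
               "?day=6&month=3&year=2014&hour=18&minutes=0&period=2h")
NEW_URL = ("https://atlas.ripe.net/dnsmon/index-page"
           "?dnsmon.startTime=1394130009&dnsmon.endTime=1394131735")


def new_window(url):
    """Window of a dnsmon new URL, assuming epoch seconds."""
    q = parse_qs(urlsplit(url).query)
    start = datetime.fromtimestamp(int(q["dnsmon.startTime"][0]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(q["dnsmon.endTime"][0]), tz=timezone.utc)
    return start, end


def classic_window(url):
    """Window of a dnsmon classic URL, assuming UTC and a period given in hours."""
    q = parse_qs(urlsplit(url).query)
    start = datetime(int(q["year"][0]), int(q["month"][0]), int(q["day"][0]),
                     int(q["hour"][0]), int(q["minutes"][0]), tzinfo=timezone.utc)
    hours = int(q["period"][0].rstrip("h"))
    return start, start + timedelta(hours=hours)


for name, (start, end) in (("classic", classic_window(CLASSIC_URL)),
                           ("new", new_window(NEW_URL))):
    print(name, start.isoformat(), "to", end.isoformat())
```

Under these assumptions the dnsmon new window falls inside the two-hour dnsmon classic window, so both plots should be showing the same event and the remaining offset would have to come from how the measurements themselves are timestamped.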

Also, the UI in dnsmon new could make it much easier to get to this representation if the "zoom in" button remained available and if the bottom time selection slider scaled with the selected interval so that it stays usable. I got there with the change time tool eventually, although it keeps changing my inputs on me. Suggestion: do not update anything while the user is typing values into the change time tool, because immediate updates are awkward on low-bandwidth connections and do not add much on high-bandwidth ones.

Daniel





