
RIPE Atlas monitoring of reachability & latency towards anycasted Wikidough IP
Open, Medium, Public

Description

As (re-)discovered in T283359 (Create RIPE Atlas measurements against our authoritative DNS servers; alert on them), anycast routing can offer many surprises in practice.

It's possible we'll need to balance user latency vs traffic engineering configuration complexity when doing anycasting for real services.

Either way, we'll still want to monitor latency and reachability of Wikidough from many vantage points on the Internet.

This is [for now] a placeholder task to configure RIPE Atlas for this.

Event Timeline

Marostegui removed a project: SRE.

This service is now live in codfw, answering DoH, DoT and plain-old DNS.

I believe dnsdist is terminating the DoH/DoT, but regular queries on UDP/TCP 53 go directly to PowerDNS Recursor. No problem with that, but it's worth noting that we only get something back for NSID when we talk to dnsdist, and I assume the Atlas probes can only make regular queries.

root@debiantest:~# dig +nsid +https www.ietf.org @wikimedia-dns.org

; <<>> DiG 9.17.13-2+0~20210520.56+debian11~1.gbp96c80e-Debian <<>> +nsid +https www.ietf.org @wikimedia-dns.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39806
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 6d 61 6c 6d 6f 6b ("malmok")
;; QUESTION SECTION:
;www.ietf.org.			IN	A

;; ANSWER SECTION:
www.ietf.org.		944	IN	CNAME	www.ietf.org.cdn.cloudflare.net.
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.44.99
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.45.99

;; Query time: 159 msec
;; SERVER: 185.71.138.138#443(wikimedia-dns.org) (HTTPS)
;; WHEN: Wed May 26 18:11:11 BST 2021
;; MSG SIZE  rcvd: 128
root@debiantest:~# dig +nsid +tls www.ietf.org @wikimedia-dns.org

; <<>> DiG 9.17.13-2+0~20210520.56+debian11~1.gbp96c80e-Debian <<>> +nsid +tls www.ietf.org @wikimedia-dns.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50986
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 6d 61 6c 6d 6f 6b ("malmok")
;; QUESTION SECTION:
;www.ietf.org.			IN	A

;; ANSWER SECTION:
www.ietf.org.		275	IN	CNAME	www.ietf.org.cdn.cloudflare.net.
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.44.99
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.45.99

;; Query time: 348 msec
;; SERVER: 185.71.138.138#853(wikimedia-dns.org) (TLS)
;; WHEN: Wed May 26 18:22:20 BST 2021
;; MSG SIZE  rcvd: 128
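
For monitoring, the NSID line in the output above is the useful part: it tells us which host behind the anycast address answered. A minimal sketch of pulling it out of dig's text output, assuming the `; NSID: <hex bytes> ("text")` layout shown above (the helper name `nsid_from_dig_line` is made up for illustration):

```python
import re
from typing import Optional

# Matches the hex byte run that dig prints after "NSID:" in the OPT pseudosection.
_NSID_RE = re.compile(r";\s*NSID:\s*((?:[0-9a-fA-F]{2}\s*)+)")


def nsid_from_dig_line(line: str) -> Optional[str]:
    """Return the decoded NSID from one line of dig output, or None if absent."""
    match = _NSID_RE.search(line)
    if match is None:
        return None
    raw = bytes.fromhex(match.group(1).replace(" ", ""))
    # NSID is an opaque byte string; it is decoded leniently here for display.
    return raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    print(nsid_from_dig_line('; NSID: 6d 61 6c 6d 6f 6b ("malmok")'))
```

Grouping results by this value per vantage point would show which site each vantage point is routed to.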

EDIT: Removed incorrect info on regular DNS queries, managed to confuse myself.

root@debiantest:~# dig +nsid www.ietf.org @wikimedia-dns.org

; <<>> DiG 9.17.13-2+0~20210520.56+debian11~1.gbp96c80e-Debian <<>> +nsid www.ietf.org @wikimedia-dns.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35293
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
; COOKIE: bd8627f07957ae320100000060ae81eb55bd3b21e9c23314 (good)
;; QUESTION SECTION:
;www.ietf.org.			IN	A

;; ANSWER SECTION:
www.ietf.org.		1800	IN	CNAME	www.ietf.org.cdn.cloudflare.net.
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.44.99
www.ietf.org.cdn.cloudflare.net. 300 IN	A	104.16.45.99

;; Query time: 2467 msec
;; SERVER: 185.71.138.138#53(wikimedia-dns.org) (UDP)
;; WHEN: Wed May 26 18:14:19 BST 2021
;; MSG SIZE  rcvd: 146

Thanks for all the help with this task!

I just wanted to clarify,

I believe dnsdist is terminating the DoH/DoT, but regular queries on UDP/TCP 53 go directly to PowerDNS Recursor.

Wikidough -- by design -- only supports DoH and DoT, so we listen on ports TCP/443 and TCP/853. We have no plans to support UDP/53 or TCP/53. PowerDNS Recursor listens on localhost/53 for queries from dnsdist but does not answer queries directly. (More at https://wikitech.wikimedia.org/wiki/Wikidough#Design.)
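
To make the difference between the two supported transports concrete, here is a rough sketch of how the same DNS query is carried over each: DoT prefixes the message with a two-byte length (as DNS over TCP does), while DoH can carry it base64url-encoded in a GET parameter (RFC 8484). This only builds the bytes and the URL and sends nothing. The `/dns-query` path is an assumption taken from the RFC's example and should be checked against the Wikidough documentation; the function names are illustrative.

```python
import base64
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0) -> bytes:
    """Build a minimal DNS query message (one question, RD flag set)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A by default) and QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def frame_for_dot(message: bytes) -> bytes:
    """Prefix the message with its 16-bit length, as sent over TCP/853 inside TLS."""
    return struct.pack("!H", len(message)) + message


def doh_get_url(message: bytes, host: str = "wikimedia-dns.org",
                path: str = "/dns-query") -> str:
    """Build a DoH GET URL; the path is an assumption, not taken from the task."""
    encoded = base64.urlsafe_b64encode(message).rstrip(b"=").decode("ascii")
    return f"https://{host}{path}?dns={encoded}"


if __name__ == "__main__":
    query = build_query("en.wikipedia.org")
    print(frame_for_dot(query).hex())
    print(doh_get_url(query))
```

The query ID defaults to 0, which RFC 8484 suggests for DoH so that identical queries produce identical, cacheable URLs.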

Ok thanks. That actually makes sense, and is what I had originally expected.

I have fallen into a trap I've hit before: my home network here is secretly redirecting all plaintext DNS to my own resolver. So when I checked whether Wikidough would answer on UDP/53, it looked like it did. I'm a muppet.

Testing from elsewhere it does indeed time out:

root@nyc2:~# dig A en.wikipedia.org @wikimedia-dns.org

; <<>> DiG 9.16.16-Ubuntu <<>> A en.wikipedia.org @wikimedia-dns.org
;; global options: +cmd
;; connection timed out; no servers could be reached

Not sure what that means for the Atlas probes but we will discuss in SRE. Thanks!

RIPE Atlas probes support sending DoT queries; however, the option is not exposed anywhere in the measurement creation web UI, nor in the official ripe-atlas-tools distribution. The unofficial blaeu tools for creating one-off measurements support sending the option, and there is also a pull request against ripe-atlas-tools that has been lingering for almost two years now.
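
As a rough illustration of what such a measurement request could look like when created through the API rather than the web UI, the sketch below only builds the JSON body and sends nothing. The field names, in particular `tls`, are written from memory of the RIPE Atlas measurement API and of what blaeu submits; treat them as assumptions to verify against the API reference before use. The function name and default probe selection are illustrative.

```python
import json
from typing import Any, Dict


def build_dot_measurement(target: str, query_name: str,
                          probes: int = 10) -> Dict[str, Any]:
    """Build (but do not submit) a one-off DNS-over-TLS measurement request body."""
    definition = {
        "type": "dns",
        "af": 4,
        "target": target,
        "description": f"DoT {query_name} via {target}",
        "query_class": "IN",
        "query_type": "A",
        "query_argument": query_name,
        "use_probe_resolver": False,  # query the target, not the probe's resolver
        "set_nsid_bit": True,         # ask the server to identify itself
        "protocol": "TCP",
        "tls": True,                  # assumed name of the DoT option
    }
    return {
        "is_oneoff": True,
        "definitions": [definition],
        # "WW" requests probes from anywhere; the count is arbitrary here.
        "probes": [{"requested": probes, "type": "area", "value": "WW"}],
    }


if __name__ == "__main__":
    body = build_dot_measurement("wikimedia-dns.org", "en.wikipedia.org")
    print(json.dumps(body, indent=2))
```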

The RIPE Atlas network does not support DoH by policy.

RIPE Atlas measurements have a lot of limitations when they're recurrent and not part of the Anchor network, which makes them less than ideal for monitoring Wikidough. For example, the number of usable probes per test is relatively low (~400).

There are 2 aspects to cover:

  • General reachability to our anycast prefixes, regardless of latency

This can be covered by a combination of the existing RIPE Anchors (connectivity to the DCs), possibly BGPalerter (prefix health), usage stats, maybe NEL (Network Error Logging), and maybe censorship detection tools?

  • Sub-optimal routing, to be tuned with T288843

Here again, RIPE Atlas would only give us a partial view of the world. We can, however, leverage real user traffic and estimated location by comparing each user's theoretical ideal POP with the one actually serving them.

For each client IP we could do a GeoIP lookup and store at least their location (country), AS#, POP and, if possible, client latency. There is no need to store the IP itself, which keeps privacy as high as possible.
We can then compare those country/POP pairs with a list such as the AuthDNS one to highlight providers that need attention and better tuning.
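
The comparison described above could look roughly like this. It is a sketch under several assumptions: samples have already been reduced to country, AS number and serving POP (no IP), the ideal POP is looked up per country from a hand-maintained table, and a provider is flagged once a given share of its samples land on a non-ideal POP. The names `Sample`, `find_suboptimal` and `min_share`, and the example data, are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """One anonymised request: no client IP, only derived attributes."""
    country: str
    asn: int
    pop: str
    latency_ms: Optional[float] = None


def find_suboptimal(samples: Iterable[Sample],
                    expected_pop: Dict[str, str],
                    min_share: float = 0.2) -> List[Tuple[str, int, float]]:
    """Return (country, asn, misrouted share) for providers needing attention."""
    totals: Counter = Counter()
    misrouted: Counter = Counter()
    for s in samples:
        key = (s.country, s.asn)
        totals[key] += 1
        ideal = expected_pop.get(s.country)
        # Countries missing from the table are counted but never flagged.
        if ideal is not None and s.pop != ideal:
            misrouted[key] += 1
    report = [
        (country, asn, misrouted[(country, asn)] / total)
        for (country, asn), total in totals.items()
        if misrouted[(country, asn)] / total >= min_share
    ]
    # Worst offenders first.
    return sorted(report, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    expected = {"FR": "esams"}  # illustrative mapping only
    data = [
        Sample("FR", 64500, "codfw"),
        Sample("FR", 64500, "codfw"),
        Sample("FR", 64500, "esams"),
        Sample("FR", 64501, "esams"),
    ]
    for country, asn, share in find_suboptimal(data, expected):
        print(f"{country} AS{asn}: {share:.0%} served by a non-ideal POP")
```

Keying on (country, AS) rather than on anything finer keeps the stored data coarse while still pointing at the providers whose routing needs tuning.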

@ssingh do we currently send anything to analytics? What do you think of the short proposal above?