automatically collect network error reports from users' browsers (Network Error Logging API)
Open, Medium, Public

Description

There are many classes of reliability issues (e.g. failures/misconfigurations in intermediate networks) that we only find out about via direct, manual reports from users, or (for very widespread cases) by noticing that traffic is 'missing' and below expected rates.

Some sort of 'external' monitoring is the usual solution to such blind spots, but of course such solutions come with their own false positives and other limitations (they require agreements with commercial providers; APIs for scraping result data are sometimes limited; reliability problems are often specific to the monitoring provider's infrastructure rather than anything 'real'; such providers' probes generally run within datacenters instead of at Internet edges / from within residential ISP networks; the geographic distribution of a provider's probes doesn't match the userbase's geographic distribution; the characteristics of synthetic traffic don't necessarily match those of real traffic; etc.).

There's another option: asking browsers to send you an error report some fraction of the time when they can't fetch from your site. This is specified in a W3C draft technical report, the Network Error Logging API, part of the broader Reporting API. Currently the NEL API is implemented and enabled by default only in Chrome >=71 and Edge >=79, but that's still a large fraction of all traffic and users.

Asking browsers to enable NEL is done by serving the HTTP response headers Report-To and NEL, which together define a set of endpoints that can receive reports, sampling fractions for failures and for successes, and a TTL for which the user's browser stores this entire policy. See Sample Policy Definitions.
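
For concreteness, a sketch of how the two header values could be constructed (all values are illustrative; the endpoint URI is the one proposed in the checklist below, and the group name is arbitrary):

    import json

    # Illustrative values only; the real endpoint, sampling fractions, and TTL
    # are chosen during rollout (see below).
    report_to_header = json.dumps({
        "group": "wm_nel",     # arbitrary name, referenced by the NEL policy below
        "max_age": 86400,      # seconds for which the browser remembers this endpoint group
        "endpoints": [{"url": "https://intake-logging.wikimedia.org/v1/events?schema_uri=/network/error/logging/1.0.0&stream=network.error"}],
    })
    nel_header = json.dumps({
        "report_to": "wm_nel",      # must match the Report-To group above
        "max_age": 86400,           # TTL of the NEL policy itself
        "failure_fraction": 0.05,   # ask for reports on 5% of failed fetches
        "success_fraction": 0.0,    # no reports for successful fetches
    })
    # These JSON strings are what get served as the Report-To: and NEL: response headers.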

Privacy concerns

See Sample Network Error Reports
Error reports are full of PII. They're sent from users' IP addresses, contain the URL the user was attempting to fetch and any Referer: from that original request, and in the future could optionally include specific request or response headers from the original request. They require TLS on the wire and deserve all the same protections that our logs data gets at rest.
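
For orientation (the linked paste has real examples), a single report has roughly this shape; the field names follow the W3C drafts, and all values here are made up:

    # One upload (Content-Type: application/reports+json) is a JSON array of
    # report objects like this one.
    sample_report = {
        "age": 3120,                        # milliseconds between the failure and the upload
        "type": "network-error",
        "url": "https://en.wikipedia.org/wiki/Main_Page",    # the URL the user was fetching (PII)
        "user_agent": "Mozilla/5.0 ... Chrome/85.0 ...",
        "body": {
            "referrer": "https://en.wikipedia.org/",         # Referer of the original request (PII)
            "sampling_fraction": 0.05,
            "server_ip": "203.0.113.10",
            "protocol": "h2",
            "method": "GET",
            "status_code": 0,
            "elapsed_time": 30000,          # milliseconds elapsed before the failure
            "phase": "connection",          # dns / connection / application
            "type": "tcp.timed_out",        # specific failure subtype
        },
    }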

v0: Minimum viable deployment:
  • Legal review
  • T259160 Privacy review
  • Configure EventGate to receive NEL reports and store them in Logstash via Kafka.
    • We'll likely use the exact same EventGate instances set up to receive client-side Javascript error reports in T226986.
    • However, EventGate-Wikimedia will have to be modified, as NEL reports don't include the EventGate-specific metadata fields it expects. @Ottomata has prepared a simple patch that ought to allow us to set a reporting endpoint URI of something like https://intake-logging.wikimedia.org/v1/events?schema_uri=/network/error/logging/1.0.0&stream=network.error
    • Determine whether or not we want additional stream processing to split apart NEL responses into their component events, as each POST made to the reporting endpoint is potentially a batch of multiple error reports. EventGate does this already!
    • Modify EventGate to be compatible with the CORS headers required by Chrome: https://github.com/wikimedia/eventgate/pull/10 and https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/623005
    • Write a schema that matches the NEL specification and verify it validates reports generated by Chrome stable.
    • Modify eventgate-logging-external's configuration to enable CORS mode T262087
    • Deploy an eventgate-wikimedia with all of these changes. T262087
    • Ensure some manually-constructed test events are making their way through EventGate to Logstash: first document
  • Begin sending Report-To and NEL headers on our responses. https://gerrit.wikimedia.org/r/c/operations/puppet/+/627629
    • The traffic layer seems like the right place to insert these headers. We should do a staged rollout, starting with a small fraction of traffic and with short TTLs, and expand once confident.
    • Construct VCL that successfully emits JSON strings as response headers (surprisingly hard)
    • Launch on group0 wiki domains
    • Launch on group1 wiki domains
    • Launch on all domains
  • Build a reasonably-nice Logstash dashboard to aggregate NELs.
v1: Improvements that aren't too hard
  • T261340 Set up a "backwards GeoDNS" hostname that routes users to a faraway datacenter, or at least, a datacenter that won't be their usual primary datacenter. Use that hostname to receive error reports.
    • Browsers are supposed to buffer reports and retry later if they can't send them the first time, but this will help us receive reports as outages are happening, not after they're resolved.
    • There are possibly other alternatives to collecting reports via other-than-usual-datacenter endpoints: T261340#6437198
  • Consider if we want NEL reports stored anywhere other than Logstash -- e.g. it might be useful to also have them in their own table in Hive.
  • T263496 Augment the reported events with geoIP country data and AS number data (either as part of some sort of stream processing, or by adding a feature to eventgate-wikimedia)
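
One possible shape for that augmentation (a sketch only: it assumes the MaxMind GeoLite2 databases and the Python geoip2 library, and the output field names are hypothetical; the real implementation is tracked in T263496 and might instead live in eventgate-wikimedia or in a stream processor):

    import geoip2.database

    # Hypothetical enrichment step: annotate a NEL event with country and AS number
    # derived from the client IP, so the raw IP need not be kept long-term.
    country_reader = geoip2.database.Reader("GeoLite2-Country.mmdb")
    asn_reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")

    def augment(event: dict, client_ip: str) -> dict:
        event["geo_country"] = country_reader.country(client_ip).country.iso_code
        asn = asn_reader.asn(client_ip)
        event["as_number"] = asn.autonomous_system_number
        event["as_org"] = asn.autonomous_system_organization
        return event
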
Harder / more open-ended future work:
  • Attempt some sort of alerting based on receiving NELs. (done; IRC alerting only)
  • Set a nonzero success_fraction to also collect latency data for GeoDNS-mapping test URLs served at the edge; use that to improve our GeoDNS assignments.
  • Consider setting up some off-WMF-infrastructure report collectors, taking extra care to keep them maintainable and secure, and to not store PII at rest there. This would all require careful planning and review, but the upside is that we could get near-realtime data about user issues even when those users are unable to reach any WMF infrastructure.

Event Timeline

There are a very large number of changes, so older changes are hidden.

e.g. it might be useful to also have them in their own table in Hive.

If we do all the EventGate stuff right, importing them into Hive will be natural (and soon more automated).

Attempt some sort of alerting based on receiving NELs.

The simplest version of this could be just throughput monitoring on the NEL topics in Kafka:
https://github.com/wikimedia/puppet/blob/e936a98f2aebe559eab6777d3b712af1079e3350/modules/monitoring/manifests/alerts/kafka_topic_throughput.pp

Determine whether or not we want additional stream processing to split apart NEL responses into their component events, as each POST made to the reporting endpoint is potentially a batch of multiple error reports.

Oo, interesting, how are they batched? If they are POSTed as an array of the same NEL object that all conform to the same schema, EventGate will treat them as individual events and post them as individual messages to Kafka.

Looking at the spec, I think that's how they're reported: https://w3c.github.io/reporting/#sample-reports

So that's great!

CDanis triaged this task as Medium priority.
CDanis renamed this task from automatically collect network error reports from users' browsers to automatically collect network error reports from users' browsers (Network Error Logging API). Jul 29 2020, 5:10 PM

Change 623005 had a related patch set uploaded (by CDanis; owner: Ottomata):
[eventgate-wikimedia@master] Update eventgate dependency for NEL

https://gerrit.wikimedia.org/r/623005

Change 623005 merged by CDanis:
[eventgate-wikimedia@master] Update eventgate dependency for NEL

https://gerrit.wikimedia.org/r/623005

Change 623067 had a related patch set uploaded (by CDanis; owner: CDanis):
[schemas/event/primary@master] Created reportingapi/report 1.0.0

https://gerrit.wikimedia.org/r/623067

Change 623067 merged by Ottomata:
[schemas/event/primary@master] Created w3c/reportingapi: report fragment & network-error

https://gerrit.wikimedia.org/r/623067

Rollout planning braindump

There are three degrees of freedom to play with here:

  1. The set of domains for which we request reports
  2. The sampling fraction we set for all of/each of those (when a user agent sees an error, how often does it create a report for that error?)
  3. The TTL we set for how long user agents will persist the above

I'm thinking the set of domains should roughly follow wiki deployment groups: they're intended for phased rollouts that roughly track wiki 'importance' / the size of the userbase affected, and they also make intuitive sense and are easy to explain.

I'm also thinking that TTLs should initially be short: say 1 hour to start with, later increasing to something on the order of a week. The tradeoff to be made with TTLs is:

  • With a long TTL, if we also set the sampling fraction too high, that policy persists in browsers for a long time, and eventgate and/or logstash could be overwhelmed in the event of a large outage.
  • With a too-short TTL, we won't get reports at all when infrequent users experience errors (their cached policy will have expired by the time they next hit an error).

We could probably do some analysis to figure out the per-user distribution of pageview inter-arrival times (or maybe that's already known? # of 1-day active users vs 7da vs 30da gives you enough of an idea), but I was thinking to keep it simple and just say "1 hour TTL to start, 1 week TTL when we're confident". The sampling fraction matters much more, anyway.

As for setting a sampling fraction, that's where things get really tricky.

  • Some domains are vastly more popular than other domains. There are also per-domain concerns, like vastly different geographical and geopolitical distributions of users, etc. In the long run, we very likely want different sampling fractions for different domains.
  • A sampling fraction that gives you visibility into interesting, small-scale events (one small ISP suffering from one of our sites being unreachable, for some reason) is very likely to produce an overwhelming flood of reports in the event of a large-scale outage (of Wikimedia's infra, or of a major transit network, etc).

I'm still not sure how best to resolve all the tensions here, and certainly not how to do so in a principled way. Thoughts are appreciated!

For now I think it could be okay to empirically set a sampling rate that seems reasonable, starting off low and then ramping up (but not too high).

Super cool work!

We could probably do some analysis to figure out the per-user distribution of pageview inter-arrival times (or maybe that's already known? # of 1-day active users vs 7da vs 30da gives you enough of an idea),

Let us know if there is any data we can get you.

Thoughts that come to mind:

For sampling ratios I think we want to look at user requests, not pageviews or devices.

Some napkin math follows:

The peak of pageviews last week was 27 million per hour. If pageviews are made up of about 20 requests on average, peak requests per second would be about 150,000 for all domains for user traffic alone (this is an overestimate).

Last quarter, Chrome (all versions) was about 22.2% and Edge (all versions) about 4.2% of *all pageview traffic*; that makes 26.4%. Extrapolating to all traffic, that leaves about 40,000 requests per second (across all domains) potentially affected by this protocol, which is a lot of potentially affected traffic. We could enable just a small wikipedia, like pt.wikipedia, which represents 2% of traffic (about 800 requests per second of Chrome plus Edge requests), at 1% sampling, which would give us 8 reports per second (again, an overestimate). That seems easy to try and unlikely to cause problems with a TTL of 1 hour. Probably 10% sampling is also easily sustainable.
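
The same arithmetic as a quick sketch (figures are the rough ones above; the results are upper bounds because they assume every sampled request fails and generates a report):

    # Napkin math, restated from the figures above (all approximate, all overestimates).
    peak_pageviews_per_hour = 27_000_000
    requests_per_pageview = 20
    peak_requests_per_sec = peak_pageviews_per_hour / 3600 * requests_per_pageview  # ~150,000

    nel_capable_share = 0.222 + 0.042   # Chrome + Edge share of pageviews, ~26.4%
    nel_capable_rps = peak_requests_per_sec * nel_capable_share                     # ~40,000

    ptwiki_share = 0.02                 # pt.wikipedia is ~2% of traffic
    failure_fraction = 0.01             # 1% sampling
    ptwiki_reports_per_sec = nel_capable_rps * ptwiki_share * failure_fraction      # ~8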

Links:

https://bit.ly/3jLkRV6
https://superset.wikimedia.org/r/314
https://superset.wikimedia.org/r/313

Screenshots:

Screen Shot 2020-09-04 at 1.58.46 PM.png, Screen Shot 2020-09-04 at 1.27.22 PM.png, Screen Shot 2020-09-04 at 1.24.13 PM.png, Screen Shot 2020-09-04 at 1.23.59 PM.png (attached)

on a long TTL, if we also set the sampling fraction too high in a record that then persists for a long time, then eventgate and/or logstash could be overwhelmed in the event of a large outage.

This (I think) is also eventgate's responsibility to prevent; we should have some throttling policies that we can apply per topic (cc @Ottomata)

Change 627364 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] logstash collectors: accept Network Error Logging reports

https://gerrit.wikimedia.org/r/627364

Mentioned in SAL (#wikimedia-operations) [2020-09-14T21:24:01Z] <cdanis> T257527 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo cumin 'R:Class ~ "(?i)profile::logstash::collector7"' 'disable-puppet "cdanis rolling out Ifa3c68e4"'

Change 627364 merged by CDanis:
[operations/puppet@production] logstash collectors: accept Network Error Logging reports

https://gerrit.wikimedia.org/r/627364

Mentioned in SAL (#wikimedia-operations) [2020-09-14T21:30:36Z] <cdanis> T257527 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo cumin 'R:Class ~ "(?i)profile::logstash::collector7"' 'enable-puppet "cdanis rolling out Ifa3c68e4"'

Change 627582 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] original logstash: accept Network Error Logging reports

https://gerrit.wikimedia.org/r/627582

Thanks very much for starting this task and all your work on it so far!

Some thoughts:

  1. For the TTL (defined by the max_age member), there seem to be two TTLs we have to think about: the TTL for Report-To, which specifies the endpoint group where the NEL reports will be sent, and the TTL for the NEL policy itself. The TTL for the endpoint can be greater than or equal to the TTL of the NEL policy. (As per the standard: "If the Reporting policy expires, NEL reports will not be delivered, even if the NEL policy has not expired.")
  2. The sampling fraction also has two members: success_fraction that takes into account the successful network requests, and failure_fraction that is concerned with failed network requests. I am trying to think if there is a use case where we may be concerned about the requests that completed successfully. Do you have such a case in mind? It's possible that I am looking at this from a censorship measurement perspective and therefore I am more concerned about the requests that didn't complete successfully -- which makes me think that we can have a larger sampling fraction for failure_fraction but a smaller one for success_fraction, if we need it at all.
  3. Nuria has commented about rate-limiting on EventGate's end, but even ignoring that for a second: given that the resource for which we are enabling NEL also shares the same infrastructure as the NEL endpoint (if this assumption is true), assuming wikimedia.org goes down, shouldn't nel-endpoint.wikimedia.org also be affected? In which case the concern is the large number of reports that are queued: can we also set retry_after for the endpoint to dictate the time after which the report should be retried? That way we can set a long TTL and a large retry_after to help manage the ingestion of reports.

Change 627582 merged by CDanis:
[operations/puppet@production] original logstash: accept Network Error Logging reports

https://gerrit.wikimedia.org/r/627582

Change 627589 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] logstash: NEL: rename overloaded body field to report_body

https://gerrit.wikimedia.org/r/627589

  1. For the TTL (defined by the max_age member), there seem to be two TTLs we have to think about: the TTL for Report-To, which specifies the endpoint group where the NEL reports will be sent, and the TTL for the NEL policy itself. The TTL for the endpoint can be greater than or equal to the TTL of the NEL policy. (As per the standard: "If the Reporting policy expires, NEL reports will not be delivered, even if the NEL policy has not expired.")

Yep! I was planning on making the two TTLs equivalent -- I think this will be fine, and it's what the current users do (P12583)

  2. The sampling fraction also has two members: success_fraction that takes into account the successful network requests, and failure_fraction that is concerned with failed network requests. I am trying to think if there is a use case where we may be concerned about the requests that completed successfully. Do you have such a case in mind? It's possible that I am looking at this from a censorship measurement perspective and therefore I am more concerned about the requests that didn't complete successfully -- which makes me think that we can have a larger sampling fraction for failure_fraction but a smaller one for success_fraction, if we need it at all.

I have only vague thoughts for how we might use success_fraction. Here's one thing I've imagined:

  • Set a high (1.0) success_fraction for some special domains which each map to a single edge datacenter.
  • Some small fraction of the time, via some Javascript, have a sample of users fetch an identical small asset from each of our N edge datacenters (pointing at each of those N domains).
  • Use those success reports to aggregate per-ASN / per-IP-block latency numbers for each of our edge DCs, and use that to improve our GeoDNS mapping.

There's a lot of handwaving and open questions embedded in the above, though 😅

For now, we'll set success_fraction: 0.

  3. Nuria has commented about rate-limiting on EventGate's end, but even ignoring that for a second: given that the resource for which we are enabling NEL also shares the same infrastructure as the NEL endpoint (if this assumption is true), assuming wikimedia.org goes down, shouldn't nel-endpoint.wikimedia.org also be affected? In which case the concern is the large number of reports that are queued: can we also set retry_after for the endpoint to dictate the time after which the report should be retried? That way we can set a long TTL and a large retry_after to help manage the ingestion of reports.

Ah, I think you got confused by the spec, which doesn't clearly differentiate between origin-provided configuration and user-agent-internal state. retry_after isn't something we publish; it's a value computed and maintained by the user-agent itself. (Same for failures.)

To address the stated assumption: I think it's incredibly unlikely that we could get so many NELs that we overwhelm the Traffic infrastructure itself*. I think we could potentially overload Logstash, but likely only temporarily, and finally I think we could easily overwhelm the eventgate-logging-external EventGate deployment running on Kubernetes, which at present is just a handful of replicas. But we can easily scale up that last piece, if needed.

(*: The rate at which NELs can be emitted by user-agents is naturally limited by the rate at which users can request pages; one NEL upload can contain many reports in its payload; NELs are 'cheaper' to serve than actual page fetches on most axes at most layers of the stack.)

Change 627589 merged by CDanis:
[operations/puppet@production] logstash: NEL: rename overloaded body field to report_body

https://gerrit.wikimedia.org/r/627589

Change 627591 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] collector7: fix omitted 'tags' stanza

https://gerrit.wikimedia.org/r/627591

Change 627591 merged by CDanis:
[operations/puppet@production] collector7: fix omitted 'type' stanza

https://gerrit.wikimedia.org/r/627591

Change 627599 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] logstash NEL: use 'tags' not 'type'

https://gerrit.wikimedia.org/r/627599

Change 627599 merged by CDanis:
[operations/puppet@production] logstash NEL: use 'tags' not 'type'

https://gerrit.wikimedia.org/r/627599

Change 627629 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Serve Network Error Logging headers on group0

https://gerrit.wikimedia.org/r/627629

Change 627629 merged by CDanis:
[operations/puppet@production] Serve Network Error Logging headers on group0

https://gerrit.wikimedia.org/r/627629

As of about 13:30 UTC today, we started serving these response headers on group0 wiki domains.

There's a surprising amount of two classes of traffic:

  • status_code=404 traffic for URLs that are hotlinked everywhere despite not existing (the most prominent of which is https://www.mediawiki.org/w/skins/common/images/cc-0.png )
  • failure type abandoned, which is when the user hits their browser's 'Stop' button, hits 'Back', or otherwise gives up on the pageload

There is a lot of the second kind of traffic. It's not tremendously diagnostic of anything for us -- well, unless we're seeing many such reports which also have a long elapsed_time, which might indicate latency/timeout issues somewhere in the stack. And there's so much abandoned that it will limit the sampling_fraction we can set on the larger wikis.

There are also a fair number of phase=application, status_code=200, error type=unknown reports. I'm not sure what these mean.

So, comparing the same 24h window (19:00 UTC Tuesday -- 19:00 UTC Wednesday) between Logstash and webrequest_sampled_128 for group0 domains:

(*: There's lots of hidden complexity here, because some of the error reports we receive are for requests that the user and/or the browser later retried, possibly after we also logged the original, failed request(s)! At the same time, there are error reports we received that correspond to requests we never received in the first place. We'll staunchly ignore all of this: it's very hard to model, and since our only goal in modeling is to pick a reasonable sampling fraction, we only need something approximate; we can tune it later; and we can cross our fingers and assume/hope that the relative ratios of all these flavors of error will stay the same. 🤞)

So, all that said, we receive reports at about 0.32% the rate we receive requests.

Looking at the webrequest table for the past week, we receive about 128*81.4M requests/day (~121K/second). 52% of this traffic is Chrome or Edge (@Nuria I think your 22.2% figure didn't include mobile Chrome...? but as it turns out, mobile does send NELs). So, if we were to sample 100% of errors on all traffic, and the above holds, it would be somewhere in the neighborhood of 200 reports/second steady-state.

We don't want to hit Logstash too hard with new events. It receives approx 92M events/day (~1065/second). We should limit # of reports to be at most a few percent of that. We also want to leave headroom for when we or ISPs have outages.

For future rollouts I am going to begin with 5% sampling. An average of 10 events/second seems pretty workable for Logstash, we'll still get a fair bit of data, and this gives us plenty of headroom in the event of a lot of temporally-correlated failures.
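
Pulling the figures from the last few comments together (a sketch; everything here is approximate):

    # Napkin math behind the 5% choice, using the figures quoted above.
    requests_per_sec = 128 * 81.4e6 / 86400     # webrequest: ~121K requests/second
    nel_capable_share = 0.52                    # Chrome + Edge, including mobile Chrome
    report_to_request_ratio = 0.0032            # observed: reports arrive at ~0.32% of the request rate

    reports_per_sec_full = requests_per_sec * nel_capable_share * report_to_request_ratio  # ~200/s at 100% sampling
    reports_per_sec_5pct = reports_per_sec_full * 0.05                                     # ~10/s at 5% sampling

    logstash_events_per_sec = 92e6 / 86400      # ~1065/s, so ~10/s of NEL is about 1% of Logstash's load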

I'll do group1 tomorrow, and will continue on to all domains (group2 plus non-wiki domains) after the weekend.

Change 629717 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Extend NEL to group1 wikis; lower sampling rate to 5%

https://gerrit.wikimedia.org/r/629717

Change 629717 merged by CDanis:
[operations/puppet@production] Extend NEL to group1 wikis; lower sampling rate to 5%

https://gerrit.wikimedia.org/r/629717

Correct, I did not include Chrome mobile, which is about 17.8% of pageviews for last month. That means that, pageview-wise, the total share of pageviews from browsers that would send us reports is 44.2%. This is still likely an underestimate, because usage of Wikipedia content that is not a pageview also matters for this purpose.

Screen Shot 2020-09-24 at 2.42.53 PM.png (attached)

Change 630597 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/deployment-charts@master] eventgate-logging-external-tls-proxy: bump CPU up

https://gerrit.wikimedia.org/r/630597

Change 630597 merged by CDanis:
[operations/deployment-charts@master] eventgate-logging-external-tls-proxy: bump CPU up

https://gerrit.wikimedia.org/r/630597

With the current 5% sampling, we're getting about 30 reports/second at peak times.

Extending NEL to all domains is going to increase our traffic by about 2.5x (assuming the same ratio of requests recorded / errors reported).

That would bring us up to something like 75 reports/sec at peak, which feels like it should be fairly manageable for Logstash (adding something like 7% to its overall load).

I think this is acceptable, but I'll be keeping an eye on things. At 5% sampling we're already seeing more error reports than I anticipated, given the previous ratio of requests/reports, so it's possible 5% is too much and we need to tune downwards.

Change 630860 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] launch Network Error Logging on all WMF domains

https://gerrit.wikimedia.org/r/630860

Change 630860 merged by CDanis:
[operations/puppet@production] launch Network Error Logging on all WMF domains

https://gerrit.wikimedia.org/r/630860

This is now live on all WMF domains.

In the event that it needs to be backed out, don't revert the patch.

Instead, continue serving the response headers, but modify the failure_fraction in the NEL: header to be 0.0.

This is because NEL-enabled browsers cache the response header for the TTL we specify (24h), and a missing response header won't affect the state of that cache in any way.
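
Concretely, the backout is a change to the served policy rather than its removal; a sketch, reusing the illustrative values from the header example in the description:

    import json

    # Keep serving the NEL header so that cached policies get replaced,
    # but stop requesting failure reports. (Values are illustrative.)
    nel_header = json.dumps({
        "report_to": "wm_nel",
        "max_age": 86400,
        "failure_fraction": 0.0,   # was 0.05; disables failure reporting
        "success_fraction": 0.0,
    })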

Change 630931 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] logstash: add throttle-exempt; don't throttle NEL or client errors

https://gerrit.wikimedia.org/r/630931

Change 630931 merged by Cwhite:
[operations/puppet@production] logstash: add throttle-exempt; don't throttle NEL or client errors

https://gerrit.wikimedia.org/r/630931

Change 679417 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Add es_exporter config for NEL events

https://gerrit.wikimedia.org/r/679417

Change 679417 merged by CDanis:

[operations/puppet@production] Add es_exporter config for NEL events

https://gerrit.wikimedia.org/r/679417

Change 685516 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Add IRC alerting for two relevant NEL subtypes

https://gerrit.wikimedia.org/r/685516

Change 685516 merged by CDanis:

[operations/puppet@production] Add IRC alerting for two relevant NEL subtypes

https://gerrit.wikimedia.org/r/685516

Change 727594 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] NEL alert is empirically high-signal & should page SRE

https://gerrit.wikimedia.org/r/727594

Change 727594 merged by CDanis:

[operations/puppet@production] NEL alert is empirically high-signal & should page SRE

https://gerrit.wikimedia.org/r/727594

Change 731166 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Filter prom-exported NEL stats to <=10min old reports

https://gerrit.wikimedia.org/r/731166

Change 731166 merged by CDanis:

[operations/puppet@production] Filter prom-exported NEL stats to <=10min old reports

https://gerrit.wikimedia.org/r/731166

Change 731171 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Add rate of high-signal NELs as a status page metric

https://gerrit.wikimedia.org/r/731171

Change 731171 merged by CDanis:

[operations/puppet@production] Add rate of high-signal NELs as a status page metric

https://gerrit.wikimedia.org/r/731171