Page MenuHomePhabricator

varnishmtail metric loss due to mtail not reading from pipe fast enough
Open, LowPublic

Assigned To
None
Authored By
ema
Oct 20 2021, 11:40 AM
Referenced Files
F34755839: varnishmtail-lines-processed.png
Nov 19 2021, 3:03 PM
F34713805: mtail-lines-total-last-week.png
Oct 28 2021, 8:19 AM
F34713783: mtail-lines-total-last-week.png
Oct 28 2021, 8:07 AM
F34713775: cp3062-varnishmtail-cpu-profile-2.png
Oct 28 2021, 8:07 AM
F34712716: varnishxcache.mtail-perf.png
Oct 27 2021, 8:24 AM
F34710237: varnishttfb.mtail.perf.png
Oct 25 2021, 1:08 PM
F34706099: cp3062-varnishmtail-cpu-profile.png
Oct 22 2021, 7:20 AM
F34701241: cp3062-varnish-2xx.png
Oct 20 2021, 11:40 AM
Subscribers

Description

On multiple cache nodes, but more so on text@esams/eqsin than others, we get frequent errors from varnishmtail. The errors are repeated occurrences of:

Oct 19 05:23:03 cp3062 varnishmtail[15013]: Log overrun
Oct 19 05:23:03 cp3062 varnishmtail[15013]: Log reacquired

The situation is at times so bad that varnishmtail crashes altogether:

Oct 19 05:57:19 cp3062 varnishmtail[15013]: Assert error in vtx_append(), vsl_dispatch.c line 457:
Oct 19 05:57:19 cp3062 varnishmtail[15013]:   Condition(i != vsl_check_e_inval) not true.
Oct 19 05:57:20 cp3062 varnishmtail[15013]: varnishncsa seems to have crashed, exiting
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Main process exited, code=exited, status=1/FAILURE
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Failed with result 'exit-code'.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Consumed 7h 40min 57.670s CPU time.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Service RestartSec=100ms expired, scheduling restart.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Scheduled restart job, restart counter is at 6928.
Oct 19 05:57:20 cp3062 systemd[1]: Stopped Varnish mtail.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Consumed 7h 40min 57.670s CPU time.
Oct 19 05:57:20 cp3062 systemd[1]: Started Varnish mtail.

Varnishmtail is nothing more than varnishncsa | mtail, so a likely cause for this issue is that mtail is not able to process the input coming from varnishncsa fast enough.

As a consequence, most of the varnish metrics produced by varnishmtail are skewed. Compare the view on 200 responses from ats-tls and varnish:

cp3062-ats-tls-2xx.png (826×1 px, 101 KB)

cp3062-varnish-2xx.png (816×1 px, 136 KB)

The overruns happen pretty uniformly throughout the week:

11:51:07 ema@cp3062.esams.wmnet:~
$ sudo journalctl -u varnishmtail.service --since '1 week ago' | awk '/overrun/ {print $1, $2}' | uniq -c
  15051 Oct 13
  22299 Oct 14
  20593 Oct 15
  16158 Oct 16
  17279 Oct 17
  19051 Oct 18
  19804 Oct 19

And so do the crashes:

11:51:18 ema@cp3062.esams.wmnet:~
$ sudo journalctl -u varnishmtail.service --since '1 week ago' | awk '/vsl_check_e_inval/ {print $1, $2}' | uniq -c
     95 Oct 13
    128 Oct 14
    121 Oct 15
     94 Oct 16
     95 Oct 17
    110 Oct 18
    107 Oct 19

There is a direct correlation between varnishmtail crashes and metric dips in the graphs, as one would expect.

The VarnishTrafficDrop alert is affected by this issue, as well as dashboards used regularly such as frontend-traffic, varnish-caching, and others.

Upstream PRs 3451 and 3468 may be of interest. See T151643 for a problem with the same symptoms but different circumstances tackled in the past.

On the mtail front, this is what 1 minute of CPU profiling looks like while the issue is happening:

cp3062-varnishmtail-cpu-profile.png (3×3 px, 1 MB)

Details

SubjectRepoBranchLines +/-
operations/puppetproduction+0 -8
operations/puppetproduction+1 -1
operations/puppetproduction+8 -0
operations/puppetproduction+18 -1
operations/puppetproduction+94 -62
operations/puppetproduction+40 -14
operations/puppetproduction+64 -28
operations/puppetproduction+3 -12
operations/puppetproduction+1 -11
operations/puppetproduction+21 -50
operations/puppetproduction+0 -11
operations/puppetproduction+2 -2
operations/puppetproduction+0 -11
operations/puppetproduction+30 -19
operations/puppetproduction+1 -24
operations/puppetproduction+15 -16
operations/puppetproduction+7 -5
operations/puppetproduction+4 -6
operations/puppetproduction+14 -56
operations/debs/varnish4debian-wmf+649 -1
operations/alertsmaster+14 -14
operations/debs/varnish4debian-wmf+68 -0
Show related patches Customize query in gerrit

Event Timeline

ema renamed this task from varnishmtail metric loss due to performance issues to varnishmtail metric loss due to mtail not reading from pipe fast enough .Oct 20 2021, 12:10 PM

Mentioned in SAL (#wikimedia-operations) [2021-10-20T12:51:38Z] <ema> cp3062: bump vsl_space from 80M (default) to 512M T293879 - varnish restart needed

Trying the lowest possible hanging fruit first, namely rising vsl_space. I've first tried setting it to 512M as mentioned in the SAL entry above, but that failed due to the size of /var/lib/varnish (512M, but it needs space for other stuff too). We now have vsl_space=480M on cp3062, let's see if that changes anything at all.

Mentioned in SAL (#wikimedia-operations) [2021-10-20T14:27:45Z] <ema> cp3062: test higher vsl_space values T293879

Mentioned in SAL (#wikimedia-operations) [2021-10-21T08:25:32Z] <ema> cp3062: revert vsl_space experiment T293879

By giving a very large amount - 3G instead of the default 80M - of vsl_space to cp3062, the issue happens less often but still does happen. On all text@esams nodes, there has been no overrun between 00:44 and 04:46 (when esams traffic is at its lowest), while on cp3062 the last overrun happened at 21:56 and the first one this EU morning at 06:38. Bumping vsl_space alone does not fix the issue.

Note that there is a very simple repro. A reader like while read h; do echo $h; done is not fast enough for varnishncsa.

# this shows about 7K rps
$ sudo /usr/bin/varnishncsa -n frontend | pv -l -a > /dev/null
# log overrun around 3K rps
$ sudo /usr/bin/varnishncsa -n frontend | while read h; do echo $h; done | pv -l -a > /dev/null

See asciinema recording.

Replacing the while with:

import sys

while True:
    line = sys.stdin.readline()
    if not line:
        break
    sys.stdout.write(line)

I get no overrun.

I've tried using a separate mtail instance with a subset of the scripts used by the production instance, namely:

  • varnisherrors.mtail
  • varnishprocessing.mtail
  • varnishreqstats.mtail
  • varnishsli.mtail
  • varnishttfb.mtail

Not included:

  • varnishrls.mtail
  • varnishxcache.mtail

The instance has been running for several minutes on cp3062 without overruns, while on the same host varnishmtail.service keeps on reporting issues. Next, we should profile mtail and see if we can improve performance by working on the scripts.

Change 732913 had a related patch set uploaded (by Ema; author: Ema):

[operations/debs/varnish4@debian-wmf] Add debian/patches/0008-vsl_check_e_inval_assertion.patch

https://gerrit.wikimedia.org/r/732913

Change 732913 merged by Ema:

[operations/debs/varnish4@debian-wmf] Add debian/patches/0008-vsl_check_e_inval_assertion.patch

https://gerrit.wikimedia.org/r/732913

Change 732925 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnishttfb.mtail: use native histogram type

https://gerrit.wikimedia.org/r/732925

Mentioned in SAL (#wikimedia-operations) [2021-10-22T08:23:48Z] <ema> cp3062: test 0008-vsl_check_e_inval_assertion.patch https://gerrit.wikimedia.org/r/c/operations/debs/varnish4/+/732913/ T293879

varnishmtail has crashed at least once on all but one text@esams nodes in the last ~10 minutes:

cp3050.esams.wmnet:    Active: active (running) since Fri 2021-10-22 12:27:36 UTC; 9min ago
cp3052.esams.wmnet:    Active: active (running) since Fri 2021-10-22 12:35:41 UTC; 1min 10s ago
cp3054.esams.wmnet:    Active: active (running) since Fri 2021-10-22 12:28:18 UTC; 8min ago
cp3056.esams.wmnet:    Active: active (running) since Fri 2021-10-22 12:23:56 UTC; 12min ago
cp3058.esams.wmnet:    Active: active (running) since Fri 2021-10-22 12:29:22 UTC; 7min ago
cp3060.esams.wmnet:    Active: active (running) since Fri 2021-10-22 12:29:24 UTC; 7min ago
cp3064.esams.wmnet:    Active: active (running) since Fri 2021-10-22 12:36:29 UTC; 22s ago

As expected, it's still up on cp3062:

cp3062.esams.wmnet:    Active: active (running) since Fri 2021-10-22 08:25:36 UTC; 4h 11min ago

Action items for next week:

(1) I think we should apply 0008-vsl_check_e_inval_assertion.patch everywhere and consider 3468 too.

On the mtail front it's clear from profile data that most of CPU time is spent on backtracking regular expressions:

cp3062-varnishmtail-cpu-profile.png (3×3 px, 1 MB)

(2) see if any of the mtail scripts can be optimized, in particular the regex parts. Looking at log line processing times it seems that varnishrls.mtail and varnishxcache.mtail may be good starting points

(3) Merge https://gerrit.wikimedia.org/r/732925 and see the performance impact on varnishttfb.mtail

Once this is fixed we will also need to alert on log overruns.

Change 732970 had a related patch set uploaded (by Ema; author: Ema):

[operations/alerts@master] Use ats-tls metrics for edge traffic drop alert

https://gerrit.wikimedia.org/r/732970

Change 732970 merged by Ema:

[operations/alerts@master] Use ats-tls metrics for edge traffic drop alert

https://gerrit.wikimedia.org/r/732970

Change 734249 had a related patch set uploaded (by Ema; author: Ema):

[operations/debs/varnish4@debian-wmf] Add 0009-vsl-perf-stability.patch

https://gerrit.wikimedia.org/r/734249

Mentioned in SAL (#wikimedia-operations) [2021-10-25T11:57:09Z] <ema> deployment-cache-text06: upgrade varnish to 6.0.8-1wm2 T293879

Mentioned in SAL (#wikimedia-operations) [2021-10-25T12:04:09Z] <ema> cp3062: upgrade varnish to 6.0.8-1wm2 T293879

Change 734249 abandoned by Ema:

[operations/debs/varnish4@debian-wmf] Add 0009-vsl-perf-stability.patch

Reason:

No visible improvements for varnishmtail

https://gerrit.wikimedia.org/r/734249

Change 732925 merged by Ema:

[operations/puppet@production] varnishttfb.mtail: use native histogram type

https://gerrit.wikimedia.org/r/732925

Change 732925 merged by Ema:

[operations/puppet@production] varnishttfb.mtail: use native histogram type

https://gerrit.wikimedia.org/r/732925

Using the builtin histogram type in varnishttfb.mtail did improve performance, from 16.7 microseconds per line to ~15.5. Not enough to make a huge difference, but still good to know for other histograms we're using elsewhere (eg: trafficserver_tls_client_ttfb, trafficserver_tls_client_total_time).

varnishttfb.mtail.perf.png (819×1 px, 68 KB)

Mentioned in SAL (#wikimedia-operations) [2021-10-26T08:33:04Z] <ema> upload varnish_6.0.8-1wm2 to component/varnish6 on apt.wm.org T293879

Change 734893 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnishxcache.mtail: avoid unnecessary filtering

https://gerrit.wikimedia.org/r/734893

Change 734893 merged by Ema:

[operations/puppet@production] varnishxcache.mtail: avoid unnecessary filtering

https://gerrit.wikimedia.org/r/734893

Change 734893 merged by Ema:

[operations/puppet@production] varnishxcache.mtail: avoid unnecessary filtering

https://gerrit.wikimedia.org/r/734893

This change reduced the line processing time of varnishxcache.mtail from ~36 to 15 microseconds.

varnishxcache.mtail-perf.png (822×1 px, 79 KB)

Change 734970 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnishreqstats.mtail: remove wildcard match

https://gerrit.wikimedia.org/r/734970

Change 734970 merged by Ema:

[operations/puppet@production] varnishreqstats.mtail: remove wildcard match

https://gerrit.wikimedia.org/r/734970

The optimizations to varnishxcache.mtail and varnishreqstats.mtail paid off, time spent in tryBacktrack has decreased significantly:

cp3062-varnishmtail-cpu-profile-2.png (3×3 px, 1 MB)

With these changes we can now process ~8K lines per second compared to ~6K last week.

mtail-lines-total-last-week.png (819×1 px, 56 KB)

Next steps:

  • upgrade all nodes to varnish 6.0.8-1wm2, which includes our backport of https://github.com/varnishcache/varnish-cache/pull/3451/ to stop varnishncsa from crashing
  • look for other optimizations to mtail scripts
  • move varnishprocessing.mtail (and possibly others) to another varnishmtail instance
  • alert on overruns

Mentioned in SAL (#wikimedia-operations) [2021-10-28T08:30:55Z] <ema> A:cp start rolling varnish upgrades to 6.0.8-1wm2 T293879

Change 735383 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnishrls.mtail: various optimizations

https://gerrit.wikimedia.org/r/735383

Change 735383 merged by Ema:

[operations/puppet@production] varnishrls.mtail: various optimizations

https://gerrit.wikimedia.org/r/735383

Change 737336 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: install varnishxcache like other mtail programs

https://gerrit.wikimedia.org/r/737336

Change 737336 merged by Ema:

[operations/puppet@production] varnish: install varnishxcache like other mtail programs

https://gerrit.wikimedia.org/r/737336

Change 737385 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: use systemd template files for varnishmtail

https://gerrit.wikimedia.org/r/737385

Change 737385 merged by Ema:

[operations/puppet@production] varnish: use systemd template files for varnishmtail

https://gerrit.wikimedia.org/r/737385

Change 737417 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: remove ensure-absent for varnishmtail

https://gerrit.wikimedia.org/r/737417

Change 737424 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: add varnish::logging::mtail

https://gerrit.wikimedia.org/r/737424

Change 737417 merged by Ema:

[operations/puppet@production] varnish: remove ensure-absent for varnishmtail

https://gerrit.wikimedia.org/r/737417

Change 737621 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: notify varnishmtail@default

https://gerrit.wikimedia.org/r/737621

Change 737451 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: remove ensure-absent for varnishmtail

https://gerrit.wikimedia.org/r/737451

Change 737621 merged by Ema:

[operations/puppet@production] varnish: notify varnishmtail@default

https://gerrit.wikimedia.org/r/737621

Change 737451 merged by Ema:

[operations/puppet@production] varnish: remove ensure-absent for varnishmtail

https://gerrit.wikimedia.org/r/737451

Change 737634 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish::logging: add new parameter mtail_programs

https://gerrit.wikimedia.org/r/737634

Change 737634 merged by Ema:

[operations/puppet@production] varnish::logging: pass list of mtail programs from profile

https://gerrit.wikimedia.org/r/737634

Change 737670 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish::logging: remove statsd_host and mtail_progs

https://gerrit.wikimedia.org/r/737670

Change 737670 merged by Ema:

[operations/puppet@production] varnish::logging: remove statsd_host and mtail_progs

https://gerrit.wikimedia.org/r/737670

Change 737734 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: remove support for mtail_additional_args

https://gerrit.wikimedia.org/r/737734

Change 737734 merged by Ema:

[operations/puppet@production] varnish: remove support for mtail_additional_args

https://gerrit.wikimedia.org/r/737734

Change 737424 merged by Ema:

[operations/puppet@production] varnish: add varnish::logging::mtail

https://gerrit.wikimedia.org/r/737424

Change 738910 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: add varnishmtail-wrapper

https://gerrit.wikimedia.org/r/738910

Change 738949 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: move internal mtail scripts to another instance

https://gerrit.wikimedia.org/r/738949

Change 738910 merged by Ema:

[operations/puppet@production] varnish: add varnishmtail-wrapper

https://gerrit.wikimedia.org/r/738910

Mentioned in SAL (#wikimedia-operations) [2021-11-16T10:21:08Z] <ema> A:cp re-enable puppet after successful test on cp402[17] T293879

Change 739227 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] prometheus:ops: add varnishmtail-internal jobs

https://gerrit.wikimedia.org/r/739227

Change 739229 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: remove internal mtail scripts from default instance

https://gerrit.wikimedia.org/r/739229

Change 738949 merged by Ema:

[operations/puppet@production] varnish: move internal mtail scripts to another instance

https://gerrit.wikimedia.org/r/738949

Change 739227 merged by Ema:

[operations/puppet@production] prometheus:ops: add varnishmtail-internal jobs

https://gerrit.wikimedia.org/r/739227

Change 739229 merged by Ema:

[operations/puppet@production] varnish: remove internal mtail scripts from default instance

https://gerrit.wikimedia.org/r/739229

Change 739558 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: fix path to default mtail scripts to remove

https://gerrit.wikimedia.org/r/739558

Change 739558 merged by Ema:

[operations/puppet@production] varnish: fix path to default mtail scripts to remove

https://gerrit.wikimedia.org/r/739558

Change 739560 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: remove code used to clean up old mtail scripts

https://gerrit.wikimedia.org/r/739560

Change 739560 merged by Ema:

[operations/puppet@production] varnish: remove code used to clean up old mtail scripts

https://gerrit.wikimedia.org/r/739560

The situation has improved significantly, we are now processing up to 13K lines per second vs the ~8K plateau from last week:

varnishmtail-lines-processed.png (819×1 px, 186 KB)

We should leave things as they are over the weekend and double-check on Monday if we are still having any overruns, but so far things look stable.

ema lowered the priority of this task from High to Low.Dec 6 2021, 9:10 AM

In the last 24 hours we had just one overrun on 4 nodes:

Dec 05 20:59:55 cp3060 varnishmtail-default[18982]: Log overrun                
Dec 05 12:48:30 cp3052 varnishmtail-default[20809]: Log overrun                
Dec 05 21:34:06 cp3055 varnishmtail-default[18106]: Log overrun                
Dec 05 17:38:36 cp1080 varnishmtail-default[47398]: Log overrun

cp1080 and cp3055 are upload, while cp3052 and cp3060 are text. There's probably some additional room for improvement when it comes to the performance of varnishmedia.mtail (upload-specific), but at this point it seems reasonable to conclude that the bulk of the issue has been solved, and what is left to be done is mostly alerting work to ensure it does not come back.

Change 752153 had a related patch set uploaded (by Ema; author: Ema):

[operations/debs/varnish4@debian-wmf] Release 6.0.9-1wm1

https://gerrit.wikimedia.org/r/752153