
NFS on dataset1001 overloaded, high load on the hosts that mount it
Closed, Resolved (Public)

Description

Starting at around 15:00 UTC today, the loadavg on dataset1001 rose to around 10, maxing out the available 8 CPUs.
From dmesg it seems that the NFS process was blocked shortly after 15:00 UTC, but it didn't throw this error again later on. The timing coincides with a spike in load; see Grafana.

[Jul 4 15:07] INFO: task nfsd:1764 blocked for more than 120 seconds.
[  +0.019962]       Tainted: G          I E   4.9.0-0.bpo.2-amd64 #1
[  +0.019000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.023942] nfsd            D    0  1764      2 0x00000000
[  +0.000004]  ffff916bdabdfc00 0000000000000000 ffff916bd7f28dc0 ffff916bd7e2c380
[  +0.000002]  ffff916befc98700 ffffafe781ffb9f0 ffffffff949f668d ffff916bd7fa1180
[  +0.000001]  ffff916bd7e2c380 0000000000000000 ffffafe781ffbc08 ffff916bd7e2c380
[  +0.000002] Call Trace:
[  +0.000008]  [<ffffffff949f668d>] ? __schedule+0x23d/0x6d0
[  +0.000002]  [<ffffffff949f6b52>] ? schedule+0x32/0x80
[  +0.000001]  [<ffffffff949f9c69>] ? rwsem_down_write_failed+0x229/0x370
[  +0.000005]  [<ffffffff94738683>] ? call_rwsem_down_write_failed+0x13/0x20
[  +0.000001]  [<ffffffff949f8e19>] ? down_write+0x29/0x40
[  +0.000063]  [<ffffffffc0805bed>] ? xfs_file_buffered_aio_write+0x5d/0x270 [xfs]
[  +0.000004]  [<ffffffff94689cc4>] ? find_acceptable_alias+0x24/0x100
[  +0.000024]  [<ffffffffc0805f0b>] ? xfs_file_write_iter+0x10b/0x150 [xfs]
[  +0.000004]  [<ffffffff94602cb0>] ? do_iter_readv_writev+0xb0/0x130
[  +0.000002]  [<ffffffff9460370c>] ? do_readv_writev+0x1ac/0x240
[  +0.000024]  [<ffffffffc0805e00>] ? xfs_file_buffered_aio_write+0x270/0x270 [xfs]
[  +0.000002]  [<ffffffff946006c8>] ? do_dentry_open+0x238/0x2e0
[  +0.000015]  [<ffffffffc099fe77>] ? nfsd_vfs_write+0xd7/0x3c0 [nfsd]
[  +0.000006]  [<ffffffffc09a2124>] ? nfsd_write+0x124/0x310 [nfsd]
[  +0.000007]  [<ffffffffc09a8078>] ? nfsd3_proc_write+0xb8/0x150 [nfsd]
[  +0.000005]  [<ffffffffc099b2bd>] ? nfsd_dispatch+0xad/0x1f0 [nfsd]
[  +0.000028]  [<ffffffffc0691238>] ? svc_process_common+0x448/0x670 [sunrpc]
[  +0.000009]  [<ffffffffc06922d4>] ? svc_process+0xf4/0x1a0 [sunrpc]
[  +0.000006]  [<ffffffffc099ad29>] ? nfsd+0xe9/0x160 [nfsd]
[  +0.000005]  [<ffffffffc099ac40>] ? nfsd_destroy+0x60/0x60 [nfsd]
[  +0.000002]  [<ffffffff944974c0>] ? kthread+0xe0/0x100
[  +0.000002]  [<ffffffff9442476b>] ? __switch_to+0x2bb/0x700
[  +0.000001]  [<ffffffff944973e0>] ? kthread_park+0x60/0x60
[  +0.000002]  [<ffffffff949fb675>] ? ret_from_fork+0x25/0x30
[  +0.000002] INFO: task nfsd:1765 blocked for more than 120 seconds.
...

As a consequence, the hosts that mount its NFS export (stat100[2-3], snapshot100[1,5-7]) are showing a skyrocketing loadavg that is reaching 1000, although it is an artificial load caused by the unresponsive NFS mountpoint.
Those hosts are either logging things like:

nfs: server dataset1001.wikimedia.org not responding, still trying

or they have processes blocked:

INFO: task php5:6093 blocked for more than 120 seconds

The only alarm that we got was "Puppet not running for 6h" on the stat100[2-3],snapshot100[1,5-7] hosts; no alarm was triggered for dataset1001 itself.

After a quick chat on IRC, and in the absence of anyone around who knows the dumps process well, it was decided to leave things as they are and not attempt a restart of the NFS server, to avoid additional disruption of the ongoing dumps/rsyncs/etc. Also, no page was triggered from Icinga, just IRC notifications.

I'm ack'ing the related alarms on Icinga with a reference to this task.

Event Timeline

Volans triaged this task as High priority. Jul 4 2017, 9:38 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-04T21:40:50Z] <volans> ACK'ed puppet not running on stat100[2-3],snapshot100[1,5-7] due to NFS overloaded on dataset1001 - T169680

It shows that it is using 4.9.0-0.bpo.2-amd64; labs found problems with that kernel, and downgrading to the default one in jessie fixed it. (They also had high load with 4.9.)

The dump processes were all hung for hours. I restarted nfs-kernel-server on dataset1001 and shot the datasets processes on snapshot100[5-7], but they are still hung. Load averages remain the same. We might have to reboot some boxes in order to clear this up. I'll look at this tomorrow morning; if others have ideas of things to try in the meantime, feel free.

You could try to force unmount the NFS partition from the hosts that mount it; that should abort the outstanding I/O they are waiting for. Also, processes that are in TASK_KILLABLE state should be killable.
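
A minimal sketch of what that could look like on one of the clients, assuming /mnt/data is the mountpoint (as used elsewhere in this task):

# force-unmount the dead NFS mount; if that also hangs, fall back to a lazy
# unmount, which detaches the mountpoint immediately and cleans up later
umount -f /mnt/data || umount -l /mnt/data

# processes stuck in TASK_KILLABLE on the mount should go away with SIGKILL,
# e.g. via fuser (-k sends SIGKILL by default, -m selects users of the mount)
fuser -km /mnt/data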

@fgiunchedi FYI: doing some cleanup with @ArielGlenn, we discovered that most of the load was generated by the prometheus-node-exporter:

3007 ?        Dsl   44:58 /usr/bin/prometheus-node-exporter -collector.diskstats.ignored-devices=^(ram|loop|fd)\\d+$ -collector.textfile.directory=/var/lib/prometheus/node.d -collectors.enabled=diskstats,filefd,filesystem,hwmon,loadavg,mdadm,meminfo,netdev,netstat,sockstat,stat,tcpstat,textfile,time,uname

Also, its VIRT memory usage was very high (~40G), so I guess it was spawning new threads for each retry or something similar; you might want to have a look.
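
If it happens again, a quick way to check that theory (the commands below are my own suggestion; the PID is the one from the ps output above):

# count how many OS threads the exporter has piled up; one stuck thread per
# blocked scrape would line up with the very high VIRT
ls /proc/3007/task | wc -l

# thread states: a heap of 'D' (uninterruptible sleep) would point at the
# filesystem collector blocking on the dead NFS mountpoint
ps -L -o stat= -p 3007 | sort | uniq -c

# once the mount is recovered (or lazily unmounted), bounce the exporter
systemctl restart prometheus-node-exporter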

Time for a summary of what's been going on. With help from volans, akosiaris and moritzm, I tried various things like unmounting the filesystem after shooting any processes that would die. The php sessionclean job, which runs out of cron, had a lot of processes going; those were killable, but the dump processes were not. umount -l or umount -f just hung. I finally went with fusermount -uz /mnt/data, but while it removed the entry from the mount table on the clients, the load on dataset1001 was still around 10.

Rmmod on dataset1001 was going to involve 6 or 7 different modules, so we finally just rebooted it. After it came back up and I re-enabled puppet, I first rebooted snapshot1001 and then snapshot100[5-7] after a long delay (had to power cycle them eventually). I enabled puppet on the snapshot hosts. It failed on snapshot100[5-7], saying that the umount of /mnt/data failed. Same load and blocking issue as earlier.

The new incident was at 10:54 UTC.

Syslogs show the following on dataset1001:

Jul  5 10:23:52 dataset1001 rpc.mountd[1764]: authenticated mount request from 10.64.48.141:781 for /data (/data)

That's snapshot1007. Snapshot1007 wasn't powercycled until 11:13:34, so it was still up and could have issued such a request; but puppet was disabled and the filesystem had previously been unmounted, so how?

Jul  5 10:34:52 dataset1001 rpc.mountd[1764]: authenticated mount request from 10.64.0.120:1013 for /data (/data)

That's snapshot1001.

First puppet run after enabling started with

Jul  5 10:23:22 dataset1001 puppet-agent[2145]: Retrieving pluginfacts

ending with

Jul  5 10:23:54 dataset1001 puppet-agent[2145]: Finished catalog run in 16.54 seconds

and I saw nothing useful there.

Second (cron) run ended with

Jul  5 10:29:18 dataset1001 puppet-agent[4028]: Finished catalog run in 12.20 seconds

and was even less eventful.

Here's the part of the syslog around the event:

Jul  5 10:37:16 dataset1001 systemd[1]: Starting Cleanup of Temporary Directories...
Jul  5 10:37:16 dataset1001 systemd[1]: Started Cleanup of Temporary Directories.
Jul  5 10:38:37 dataset1001 sm-notify[1585]: Unable to notify snapshot1007.eqiad.wmnet, giving up
Jul  5 10:39:03 dataset1001 systemd-timesyncd[1459]: interval/delta/delay/jitter/drift 1024s/-0.006s/0.036s/0.006s/+20ppm
Jul  5 10:41:01 dataset1001 CRON[7732]: (datasets) CMD (bash -c '/usr/bin/rsync -rt --delete --chmod=go-w stat1002.eqiad.wmnet::hdfs-archive/mediacounts/ /data/xmldatadumps/public/other/mediacounts/')
Jul  5 10:41:01 dataset1001 CRON[7731]: (datasets) CMD (/usr/bin/rsync -rt --delete --chmod=go-w stat1002.eqiad.wmnet::hdfs-archive/mediacounts/ /data/xmldatadumps/public/other/mediacounts/)
Jul  5 10:45:01 dataset1001 CRON[8520]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul  5 10:51:01 dataset1001 CRON[9684]: (datasets) CMD (bash -c '/usr/bin/rsync -rt --delete --chmod=go-w stat1002.eqiad.wmnet::hdfs-archive/{pageview,projectview}/legacy/hourly/ /data/xmldatadumps/public/other/pageviews/')
Jul  5 10:53:27 dataset1001 kernel: [ 1879.131688] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jul  5 10:55:01 dataset1001 CRON[10470]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul  5 10:56:07 dataset1001 systemd-timesyncd[1459]: interval/delta/delay/jitter/drift 2048s/-0.008s/0.036s/0.008s/+18ppm
Jul  5 10:56:24 dataset1001 kernel: [ 2055.251344] INFO: task nfsd:1747 blocked for more than 120 seconds.
Jul  5 10:56:24 dataset1001 kernel: [ 2055.257679]       Tainted: G          I     4.9.0-0.bpo.3-amd64 #1
Jul  5 10:56:24 dataset1001 kernel: [ 2055.264036] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272217] nfsd            D    0  1747      2 0x00000000
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272221]  ffff93561a79a800 0000000000000000 ffff93561d02b100 ffff935617584fc0
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272223]  ffff93562fc587c0 ffffa23c021cf9f0 ffffffffac40089d 0000000000000000
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272224]  ffff93562fc992c0 000000002fc592c0 ffff935617584fc0 ffff935617584fc0
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272226] Call Trace:
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272232]  [<ffffffffac40089d>] ? __schedule+0x23d/0x6d0
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272234]  [<ffffffffac400d62>] ? schedule+0x32/0x80
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272236]  [<ffffffffac403e79>] ? rwsem_down_write_failed+0x229/0x370
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272238]  [<ffffffffac12fa66>] ? __radix_tree_lookup+0x76/0xe0
Jul  5 10:56:24 dataset1001 kernel: [ 2055.272240]  [<ffffffffac138593>] ? call_rwsem_down_write_failed+0x13/0x20

On the snapshot hosts for that time frame we have:

On snapshot1001, after the manual puppet run, the first run from cron gave:

Jul  5 10:56:01 snapshot1001 CRON[7361]: (root) CMD (/usr/local/sbin/puppet-run > /dev/null 2>&1)
Jul  5 10:56:10 snapshot1001 puppet-agent[7461]: Sleeping for 55 seconds (splay is enabled)
Jul  5 10:56:43 snapshot1001 kernel: [ 1334.959403] nfs: server dataset1001.wikimedia.org not responding, still trying
Jul  5 10:57:03 snapshot1001 kernel: [ 1354.853544] nfs: server dataset1001.wikimedia.org not responding, still trying
Jul  5 10:57:05 snapshot1001 puppet-agent[7461]: Retrieving pluginfacts

There's nothing of interest in the syslogs on 5,6,7 around the time of the lockup.

The current kernel on dataset1001 is 4.9.25-1~bpo8+3.

Forgot to say: looking at the atop logs from around the time of the second incident shows nothing exceptional either.

Faidon adjusted /proc/sys/vm/min_free_kbytes on dataset1001 from 262144 to 2097152
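
For reference, the runtime change amounts to the following (the permanent version landed later via the puppet change further down):

# one-off bump of the kernel's free-memory reserve; values are the ones used here
sysctl -w vm.min_free_kbytes=2097152
# equivalent to
echo 2097152 > /proc/sys/vm/min_free_kbytes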

I've re-enabled puppet on snapshot1001 and snapshot1005, restarted the prometheus node exporter over there, and we're watching to see what happens.

Also re-enabled puppet on dataset1001.

And finally, re-enabled puppet on stat1003, though puppet is currently broken over there due to d537148c75f8ad2d2eb15f99374fd843ef968d54 (it requires python-yaml, which is already required via /etc/puppet/modules/statistics/manifests/packages.pp), so I mounted /mnt/data manually.
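
(The manual mount is presumably the equivalent of what puppet would have set up, something along the lines of

mount -t nfs dataset1001.wikimedia.org:/data /mnt/data

or just "mount /mnt/data" if the fstab entry is already in place; the exact mount options are puppet-managed, so this is only illustrative.)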

So far dataset1001 still behaving, dump run beginning to pick back up. It's now late enough that I need to get some sleep. Hopefully everything will remain quiet.

These numbers aren't awesome but I don't know what they've been in the past:

root@dataset1001:~# cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.001 0.009 0.017 0.033 0.033 0.098 0.227 
Node 0, zone    DMA32 0.000 0.005 0.052 0.125 0.282 0.532 0.715 0.841 0.928 0.983 0.993 
Node 0, zone   Normal 0.000 0.003 0.092 0.212 0.300 0.499 0.670 0.805 0.891 0.931 0.953

If we do have issues again, I will recommend we move the rsyncs to external mirrors and to labstore, off to ms1001, so that only limited rsyncs (pageviews, kiwix, mediacounts, etc) are done from dataset1001, and see if that helps ease the pressure some.

As volans rightly points out, we should be alerting for nfs lockups of this sort on the dataset servers. First take on a patchset here: https://gerrit.wikimedia.org/r/#/c/363548/
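
For the record, the idea is roughly the following (a minimal sketch only, with made-up thresholds; the real check is the one in the patchset above):

#!/bin/bash
# Alert when the aggregate CPU usage of the nfsd kernel threads gets out of hand.
# NRPE-style exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
# Note: ps %cpu is a lifetime average; a production check would more likely
# sample deltas over a short interval, but this is enough to show the intent.
WARN=${1:-200}   # warn above 200% aggregate nfsd CPU (hypothetical threshold)
CRIT=${2:-400}   # go critical above 400% (hypothetical threshold)
usage=$(ps -C nfsd -o %cpu= | awk '{s += $1} END {printf "%d", s}')
if [ "$usage" -ge "$CRIT" ]; then
    echo "CRITICAL: nfsd using ${usage}% CPU"; exit 2
elif [ "$usage" -ge "$WARN" ]; then
    echo "WARNING: nfsd using ${usage}% CPU"; exit 1
else
    echo "OK: nfsd using ${usage}% CPU"; exit 0
fi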

Change 363574 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] increase min_free_kbytes on dataset hosts permanently

https://gerrit.wikimedia.org/r/363574

Change 363574 merged by ArielGlenn:
[operations/puppet@production] increase min_free_kbytes on dataset hosts permanently

https://gerrit.wikimedia.org/r/363574

We had a recurrence of the same issue, with a very similar stack trace and the same consequences:

[Oct 7 23:27] INFO: task nfsd:16015 blocked for more than 120 seconds.
[  +0.006622]       Tainted: G          I     4.9.0-0.bpo.3-amd64 #1
[  +0.006354] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.007989] nfsd            D    0 16015      2 0x00000000
[  +0.000004]  ffff9f016835c800 0000000000000000 ffff9f0300fa02c0 ffff9f01f477af80
[  +0.000002]  ffff9f032fc187c0 ffffab9a8322f9f0 ffffffffaa40089d 0000000000000000
[  +0.000001]  ffff9f032fdd92c0 000000002fc192c0 ffff9f01f477af80 ffff9f01f477af80
[  +0.000002] Call Trace:
[  +0.000006]  [<ffffffffaa40089d>] ? __schedule+0x23d/0x6d0
[  +0.000002]  [<ffffffffaa400d62>] ? schedule+0x32/0x80
[  +0.000002]  [<ffffffffaa403e79>] ? rwsem_down_write_failed+0x229/0x370
[  +0.000002]  [<ffffffffaa12fa66>] ? __radix_tree_lookup+0x76/0xe0
[  +0.000002]  [<ffffffffaa138593>] ? call_rwsem_down_write_failed+0x13/0x20
[  +0.000002]  [<ffffffffaa403029>] ? down_write+0x29/0x40
[  +0.000056]  [<ffffffffc0856192>] ? xfs_file_buffered_aio_write+0x72/0x2c0 [xfs]
[  +0.000024]  [<ffffffffc08564eb>] ? xfs_file_write_iter+0x10b/0x150 [xfs]
[  +0.000004]  [<ffffffffaa002d60>] ? do_iter_readv_writev+0xb0/0x130
[  +0.000001]  [<ffffffffaa0037bc>] ? do_readv_writev+0x1ac/0x240
[  +0.000024]  [<ffffffffc08563e0>] ? xfs_file_buffered_aio_write+0x2c0/0x2c0 [xfs]
[  +0.000002]  [<ffffffffaa000766>] ? do_dentry_open+0x246/0x300
[  +0.000012]  [<ffffffffc09b1f17>] ? nfsd_vfs_write+0xd7/0x3c0 [nfsd]
[  +0.000005]  [<ffffffffc09b1a38>] ? nfsd_open+0x108/0x1f0 [nfsd]
[  +0.000006]  [<ffffffffc09b4254>] ? nfsd_write+0x124/0x310 [nfsd]
[  +0.000006]  [<ffffffffc09ba1b8>] ? nfsd3_proc_write+0xb8/0x150 [nfsd]
[  +0.000005]  [<ffffffffc09ad2d3>] ? nfsd_dispatch+0xc3/0x250 [nfsd]
[  +0.000020]  [<ffffffffc067c265>] ? svc_process_common+0x475/0x680 [sunrpc]
[  +0.000010]  [<ffffffffc067d2e4>] ? svc_process+0xf4/0x1a0 [sunrpc]
[  +0.000005]  [<ffffffffc09acd29>] ? nfsd+0xe9/0x160 [nfsd]
[  +0.000004]  [<ffffffffc09acc40>] ? nfsd_destroy+0x60/0x60 [nfsd]
[  +0.000003]  [<ffffffffa9e97520>] ? kthread+0xf0/0x110
[  +0.000002]  [<ffffffffa9e2476b>] ? __switch_to+0x2bb/0x700
[  +0.000001]  [<ffffffffa9e97430>] ? kthread_park+0x60/0x60
[  +0.000002]  [<ffffffffaa405875>] ? ret_from_fork+0x25/0x30

Mentioned in SAL (#wikimedia-operations) [2017-10-08T11:28:21Z] <volans> ack-ed puppet not running on stat100[5-6],snapshot100[1,5-7] due to NFSD stuck on dataset1001 - T169680

Mentioned in SAL (#wikimedia-operations) [2017-10-08T21:40:23Z] <hoo> Killed "dumpwikidatardf.sh truthy nt" (wikidata truthy dump) on snapshot1007, got stuck after T169680.

Some info from this morning:

elukey@dataset1001:~$ uptime
 08:43:39 up 95 days, 19:23,  1 user,  load average: 10.56, 10.56, 10.66

elukey@dataset1001:~$ ps auxff | grep nfs
root      1551  0.0  0.0      0     0 ?        S<   Jul05   0:00  \_ [nfsiod]
root     16013  0.0  0.0      0     0 ?        S<   Sep26   0:00  \_ [nfsd4_callbacks]
root     16015  0.1  0.0      0     0 ?        D    Sep26  30:57  \_ [nfsd]
root     16016  0.2  0.0      0     0 ?        D    Sep26  47:26  \_ [nfsd]
root     16017  0.4  0.0      0     0 ?        D    Sep26  84:23  \_ [nfsd]
root     16018  0.9  0.0      0     0 ?        D    Sep26 164:13  \_ [nfsd]
root     16019  1.7  0.0      0     0 ?        D    Sep26 311:32  \_ [nfsd]
root     16020  2.8  0.0      0     0 ?        D    Sep26 514:12  \_ [nfsd]
root     16021 15.9  0.0      0     0 ?        R    Sep26 2889:47  \_ [nfsd]
root     16022  6.8  0.0      0     0 ?        D    Sep26 1238:45  \_ [nfsd]

elukey@dataset1001:~$ sudo cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.001 0.009 0.017 0.033 0.033 0.098 0.227
Node 0, zone    DMA32 0.000 0.118 0.264 0.371 0.459 0.627 0.803 0.910 0.974 1.000 1.000
Node 0, zone   Normal 0.000 0.021 0.204 0.303 0.364 0.520 0.696 0.801 0.872 0.934 0.985

elukey@dataset1001:~$ sudo cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32  22334  13787   5056   2218   2067   1038    316     94     19      0      0
Node 0, zone   Normal  13281  12435  15576   4879   6075   3454   1034    353    151     63      9
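
As a side note, the unusable_index above can be recomputed from /proc/buddyinfo: column j is the fraction of free memory sitting in blocks too small to satisfy an order-j (2^j contiguous pages) allocation, so values close to 1.000 at the higher orders mean the free memory is badly fragmented. A quick sketch that reproduces the kernel's numbers (truncating to three decimals as the kernel does):

awk '/^Node/ {
    total = 0
    for (o = 0; o <= 10; o++) { nfree[o] = $(5 + o); total += nfree[o] * 2^o }
    if (total == 0) next
    printf "%s %s %s %8s", $1, $2, $3, $4
    cum = total                        # free pages in blocks of order >= j
    for (o = 0; o <= 10; o++) {
        v = int((total - cum) * 1000 / total)
        printf " %d.%03d", int(v / 1000), v % 1000
        cum -= nfree[o] * 2^o
    }
    print ""
}' /proc/buddyinfo

For example, for the DMA zone above this gives (3975 - 3*1024) / 3975 = 0.227 at order 10, matching the extfrag output.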

elukey@dataset1001:~$ sudo slabtop --once -s l
 Active / Total Objects (% used)    : 539292 / 666086 (81.0%)
 Active / Total Slabs (% used)      : 59254 / 59260 (100.0%)
 Active / Total Caches (% used)     : 84 / 137 (61.3%)
 Active / Total Size (% used)       : 223651.99K / 239779.34K (93.3%)
 Minimum / Average / Maximum Object : 0.02K / 0.36K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
271621 268888  98%    0.56K  38803        7    155212K radix_tree_node
100065  99586  99%    0.19K   4765       21     19060K kmalloc-192
  9828   8831  89%    1.00K   2457        4      9828K xfs_inode
 40929  23312  56%    0.19K   1949       21      7796K dentry
 13146  13111  99%    0.57K   1878        7      7512K inode_cache
  5408   4576  84%    1.00K   1352        4      5408K kmalloc-1024
 34242  20328  59%    0.10K    878       39      3512K buffer_head
  5552   4742  85%    0.50K    694        8      2776K kmalloc-512
 22848  22798  99%    0.12K    672       34      2688K kernfs_node_cache
 60636  14600  24%    0.03K    489      124      1956K kmalloc-32
  1362    830  60%    1.05K    454        3      1816K ext4_inode_cache
   444    444 100%    4.00K    444        1      1776K kmalloc-4096
  8600   7305  84%    0.20K    430       20      1720K vm_area_struct
  5760   3348  58%    0.25K    360       16      1440K kmalloc-256
 21440   7086  33%    0.06K    335       64      1340K kmalloc-64
 10528   5504  52%    0.12K    329       32      1316K kmalloc-96
  1644   1428  86%    0.62K    274        6      1096K proc_inode_cache
  2580   1671  64%    0.38K    258       10      1032K mnt_cache
   504    491  97%    2.00K    252        2      1008K kmalloc-2048
  2046   1768  86%    0.67K    186       11      1488K shmem_inode_cache
 10624   7765  73%    0.06K    166       64       664K anon_vma_chain
   480    470  97%    2.05K    160        3      1280K idr_layer_cache
  3875    616  15%    0.16K    155       25       620K sigqueue
  4480   2146  47%    0.12K    140       32       560K kmalloc-node
   278    262  94%    3.31K    139        2      1112K task_struct
  5320   3620  68%    0.07K     95       56       380K anon_vma
   258    235  91%    2.06K     86        3       688K sighand_cache
   166    159  95%    2.00K     83        2       332K TCPv6
   474    456  96%    0.62K     79        6       316K sock_inode_cache
    78     78 100%    8.00K     78        1       624K kmalloc-8192
  1617    433  26%    0.19K     77       21       308K cred_jar
   992    646  65%    0.25K     62       16       248K nf_conntrack
  4316   3836  88%    0.05K     52       83       208K ftrace_event_field
    50     50 100%   64.00K     50        1      3200K kmalloc-65536
   343    230  67%    1.06K     49        7       392K signal_cache
    98     86  87%    1.88K     49        2       196K TCP
  2070   1940  93%    0.09K     45       46       180K trace_event_file
   136    100  73%    1.00K     34        4       136K mm_struct
   896    895  99%    0.14K     32       28       128K ext4_groupinfo_4k
    31     31 100%   16.00K     31        1       496K kmalloc-16384
  1680    940  55%    0.07K     30       56       120K Acpi-Operand
   297    201  67%    0.36K     27       11       108K blkdev_requests
   312     32  10%    0.31K     26       12       104K bio-2
   704    248  35%    0.12K     22       32        88K pid
   273    158  57%    0.29K     21       13        84K request_sock_TCP
    19     19 100%   32.00K     19        1       608K kmalloc-32768
  1683    210  12%    0.04K     17       99        68K ext4_extent_status
   176     94  53%    0.69K     16       11       128K files_cache
   255    136  53%    0.23K     15       17        60K tw_sock_TCPv6
    98     34  34%    0.56K     14        7        56K task_group
  1386   1354  97%    0.04K     14       99        56K Acpi-Namespace
   923    434  47%    0.05K     13       71        52K Acpi-Parse
    52     48  92%    0.88K     13        4        52K RAW
   351    109  31%    0.10K      9       39        36K blkdev_ioc
   168    136  80%    0.19K      8       21        32K kmem_cache

elukey@dataset1001:~$ cat /proc/meminfo
MemTotal:       16423580 kB
MemFree:         2758824 kB
MemAvailable:    9511148 kB
Buffers:           32016 kB
Cached:         12818516 kB
SwapCached:            0 kB
Active:         10931380 kB
Inactive:        2112216 kB
Active(anon):     206080 kB
Inactive(anon):   350588 kB
Active(file):   10725300 kB
Inactive(file):  1761628 kB
Unevictable:       10584 kB
Mlocked:           10584 kB
SwapTotal:        976892 kB
SwapFree:         976892 kB
Dirty:                48 kB
Writeback:             0 kB
AnonPages:        203984 kB
Mapped:            36640 kB
Shmem:            359888 kB
Slab:             251504 kB
SReclaimable:     193040 kB
SUnreclaim:        58464 kB
KernelStack:        3712 kB
PageTables:         8896 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     9188680 kB
Committed_AS:    1301028 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     1590756 kB
DirectMap2M:    14125056 kB
DirectMap1G:     1048576 kB

Mentioned in SAL (#wikimedia-operations) [2017-10-09T08:56:13Z] <volans> disabled puppet on 'stat100[5-6]*,snapshot100[1,5-7]*,dataset1001*' to manually recover from nfsd stuck - T169680

Mentioned in SAL (#wikimedia-operations) [2017-10-09T09:51:48Z] <volans> rebooting dataset1001 for NFS stuck - T169680

Mentioned in SAL (#wikimedia-operations) [2017-10-09T10:00:03Z] <volans> rebooting snapshot1005 for NFS stuck - T169680

Mentioned in SAL (#wikimedia-operations) [2017-10-09T10:15:45Z] <elukey> rebooting snapshot1001 for NFS stuck - T169680

Mentioned in SAL (#wikimedia-operations) [2017-10-09T10:21:16Z] <volans> rebooting snapshot1006 for NFS stuck - T169680

Mentioned in SAL (#wikimedia-operations) [2017-10-09T10:30:48Z] <volans> rebooting snapshot1007 for NFS stuck - T169680

Mentioned in SAL (#wikimedia-operations) [2017-10-09T10:35:32Z] <hoo> Started dumpwikidatajson.sh on snapshot1007 after T169680 disruptions

In order to get dewiki to complete on time (before the 20th), I'm running the 4th part of the rev history content jobs via a script in a screen session on snapshot1006. I'll need to shoot the regular dewiki run once it completes the 3rd part, update a status file, and then look at the 7z job next.

Change 363548 had a related patch set uploaded (by Dzahn; owner: ArielGlenn):
[operations/puppet@production] monitor dataset hosts for nfsd cpu usage

https://gerrit.wikimedia.org/r/363548

Change 363548 had a related patch set uploaded (by Dzahn; owner: ArielGlenn):
[operations/puppet@production] datasets: monitor hosts for nfsd cpu usage

https://gerrit.wikimedia.org/r/363548

No dumps generation jobs run on the dataset host any more; it only does rsyncs and web service. The new NFS servers do only NFS, plus (serial) rsyncs to the web servers; they have twice the memory and are set up as RAID 10 instead of RAID 6. For these reasons I don't expect to see a repeat of this issue.

The NFS monitoring script would still be nice to have, so I'll leave this ticket open until that's complete.

Change 363548 merged by Dzahn:
[operations/puppet@production] dumps: monitor generation nfs server hosts for nfsd cpu usage

https://gerrit.wikimedia.org/r/363548

Dzahn claimed this task.
Dzahn added a subscriber: Bstorm.

@Bstorm ^ I heard you might be interested in copying this for other NFS servers. We just added these checks above to Icinga today.