Starting at around 15:00 UTC today, the load average on dataset1001 rose to around 10, maxing out the 8 available CPUs.
From dmesg it seems that an nfsd thread got blocked shortly after 15:00 UTC, but the error was not logged again later on. The timing coincides with the spike in load visible in Grafana.
[Jul 4 15:07] INFO: task nfsd:1764 blocked for more than 120 seconds.
[ +0.019962] Tainted: G I E 4.9.0-0.bpo.2-amd64 #1
[ +0.019000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.023942] nfsd D 0 1764 2 0x00000000
[ +0.000004] ffff916bdabdfc00 0000000000000000 ffff916bd7f28dc0 ffff916bd7e2c380
[ +0.000002] ffff916befc98700 ffffafe781ffb9f0 ffffffff949f668d ffff916bd7fa1180
[ +0.000001] ffff916bd7e2c380 0000000000000000 ffffafe781ffbc08 ffff916bd7e2c380
[ +0.000002] Call Trace:
[ +0.000008] [<ffffffff949f668d>] ? __schedule+0x23d/0x6d0
[ +0.000002] [<ffffffff949f6b52>] ? schedule+0x32/0x80
[ +0.000001] [<ffffffff949f9c69>] ? rwsem_down_write_failed+0x229/0x370
[ +0.000005] [<ffffffff94738683>] ? call_rwsem_down_write_failed+0x13/0x20
[ +0.000001] [<ffffffff949f8e19>] ? down_write+0x29/0x40
[ +0.000063] [<ffffffffc0805bed>] ? xfs_file_buffered_aio_write+0x5d/0x270 [xfs]
[ +0.000004] [<ffffffff94689cc4>] ? find_acceptable_alias+0x24/0x100
[ +0.000024] [<ffffffffc0805f0b>] ? xfs_file_write_iter+0x10b/0x150 [xfs]
[ +0.000004] [<ffffffff94602cb0>] ? do_iter_readv_writev+0xb0/0x130
[ +0.000002] [<ffffffff9460370c>] ? do_readv_writev+0x1ac/0x240
[ +0.000024] [<ffffffffc0805e00>] ? xfs_file_buffered_aio_write+0x270/0x270 [xfs]
[ +0.000002] [<ffffffff946006c8>] ? do_dentry_open+0x238/0x2e0
[ +0.000015] [<ffffffffc099fe77>] ? nfsd_vfs_write+0xd7/0x3c0 [nfsd]
[ +0.000006] [<ffffffffc09a2124>] ? nfsd_write+0x124/0x310 [nfsd]
[ +0.000007] [<ffffffffc09a8078>] ? nfsd3_proc_write+0xb8/0x150 [nfsd]
[ +0.000005] [<ffffffffc099b2bd>] ? nfsd_dispatch+0xad/0x1f0 [nfsd]
[ +0.000028] [<ffffffffc0691238>] ? svc_process_common+0x448/0x670 [sunrpc]
[ +0.000009] [<ffffffffc06922d4>] ? svc_process+0xf4/0x1a0 [sunrpc]
[ +0.000006] [<ffffffffc099ad29>] ? nfsd+0xe9/0x160 [nfsd]
[ +0.000005] [<ffffffffc099ac40>] ? nfsd_destroy+0x60/0x60 [nfsd]
[ +0.000002] [<ffffffff944974c0>] ? kthread+0xe0/0x100
[ +0.000002] [<ffffffff9442476b>] ? __switch_to+0x2bb/0x700
[ +0.000001] [<ffffffff944973e0>] ? kthread_park+0x60/0x60
[ +0.000002] [<ffffffff949fb675>] ? ret_from_fork+0x25/0x30
[ +0.000002] INFO: task nfsd:1765 blocked for more than 120 seconds.
...
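For reference, this kind of hang can be spotted with something like the following (illustrative commands only, not necessarily what was run at the time; any nfsd thread stuck waiting on the XFS write lock shows up in D state):

dmesg -T | grep -A 2 'blocked for more than 120 seconds'
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

The first prints the hung-task reports quoted above with readable timestamps, the second lists tasks currently in uninterruptible sleep together with the kernel function they are waiting in.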
As a consequence, the hosts that mount its NFS export (stat100[2-3], snapshot100[1,5-7]) are showing a skyrocketing load average that is reaching 1000, although this is an artificial load caused by the unresponsive NFS mountpoint rather than real CPU usage.
Those hosts are either logging messages like:
nfs: server dataset1001.wikimedia.org not responding, still trying
or they have processes blocked:
INFO: task php5:6093 blocked for more than 120 seconds
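The inflated load number is expected: the Linux load average counts tasks in uninterruptible (D) sleep as well as runnable ones, so every process touching the dead NFS mountpoint adds one to the load while using no CPU at all. A rough sanity check on one of the clients (again just an illustrative command) is to compare the load against the count of D-state tasks:

cat /proc/loadavg
ps -eo stat | grep -c '^D'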
The only alarm we got was for Puppet not having run for 6 hours on the stat100[2-3] and snapshot100[1,5-7] hosts; no alarm was triggered for dataset1001 itself.
After a quick chat on IRC, and given that nobody familiar with the dumps process was around, it was decided to leave things as they are and not attempt a restart of the NFS server, to avoid additional disruption of the ongoing dumps/rsyncs/etc. Another reason was that Icinga did not page anyone, only IRC notifications were sent.
I'm ack'ing the related alarms on Icinga with a reference to this task.