
Reboot elasticsearch* and relforge* to apply kernel security updates
Closed, Resolved | Public | 3 Estimated Story Points

Description

Context

We need to roll out a new kernel version, which requires a full reboot of each host.

AC

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:20:14Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:21:03Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:21:10Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:21:16Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:26:21Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:26:29Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:27:00Z] <ryankemper> T280563 urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fbe4bb8a518>: Failed to establish a new connection: [Errno -2] Name or service not known

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:32:18Z] <ryankemper> T280563 Spotted the issue; forgot to set --without-lvs for relforge reboot

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:36:52Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:37:00Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563

ryankemper@cumin1001:~$ sudo -i cookbook sre.elasticsearch.rolling-operation relforge "relforge reboot" --reboot --without-lvs --no-wait-for-green --nodes-per-run 2 --start-datetime 2021-04-29T22:33:29 --task-id T280563
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
Fetch 2 node(s) from relforge to perform rolling restart on
Exception raised while executing cookbook sre.elasticsearch.rolling-operation:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 114, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 343, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 841, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 301, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f87e7ad3438>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 437, in get_nodes
    return self._elasticsearch.nodes.info()["nodes"]
  File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/elasticsearch/client/nodes.py", line 21, in info
    node_id, metric), params=params)
  File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 312, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 123, in perform_request
    raise ConnectionError('N/A', str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f87e7ad3438>: Failed to establish a new connection: [Errno -2] Name or service not known) caused by: NewConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f87e7ad3438>: Failed to establish a new connection: [Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/rolling-operation.py", line 146, in run
    rolling_operation
  File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/__init__.py", line 42, in execute_on_clusters
    nodes = elasticsearch_clusters.get_next_clusters_nodes(start_datetime, nodes_per_run)
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 294, in get_next_clusters_nodes
    nodes_group = self._get_nodes_group()
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 313, in _get_nodes_group
    for json_node in cluster.get_nodes().values():
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 439, in get_nodes
    raise ElasticsearchClusterError("Could not connect to the cluster") from e
spicerack.elasticsearch_cluster.ElasticsearchClusterError: Could not connect to the cluster
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563

Working on figuring out why this is failing.

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:44:15Z] <ryankemper> T280563 Bleh, we never moved the new config into spicerack, so it's trying to talk to the old relforge hosts which no longer exist. Will reboot relforge manually and use the cookbook for codfw/eqiad, and circle back later for the spicerack change

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:46:56Z] <ryankemper> T280563 Current master is relforge1003-relforge-eqiad, will reboot 1004 first then 1003 after
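The ordering described above (non-master node first, current master last) can be sketched as below. This is only an illustration of the manual plan, presumably chosen so that a master re-election happens once, at the end. `reboot_order` is a hypothetical helper, not part of spicerack or the cookbook, and the short node names are simplified from the log entry.

```python
from typing import List


def reboot_order(nodes: List[str], master: str) -> List[str]:
    """Return the nodes with the current master moved to the end.

    Hypothetical helper: it only orders names, it does not reboot anything.
    """
    others = [node for node in nodes if node != master]
    # Append the master last, but only if it is actually in the list.
    return others + ([master] if master in nodes else [])


if __name__ == "__main__":
    # Node names simplified from the SAL entry above.
    print(reboot_order(["relforge1003", "relforge1004"], master="relforge1003"))
```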

Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:44:15Z] <ryankemper> T280563 Bleh, we never moved the new config into spicerack, so it's trying to talk to the old relforge hosts which no longer exist. Will reboot relforge manually and use the cookbook for codfw/eqiad, and circle back later for the spicerack change

^ The above should say "out of spicerack", not "into spicerack"
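The failure mode above (a configured endpoint whose hostname no longer resolves, surfacing as `socket.gaierror: [Errno -2] Name or service not known`) could be caught before a run with a quick resolution check. A minimal sketch, assuming the caller supplies the list of endpoint hostnames; `unresolvable_hosts` is a hypothetical helper and not something spicerack provides. The port default is taken from the 9243 seen elsewhere in this task.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hosts: Iterable[str],
    port: int = 9243,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the hosts whose names fail DNS resolution.

    `resolver` is injectable so the check can be exercised without network access.
    """
    failed = []
    for host in hosts:
        try:
            # Same lookup urllib3 performs before opening a connection.
            resolver(host, port, 0, socket.SOCK_STREAM)
        except socket.gaierror:
            failed.append(host)
    return failed
```

Any hostname returned by this check would be expected to hit the same `NewConnectionError` as in the traceback.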

Mentioned in SAL (#wikimedia-operations) [2021-04-29T23:05:25Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-29T23:06:10Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts

Mentioned in SAL (#wikimedia-operations) [2021-04-29T23:08:44Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts (amended command)

Mentioned in SAL (#wikimedia-operations) [2021-04-29T23:18:07Z] <ryankemper> T280563 successful reboot of relforge100[3,4]; relforge cluster is back to green status.

Mentioned in SAL (#wikimedia-operations) [2021-04-30T01:08:12Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-30T03:45:33Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-30T03:47:48Z] <ryankemper> T280563 about half of codfw nodes have been rebooted before the failure caused by write queue not emptying fast enough, kicking it off again: sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts
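A failure of the "condition did not clear fast enough" kind generally comes from a poll-with-deadline loop. The sketch below shows that general pattern only; it is not the cookbook's actual implementation, and `wait_until` and its parameters are hypothetical names. The clock and sleep functions are injectable so the loop can be tested without real waiting.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed.

    Returns True if the condition was met, False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a pattern like this, a slow-draining queue yields a failed run rather than an indefinite hang, and rerunning with the same `--start-datetime` (as done above) appears to let the cookbook continue with the nodes that still need a reboot.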

Mentioned in SAL (#wikimedia-operations) [2021-04-30T04:41:04Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-30T04:43:06Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-04-30T04:43:20Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts

Mentioned in SAL (#wikimedia-operations) [2021-04-30T05:51:38Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

MPhamWMF set the point value for this task to 3. (May 3 2021, 3:37 PM)

Mentioned in SAL (#wikimedia-operations) [2021-05-03T16:30:11Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-03T17:44:34Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:46:00Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:47:25Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:52:43Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:54:49Z] <ryankemper> T280563 eqiad reboot failed with: curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.eqiad.wmnet', port=9243): Read timed out. (read timeout=10))

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:55:02Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:56:04Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-04T01:41:18Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-04T03:35:40Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-04T03:36:22Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-04T03:38:09Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-04T03:38:39Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-04T04:06:40Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-05T01:36:01Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-05T02:59:51Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:52:48Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:56:18Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:56:54Z] <ryankemper> T280563 Reboot of eqiad complete. Only ~half of codfw is remaining.

Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:58:00Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:58:14Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts

Mentioned in SAL (#wikimedia-operations) [2021-05-05T07:53:19Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-06T02:55:54Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-06T02:56:17Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:33:11Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:40:40Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-12T19:00:44Z] <ryankemper@cumin2001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin2001 - T280563

Mentioned in SAL (#wikimedia-operations) [2021-05-12T19:00:56Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin2001 tmux session elastic_restarts

Mentioned in SAL (#wikimedia-operations) [2021-05-12T19:07:10Z] <ryankemper@cumin2001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin2001 - T280563

Reboots are complete across all relevant cirrus clusters.
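For a quick overview of how many runs each cluster needed, the END entries in the timeline can be tallied. A rough sketch, assuming the SAL lines keep the format seen in this task; `tally_outcomes` and the regular expression are illustrative, not an existing tool.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "END (FAIL) - Cookbook <name> (exit_code=99) ... ElasticSearch cluster search_eqiad:"
END_RE = re.compile(
    r"END \((PASS|FAIL|ERROR)\) - Cookbook \S+ \(exit_code=(\d+)\)"
    r".*?ElasticSearch cluster (\w+):"
)


def tally_outcomes(lines: Iterable[str]) -> Counter:
    """Count END entries per (cluster, status); START and free-text lines are ignored."""
    counts = Counter()
    for line in lines:
        match = END_RE.search(line)
        if match:
            status, _exit_code, cluster = match.groups()
            counts[(cluster, status)] += 1
    return counts
```

Feeding the timeline above through `tally_outcomes` would give a per-cluster count of PASS, FAIL and ERROR runs.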