Context
We need to roll out a new kernel version, which requires a full reboot.
AC
- Finish cookbook refactor (https://phabricator.wikimedia.org/T277792) before rebooting
- Reboots performed on:
  - relforge
  - codfw
  - eqiad
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:20:14Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:21:03Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:21:10Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:21:16Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:26:21Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:26:29Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:27:00Z] <ryankemper> T280563 urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fbe4bb8a518>: Failed to establish a new connection: [Errno -2] Name or service not known
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:32:18Z] <ryankemper> T280563 Spotted the issue; forgot to set --without-lvs for relforge reboot
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:36:52Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:37:00Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
ryankemper@cumin1001:~$ sudo -i cookbook sre.elasticsearch.rolling-operation relforge "relforge reboot" --reboot --without-lvs --no-wait-for-green --nodes-per-run 2 --start-datetime 2021-04-29T22:33:29 --task-id T280563
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
Fetch 2 node(s) from relforge to perform rolling restart on
Exception raised while executing cookbook sre.elasticsearch.rolling-operation:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 114, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 343, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 841, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 301, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f87e7ad3438>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 437, in get_nodes
    return self._elasticsearch.nodes.info()["nodes"]
  File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/elasticsearch/client/nodes.py", line 21, in info
    node_id, metric), params=params)
  File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 312, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 123, in perform_request
    raise ConnectionError('N/A', str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f87e7ad3438>: Failed to establish a new connection: [Errno -2] Name or service not known) caused by: NewConnectionError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7f87e7ad3438>: Failed to establish a new connection: [Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/rolling-operation.py", line 146, in run
    rolling_operation
  File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/__init__.py", line 42, in execute_on_clusters
    nodes = elasticsearch_clusters.get_next_clusters_nodes(start_datetime, nodes_per_run)
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 294, in get_next_clusters_nodes
    nodes_group = self._get_nodes_group()
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 313, in _get_nodes_group
    for json_node in cluster.get_nodes().values():
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 439, in get_nodes
    raise ElasticsearchClusterError("Could not connect to the cluster") from e
spicerack.elasticsearch_cluster.ElasticsearchClusterError: Could not connect to the cluster
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563
Working on figuring out why this is failing.
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:44:15Z] <ryankemper> T280563 Bleh, we never moved the new config into spicerack, so it's trying to talk to the old relforge hosts which no longer exist. Will reboot relforge manually and use the cookbook for codfw/eqiad, and circle back later for the spicerack change
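Since the failure came down to configured hostnames that no longer resolve, a pre-flight DNS check before starting a rolling operation might have surfaced it sooner. The following is only a sketch of that idea, not part of spicerack or the cookbook: `unresolvable_hosts` is a hypothetical helper, the port is used only as the `getaddrinfo` argument, and the resolver is injectable so the check can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hosts: Iterable[str],
    resolve: Callable[..., object] = socket.getaddrinfo,
    port: int = 9243,  # assumption: any port works here, it is only passed to the resolver
) -> List[str]:
    """Return the hostnames that fail DNS resolution (hypothetical helper)."""
    missing = []
    for host in hosts:
        try:
            resolve(host, port)
        except socket.gaierror:
            # Same error class as in the traceback: "[Errno -2] Name or service not known"
            missing.append(host)
    return missing
```

An empty result means every configured name resolved; any names returned point at stale configuration worth fixing before the run.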
Mentioned in SAL (#wikimedia-operations) [2021-04-29T22:46:56Z] <ryankemper> T280563 Current master is relforge1003-relforge-eqiad, will reboot 1004 first then 1003 after
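The manual ordering used here (non-master node first, current master last) can be written down as a small helper. This is an illustrative sketch only; `reboot_order` is a hypothetical name, and the log does not state the reasoning behind the ordering.

```python
from typing import List, Sequence


def reboot_order(nodes: Sequence[str], master: str) -> List[str]:
    """Return nodes in reboot order: all non-master nodes first, master last."""
    if master not in nodes:
        raise ValueError(f"master {master!r} is not in the node list")
    others = [node for node in nodes if node != master]
    return others + [master]


# Node names taken from the log entry above.
order = reboot_order(["relforge1003", "relforge1004"], master="relforge1003")
```

With the two relforge nodes, this yields relforge1004 followed by relforge1003, matching the plan in the log.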
Mentioned in SAL (#wikimedia-operations) [2021-04-29T23:05:25Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-29T23:06:10Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts
Mentioned in SAL (#wikimedia-operations) [2021-04-29T23:08:44Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts (amended command)
Mentioned in SAL (#wikimedia-operations) [2021-04-29T23:18:07Z] <ryankemper> T280563 successful reboot of relforge100[3,4]; relforge cluster is back to green status.
Mentioned in SAL (#wikimedia-operations) [2021-04-30T01:08:12Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-30T03:45:33Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-30T03:47:48Z] <ryankemper> T280563 about half of codfw nodes have been rebooted before the failure caused by write queue not emptying fast enough, kicking it off again:sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts
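Re-running with the same --start-datetime is what lets a failed run pick up where it left off: the traceback earlier in this task shows start_datetime and nodes_per_run being passed to get_next_clusters_nodes. A minimal sketch of that kind of selection follows, assuming (this log does not confirm it) that a node counts as done once its process start time is later than the cutoff. `next_batch` and its input mapping are hypothetical and are not spicerack's implementation.

```python
from datetime import datetime
from typing import Dict, List


def next_batch(
    node_start_times: Dict[str, datetime],
    cutoff: datetime,
    batch_size: int,
) -> List[str]:
    """Pick up to batch_size nodes that have not restarted since cutoff."""
    # Assumption: a node started before the cutoff still needs its reboot.
    pending = sorted(
        name for name, started in node_start_times.items() if started < cutoff
    )
    return pending[:batch_size]
```

Because the cutoff stays fixed across retries, nodes rebooted by an earlier attempt drop out of the pending set on their own, so repeated runs converge on the remaining nodes.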
Mentioned in SAL (#wikimedia-operations) [2021-04-30T04:41:04Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-30T04:43:06Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-04-30T04:43:20Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts
Mentioned in SAL (#wikimedia-operations) [2021-04-30T05:51:38Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-03T16:30:11Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-03T17:44:34Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:46:00Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:47:25Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:52:43Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:54:49Z] <ryankemper> T280563 eqiad reboot failed with: curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.eqiad.wmnet', port=9243): Read timed out. (read timeout=10))
Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:55:02Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:56:04Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-04T01:41:18Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-04T03:35:40Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-04T03:36:22Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-04T03:38:09Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-04T03:38:39Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-04T04:06:40Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-05T01:36:01Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-05T02:59:51Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:52:48Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:56:18Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:56:54Z] <ryankemper> T280563 Reboot of eqiad complete. Only ~half of codfw is remaining.
Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:58:00Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:58:14Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin1001 tmux session elastic_restarts
Mentioned in SAL (#wikimedia-operations) [2021-05-05T07:53:19Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-06T02:55:54Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-06T02:56:17Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:33:11Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:40:40Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-12T19:00:44Z] <ryankemper@cumin2001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin2001 - T280563
Mentioned in SAL (#wikimedia-operations) [2021-05-12T19:00:56Z] <ryankemper> T280563 sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563 on ryankemper@cumin2001 tmux session elastic_restarts
Mentioned in SAL (#wikimedia-operations) [2021-05-12T19:07:10Z] <ryankemper@cumin2001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin2001 - T280563