Upgrade relforge cluster to 7.10.2
Closed, Resolved | Public | 2 Estimated Story Points

Event Timeline

Change 824555 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: enable ES7.10 in relforge env

https://gerrit.wikimedia.org/r/824555

Change 824555 merged by Bking:

[operations/puppet@production] elastic: enable ES7.10 in relforge env

https://gerrit.wikimedia.org/r/824555

After merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/824555, we got some errors (see Brian's following comment).

As a hack, we tried copying the other Elastic packages over into the thirdparty/elastic710 component to see whether that would get things working, but it didn't.

https://apt-browser.toolforge.org/bullseye-wikimedia/thirdparty/elastic710/ used to look like this:

Showing packages for bullseye-wikimedia/thirdparty/elastic710/binary-amd64:

    wmf-elasticsearch-search-plugins: 7.10.2-2~bullseye

Source code available under the AGPL v3 or later.

We made it look like this:

Showing packages for bullseye-wikimedia/thirdparty/elastic710/binary-amd64:

    elasticsearch-curator: 5.8.1
    elasticsearch-oss: 7.10.0
    kibana-oss: 7.10.0
    logstash-oss: 1:7.10.0-1
    wmf-elasticsearch-search-plugins: 7.10.2-2~bullseye

Errors:
W: Skipping acquire of configured file 'component/elastic710/binary-amd64/Packages' as repository 'http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease' doesn't have the component 'component/elastic710' (component misspelt in sources.list?)
W: Skipping acquire of configured file 'component/elastic710/i18n/Translation-en_US' as repository 'http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease' doesn't have the component 'component/elastic710' (component misspelt in sources.list?)
W: Skipping acquire of configured file 'component/elastic710/i18n/Translation-en' as repository 'http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease' doesn't have the component 'component/elastic710' (component misspelt in sources.list?)
root@relforge1003:/etc/apt/sources.list.d# ls -lhtra
total 32K

W: Skipping acquire of configured file 'thirdparty/elasticsearch-curator5/sour[20/606]es' as repository 'http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease' doesn't have the component 'thirdparty/elasticsearch-curator5' (component misspelt in sources.list?)
W: Skipping acquire of configured file 'thirdparty/elasticsearch-curator5/binary-amd64/Packages' as repository 'http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease' doesn't have the component 'thirdparty/elasticsearch-curator5' (component misspelt in sources.list?)
W: Skipping acquire of configured file 'thirdparty/elasticsearch-curator5/i18n/Translation-en_US' as repository 'http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease' doesn't have the component 'thirdparty/elasticsearch-curator5' (component misspelt in sources.list?)
W: Skipping acquire of configured file 'thirdparty/elasticsearch-curator5/i18n/Translation-en' as repository 'http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease' doesn't have the
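
The warnings above repeat the same two missing components many times over. A minimal sketch for pulling the distinct component names out of pasted apt output follows; the function name `missing_components` is hypothetical, and the regular expression only assumes the "doesn't have the component '...'" wording seen in these warnings.

```python
import re

# Matches the component name apt quotes in its "Skipping acquire" warnings.
COMPONENT_RE = re.compile(r"doesn't have the component '([^']+)'")


def missing_components(apt_output):
    """Return the distinct component names reported missing, in first-seen order."""
    seen = []
    for name in COMPONENT_RE.findall(apt_output):
        if name not in seen:
            seen.append(name)
    return seen
```

Run against the output pasted in this comment, this should reduce the warnings to component/elastic710 and thirdparty/elasticsearch-curator5, which are the names to check against sources.list.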

Change 824568 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] bullseye: add thirdparty/elasticsearch-curator5

https://gerrit.wikimedia.org/r/824568

Change 824568 merged by Bking:

[operations/puppet@production] bullseye: add thirdparty/elasticsearch-curator5

https://gerrit.wikimedia.org/r/824568

Change 824791 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] bullseye: apt component update

https://gerrit.wikimedia.org/r/824791

MPhamWMF set the point value for this task to 2. (Aug 22 2022, 3:34 PM)
MPhamWMF moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.
bking updated Other Assignee, added: RKemper.
RKemper reassigned this task from RKemper to bking.
RKemper updated Other Assignee, removed: RKemper.
RKemper updated Other Assignee, added: RKemper.

Change 824791 merged by Bking:

[operations/puppet@production] bullseye: apt component update

https://gerrit.wikimedia.org/r/824791

Mentioned in SAL (#wikimedia-operations) [2022-08-22T21:17:14Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604

Mentioned in SAL (#wikimedia-operations) [2022-08-22T21:17:38Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604

Change 825413 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] apt: changes to pull in latest elastic version

https://gerrit.wikimedia.org/r/825413

Change 825413 merged by Bking:

[operations/puppet@production] apt: changes to pull in latest elastic version

https://gerrit.wikimedia.org/r/825413

Mentioned in SAL (#wikimedia-operations) [2022-08-22T21:45:56Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604

Mentioned in SAL (#wikimedia-operations) [2022-08-22T21:46:09Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604

Mentioned in SAL (#wikimedia-operations) [2022-08-22T21:55:43Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604

Mentioned in SAL (#wikimedia-operations) [2022-08-22T21:56:16Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin2002 - T315604

The cookbook failed on relforge. It looks like the systemd units were not replaced, so it tried to start the elasticsearch_6 and elasticsearch_7 services at the same time.

Our custom Elasticsearch unit files are sourced from /usr/lib/systemd/system (not a best practice; custom unit files are supposed to go in /etc/systemd/system). We use systemd's templating feature to automatically fill in specific details for each Elasticsearch instance; the file names that activate the templates are found in /etc/systemd/system/multi-user.target.wants, and the specific file names are sourced from /etc/elasticsearch/instances.
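
To make the template naming concrete, here is a minimal sketch of how an instance name maps to a unit name and to its entry under multi-user.target.wants. The helper names `unit_name` and `wants_path` are hypothetical. The elasticsearch_6@ pattern matches the unit named in this task; that the elasticsearch_7 template follows the same pattern is an assumption.

```python
WANTS_DIR = "/etc/systemd/system/multi-user.target.wants"


def unit_name(major_version, instance):
    """Instantiated unit name for a template like elasticsearch_6@.service."""
    return f"elasticsearch_{major_version}@{instance}.service"


def wants_path(major_version, instance):
    """Path of the symlink that systemd creates when the instance is enabled."""
    return f"{WANTS_DIR}/{unit_name(major_version, instance)}"


if __name__ == "__main__":
    # Both major versions being enabled for one instance is the conflict seen here.
    for version in (6, 7):
        print(wants_path(version, "relforge-eqiad-small-alpha"))
```

If both paths exist on a host, both services start at boot, which is consistent with the failure the cookbook hit.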

We'll figure out the next steps tomorrow.

Adding the following to the cookbook:

  • stop service
  • disable es6 units, such as systemctl disable elasticsearch_6@relforge-eqiad-small-alpha.service
  • remove /usr/lib/systemd/system/elasticsearch_6@.service
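
The steps above can be sketched as a function that builds the command list for one host. This is not the actual cookbook change: the function name `es6_cleanup_commands` is hypothetical, and "stop service" is assumed here to mean stopping each elasticsearch_6@ instance unit.

```python
import shlex

ES6_TEMPLATE = "/usr/lib/systemd/system/elasticsearch_6@.service"


def es6_cleanup_commands(instances):
    """Build the commands that stop and disable ES6 units, then remove the template."""
    commands = []
    for instance in instances:
        unit = f"elasticsearch_6@{instance}.service"
        commands.append(["systemctl", "stop", unit])     # assumed meaning of "stop service"
        commands.append(["systemctl", "disable", unit])  # drops the multi-user.target.wants symlink
    commands.append(["rm", ES6_TEMPLATE])                # remove the ES6 template unit file
    return commands


if __name__ == "__main__":
    # Print only; nothing is executed.
    for command in es6_cleanup_commands(["relforge-eqiad-small-alpha"]):
        print(shlex.join(command))
```

Building the list separately from running it makes it possible to review the exact commands before anything touches a host.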

Mentioned in SAL (#wikimedia-operations) [2022-08-23T19:34:45Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T315604

Mentioned in SAL (#wikimedia-operations) [2022-08-23T19:34:59Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T315604

[2022-08-23T19:46:22,669][WARN ][o.e.c.c.ClusterFormationFailureHelper] [relforge1003-relforge-eqiad-small-alpha] master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{relforge1003-relforge-eqiad-small-alpha}{rrFzJB13TIa9ElcCQ8v6NQ}{vKrW187kSR-4Rsy7wCg5Qg}{10.64.5.37}{10.64.5.37:9500}{dimr}{hostname=relforge1003, rack=A2, fqdn=relforge1003.eqiad.wmnet, row=A}]; discovery will continue using [] from hosts providers and [{relforge1003-relforge-eqiad-small-alpha}{rrFzJB13TIa9ElcCQ8v6NQ}{vKrW187kSR-4Rsy7wCg5Qg}{10.64.5.37}{10.64.5.37:9500}{dimr}{hostname=relforge1003, rack=A2, fqdn=relforge1003.eqiad.wmnet, row=A}] from last-known cluster state; node term 0, last-accepted version 15 in term 0
[2022-08-23T19:46:31,615][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [relforge1003-relforge-eqiad-small-alpha] no known master node, scheduling a retry

Because relforge has only two masters, we'll have to follow a special procedure to upgrade it. We can manually set cluster.initial_master_nodes: based on the error message as well as the docs, we just need to set it to relforge1003-relforge-eqiad-small-alpha in /etc/elasticsearch/relforge-eqiad-small-alpha/elasticsearch.yml, like so:

# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["relforge1003-relforge-eqiad-small-alpha"]
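
As a rough illustration of going from the warning to the setting, the sketch below reads a ClusterFormationFailureHelper log line and emits the matching YAML line. The function name `initial_master_nodes_setting` is hypothetical, and the regular expression assumes the log layout shown in the warning above (logger name followed by the node name in brackets).

```python
import json
import re

# The node name is the bracketed field right after the logger name.
NODE_RE = re.compile(r"\[o\.e\.c\.c\.ClusterFormationFailureHelper\] \[([^\]]+)\]")


def initial_master_nodes_setting(log_line):
    """Return a cluster.initial_master_nodes YAML line, or None if the line does not apply."""
    if "[cluster.initial_master_nodes] is empty" not in log_line:
        return None
    match = NODE_RE.search(log_line)
    if match is None:
        return None
    # A JSON list of strings is also valid YAML flow syntax.
    return "cluster.initial_master_nodes: " + json.dumps([match.group(1)])
```

This only covers bootstrapping from a single node, which is the case described here; whether the setting should stay in place after the cluster has formed is outside what the sketch addresses.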

Mentioned in SAL (#wikimedia-cloud) [2022-08-29T21:25:32Z] <inflatador> ES6->7 upgrade in beta-cluster T315604