⚓ T284471 Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet

Subject	Repo	Branch	Lines +/-
sre.hosts.reimage: handle switches without virtual chassis	operations/cookbooks	wmcs	+6 -1
sre.hosts.reimage: handle switches without virtual chassis	operations/cookbooks	master	+6 -1
Adding mac addresses to dhcpd file cloudcephosd hosts	operations/puppet	production	+20 -0

RobH renamed this task from (Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet to Q1:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet. Aug 26 2021, 7:46 PM

wiki_willy renamed this task from Q1:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet to Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet. Aug 26 2021, 10:29 PM

cloudcephosd1021 C8 u31 port 0/1 cableid 11034/11032 cloudsw2-c8-eqiad
cloudcephosd1022 C8 u32 port 2/3 cableid 11033/11031 cloudsw2-c8-eqiad
cloudcephosd1023 D5 u31 port 0/1 cableid 11038/11041 cloudsw2-d5-eqiad
cloudcephosd1024 D5 u32 port 2/3 cableid 11040/11039 cloudsw2-d5-eqiad

Jclark-ctr reassigned this task from Jclark-ctr to • Cmjohnson. Aug 31 2021, 8:18 PM

Jclark-ctr updated the task description.

Jclark-ctr subscribed.

Updated the network ports with VLANs for both NICs.

Change 716530 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding mac addresses to dhcpd file cloudcephosd hosts

https://gerrit.wikimedia.org/r/716530

gerritbot added a project: Patch-For-Review. Sep 2 2021, 7:40 PM

Change 716530 merged by Cmjohnson:

[operations/puppet@production] Adding mac addresses to dhcpd file cloudcephosd hosts

https://gerrit.wikimedia.org/r/716530

Maintenance_bot removed a project: Patch-For-Review. Sep 2 2021, 8:11 PM

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071403_cmjohnson_28311_cloudcephosd1021_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1022.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071407_cmjohnson_2001_cloudcephosd1022_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1023.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071408_cmjohnson_4357_cloudcephosd1023_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1024.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071408_cmjohnson_5867_cloudcephosd1024_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1024.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudcephosd1023.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071438_cmjohnson_12517_cloudcephosd1021_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1022.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1022.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

• Cmjohnson updated the task description. Sep 7 2021, 6:48 PM

cloudcephosd1023 and 1024 are installed and set to staged. 1021 and 1022, both in C8, get stuck during the partitioning phase of the install. I need to check that the disks are set to non-RAID.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109091550_cmjohnson_19157_cloudcephosd1021_eqiad_wmnet.log.

The disks on cloudcephosd1021 and 1022 are correctly set to non-RAID. I did fix a BIOS setting on 1021: it was set to continuously boot to the NIC. I am installing again, but it is stuck or really slow during "starting up partitioner", sitting at 36%. I have not experienced this before. @RobH or @Papaul, have you?

Screen Shot 2021-09-09 at 11.58.36 AM.png (404×554 px, 20 KB)

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

I've double-checked the settings on cloudcephosd1021 and also updated its firmware (there were newer releases in the last couple of weeks), all to no avail. It is still quite slow when loading disk detection and disk partitioning, and it shouldn't be that slow!

I'm now running the Dell test suite to see if any other errors are detected. There was an error event on the 7th, so I downloaded the SEL and then cleared it before running the full test suite.

sel.csv (396 B)

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109141451_cmjohnson_8680_cloudcephosd1021_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

I've added these to cloudsw1

cloudcephosd1021 is in ports 3 and 4 and not pxe booting.
cloudcephosd1022 is in ports 34 and 25 but has not been set up on the switch yet until I can get 1021 to install correctly.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1022.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109171645_cmjohnson_19912_cloudcephosd1022_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109171652_cmjohnson_20350_cloudcephosd1021_eqiad_wmnet.log.

RobH unsubscribed. Sep 17 2021, 5:07 PM

Completed auto-reimage of hosts:

['cloudcephosd1022.eqiad.wmnet']

and were ALL successful.

Moved both of the servers to cloudsw1. cloudcephosd1022 was installed with zero issues and is now set to staged. cloudcephosd1021 is still having issues at the partitioner during the installation.

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

@Papaul or @RobH, I've hit a roadblock with cloudcephosd1021. The install is failing at 45% in the partitioner. I am not seeing any failed disks. I thought it might be a network issue, so I moved the server to cloudsw1 along with cloudcephosd1022. 1022 was installed and is staged in Netbox, but I am still running into the same issue with 1021. I checked the RAID configuration and BIOS settings, and the firmware appears to be up to date. Could you look at it with a fresh perspective?

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1022.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109221642_cmjohnson_23416_cloudcephosd1022_eqiad_wmnet.log.

@Papaul @RobH I moved the disks from cloudcephosd1022 to cloudcephosd1021 and attempted a re-install. That did not work; the partitioner still hung at 45%. The next step is to try swapping the RAID controller.

Completed auto-reimage of hosts:

['cloudcephosd1022.eqiad.wmnet']

and were ALL successful.

Attempting to swap the RAID controller today: I took the controller from 1022 and put it in 1021. Fingers crossed this is the issue, so we can go to Dell with something specific.

Good news: with the RAID controller swapped, 1021 went through the full installation without an issue. I put the RAID controller from 1021 into 1022, and that server stalled during POST at "initializing firmware". I put it all back together; 1022 is back up and 1021 is now stalling at "initializing firmware".

RobH unsubscribed. Oct 7 2021, 2:11 PM

@Cmjohnson Hi! Any updates?

A ticket has been created with Dell (request SR1073334762). I entered a lot of explanation and troubleshooting detail in the ticket, so hopefully they will not push back.

@dcaro I was on vacation last week. If the part ships today, I will hopefully have it ready for you tomorrow, Monday at the latest.

@Cmjohnson Thanks for the update!

The part shipped and will hopefully arrive today.

New raid card has been installed @Cmjohnson

We have a new install script and it's not working for this server. This is the error I am getting:

cmjohnson@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye --new -t T289888 cloudcephosd1021
Management Password:
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 219, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 84, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 152, in __init__
    self.dhcp_config = self._get_dhcp_config()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 254, in _get_dhcp_config
    switch_hostname=switch_iface.device.virtual_chassis.name.split('.')[0],
AttributeError: 'NoneType' object has no attribute 'name'

Change 734571 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/cookbooks@master] sre.hosts.reimage: handle switches without virtual chassis

https://gerrit.wikimedia.org/r/734571

gerritbot added a project: Patch-For-Review. Oct 26 2021, 9:42 AM

Sent a patch to fix the issue (the new script was expecting the switch to have a virtual chassis).
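
For readers following along, the failure mode in the traceback can be sketched roughly as below. This is not the merged patch: the `Device` and `VirtualChassis` dataclasses and the `switch_short_hostname` helper are hypothetical stand-ins for the Netbox objects, and falling back to the device's own name is an assumption about what a reasonable fix would do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VirtualChassis:
    """Hypothetical stand-in for a Netbox virtual chassis record."""
    name: str


@dataclass
class Device:
    """Hypothetical stand-in for a Netbox switch device record."""
    name: str
    virtual_chassis: Optional[VirtualChassis] = None


def switch_short_hostname(device: Device) -> str:
    """Return the short switch hostname, with or without a virtual chassis."""
    # Assumption: when the switch is standalone (virtual_chassis is None),
    # its own device name is the right value to use instead.
    if device.virtual_chassis is not None:
        source = device.virtual_chassis.name
    else:
        source = device.name
    # Keep only the first DNS label, as the original expression did.
    return source.split(".")[0]
```

With this shape, a standalone switch no longer triggers the `AttributeError`, while switches that are part of a virtual chassis keep their previous behaviour.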

Change 734571 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: handle switches without virtual chassis

https://gerrit.wikimedia.org/r/734571

Maintenance_bot removed a project: Patch-For-Review. Oct 26 2021, 4:12 PM

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

cloudcephosd1021 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details

From the logs, the cookbook was not able to detect a reboot after the host was up in the Debian Installer environment. That usually means that the host got stuck in Debian Installer, probably because it is asking the user to answer a question or because of invalid/incompatible preseed or partman configurations.
And indeed, connecting to the remote console shows a Debian Installer screen that failed to set up the disks.
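
As a rough illustration of what "not able to find a reboot" means here, a check of this kind can be sketched as a polling loop that waits for the host's boot time to move forward. This is not the cookbook's actual implementation: `wait_for_reboot`, its parameters, and the `get_boot_time` callable are all hypothetical names used only for this sketch.

```python
import time
from typing import Callable


def wait_for_reboot(
    get_boot_time: Callable[[], float],
    previous_boot_time: float,
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the host reports a boot time newer than the previous one."""
    for _ in range(attempts):
        try:
            if get_boot_time() > previous_boot_time:
                return True  # The host came back up from a fresh boot.
        except ConnectionError:
            # The host is unreachable, which is expected while it reboots.
            pass
        sleep(delay)
    # No newer boot time was seen: the host is likely stuck, for example
    # on an installer screen waiting for input.
    return False
```

A host sitting on a failed partitioning screen never reboots, so a loop like this runs out of attempts and the run is reported as failed.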

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Thanks, @Volans, for reminding me that I had to redo the RAID configuration with the new controller.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

cloudcephosd1021 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

cloudcephosd1021 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- The reimage failed, see the cookbook logs for the details

The BIOS had changed to continuously boot from the NIC, which made the installation fail. Fixed the BIOS and attempting the install again.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

cloudcephosd1021 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

cloudcephosd1021 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

cloudcephosd1021 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

cloudcephosd1021 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye completed:

cloudcephosd1021 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110281533_cmjohnson_29391_cloudcephosd1021.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged

finally done!

\o/

Papaul unsubscribed. Oct 28 2021, 5:11 PM

Change 736575 had a related patch set uploaded (by Andrew Bogott; author: David Caro):

[operations/cookbooks@wmcs] sre.hosts.reimage: handle switches without virtual chassis

https://gerrit.wikimedia.org/r/736575

Change 736575 abandoned by Andrew Bogott:

[operations/cookbooks@wmcs] sre.hosts.reimage: handle switches without virtual chassis

Reason:

git-review mishap

https://gerrit.wikimedia.org/r/736575

Maintenance_bot removed a project: Patch-For-Review. Nov 3 2021, 10:10 PM

Note: manually updated the Netbox status of cloudcephosd1022 from Planned to Staged.

aborrero closed subtask T295012: cloud ceph: include cloudcephosd102[1-4].eqiad.wmnet in the farm as Resolved. Nov 5 2021, 10:16 AM

aborrero reopened subtask T295012: cloud ceph: include cloudcephosd102[1-4].eqiad.wmnet in the farm as Open. Nov 5 2021, 11:11 AM

aborrero closed subtask T295012: cloud ceph: include cloudcephosd102[1-4].eqiad.wmnet in the farm as Resolved. Nov 5 2021, 11:33 AM

dcaro mentioned this in rCCKB7d5190dfec59: sre.hosts.reimage: handle switches without virtual chassis. Dec 14 2022, 3:29 PM


	F34638330: Screen Shot 2021-09-09 at 11.58.36 AM.png
	Sep 9 2021, 4:00 PM

Status	Assigned	Task
		Unknown Object (Task)
Resolved	• Cmjohnson	T284471 Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet
Resolved	aborrero	T295012 cloud ceph: include cloudcephosd102[1-4].eqiad.wmnet in the farm
Resolved	dcaro	T296175 cloudcephosd1021 is using an old ceph version because its running debian bullseye instead of buster

	RobH
	Jun 7 2021, 4:27 PM

	F34638496: sel.csv
	Sep 9 2021, 6:55 PM
