
Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of cloudcephosd102[1-4].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: cloudcephosd102[1-4].eqiad.wmnet
Racking Proposal: Use WMCS dedicated racks
Networking/Subnet/VLAN/IP: 10G, (wmcs) cloud vlan; plug into the cloudsw in the rack; 2 x 10G ports per server (4 x 2 = 8 ports). Each host should have its first 10G port on cloud-hosts1-eqiad and its second 10G port on cloud-storage1-eqiad.
Partitioning/Raid: all disks in non-raid mode on the hw controller, then sw RAID 10 on the OS drive pair; no RAID (JBOD only) for data drives.
OS Distro: Buster
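
The port plan above can be sketched in a few lines of Python. This is only an illustration, not part of any WMF tooling: the helper names `expand_host_range` and `port_plan` are hypothetical, and the host range is the one from the task title.

```python
import re

# VLAN per 10G port, in port order, as stated in the task description.
PORT_VLANS = ("cloud-hosts1-eqiad", "cloud-storage1-eqiad")


def expand_host_range(pattern: str) -> list[str]:
    """Expand a single bracketed numeric range, e.g. 'host102[1-4].example'."""
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        # No range in the pattern: it already names a single host.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


def port_plan(pattern: str) -> dict[str, dict[int, str]]:
    """Map every host in the range to {port number: VLAN name}."""
    return {
        host: dict(enumerate(PORT_VLANS, start=1))
        for host in expand_host_range(pattern)
    }


plan = port_plan("cloudcephosd102[1-4].eqiad.wmnet")
total_ports = sum(len(ports) for ports in plan.values())

for host, ports in plan.items():
    print(host, ports)
print("switch ports needed:", total_ports)
```

Counting the entries in `plan` gives the number of switch ports to reserve across the dedicated racks.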

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcephosd1021.eqiad.wmnet:

  • - receive in system on procurement task T283888 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudcephosd1022.eqiad.wmnet:

  • - receive in system on procurement task T283888 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudcephosd1023.eqiad.wmnet:

  • - receive in system on procurement task T283888 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

cloudcephosd1024.eqiad.wmnet:

  • - receive in system on procurement task T283888 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH renamed this task from (Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet to Q1:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet. Aug 26 2021, 7:46 PM
wiki_willy renamed this task from Q1:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet to Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet. Aug 26 2021, 10:29 PM

cloudcephosd1021 C8 u31. port 0/1 cableid 11034/11032 cloudsw2-c8-eqiad
cloudcephosd1022 C8 u32. port 2/3 cableid 11033/11031 cloudsw2-c8-eqiad
cloudcephosd1023 D5 u31. port 0/1 cableid 11038/11041 cloudsw2-d5-eqiad
cloudcephosd1024 D5 u32. port 2/3 cableid 11040/11039 cloudsw2-d5-eqiad

Updated the network ports with vlans for both NICs

Change 716530 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding mac addresses to dhcpd file cloudcephosd hosts

https://gerrit.wikimedia.org/r/716530

Change 716530 merged by Cmjohnson:

[operations/puppet@production] Adding mac addresses to dhcpd file cloudcephosd hosts

https://gerrit.wikimedia.org/r/716530

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071403_cmjohnson_28311_cloudcephosd1021_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1022.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071407_cmjohnson_2001_cloudcephosd1022_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1023.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071408_cmjohnson_4357_cloudcephosd1023_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1024.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071408_cmjohnson_5867_cloudcephosd1024_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1024.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudcephosd1023.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109071438_cmjohnson_12517_cloudcephosd1021_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1022.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1022.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

cloudcephosd1023 and 1024 installed and are set to staged. 1021 and 1022 both in C8 get stuck during the partitioning phase of the install. I need to check the disks are set to non-raid.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109091550_cmjohnson_19157_cloudcephosd1021_eqiad_wmnet.log.

The cloudcephosd1021 and 1022 disks are correctly set to non-raid. I did fix a BIOS setting on 1021, which was set to continuously boot to the NIC. I am installing again, but it is stuck or really slow during "starting up partitioner", at 36%. I have not experienced this before. @RobH or @Papaul, have you?

Screen Shot 2021-09-09 at 11.58.36 AM.png (404×554 px, 20 KB)

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

I've double-checked the settings on cloudcephosd1021 and also updated its firmware (newer releases came out in the last couple of weeks), all to no avail. It is still quite slow when loading the disk detection and the disk partitioning, and it shouldn't be that slow!

I'm now running the dell test suite to see if any other errors are detected. There was an error event on the 7th, so I downloaded the SEL, then cleared it for full testing suite.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109141451_cmjohnson_8680_cloudcephosd1021_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

I've added these to cloudsw1

cloudcephosd1021 is in ports 3 and 4 and not pxe booting.
cloudcephosd1022 is in ports 34 and 25 but has not been set up on the switch yet until I can get 1021 to install correctly.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1022.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109171645_cmjohnson_19912_cloudcephosd1022_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109171652_cmjohnson_20350_cloudcephosd1021_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1022.eqiad.wmnet']

and were ALL successful.

Moved both of the servers to cloudsw1. cloudcephosd1022 was installed with zero issues and is now set to staged. cloudcephosd1021 is still having issues at the partitioner during the installation.

Completed auto-reimage of hosts:

['cloudcephosd1021.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1021.eqiad.wmnet']

@Papaul or @RobH I've hit a roadblock with cloudcephosd1021. The install is failing at 45% in the partitioner. I am not seeing any failed disks. I thought maybe it was a network thing, so I moved the server to cloudsw1 along with cloudcephosd1022. 1022 was installed and is staged in netbox, but I am still running into the same issue with 1021. I checked the raid configuration and bios settings, and the f/w appears to be up to date. Could you look at it with a fresh perspective?

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1022.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109221642_cmjohnson_23416_cloudcephosd1022_eqiad_wmnet.log.

@Papaul @RobH I moved the disks from cloudcephosd1022 to cloudcephosd1021 and attempted a re-install. That did not work; the partitioner still hung at 45%. The next step is to try swapping the raid controller.

Completed auto-reimage of hosts:

['cloudcephosd1022.eqiad.wmnet']

and were ALL successful.

Attempting to swap the raid controller today, took the controller from 1022 and put it in 1021. Fingers crossed this is the issue so we can go to Dell with something specific.

Good news: I swapped the raid controller and 1021 went through the full installation without an issue. I put the raid controller from 1021 into 1022, and that server stalled during POST at initializing firmware. I put it all back together; 1022 is back up and 1021 is now stalling at initializing firmware.

A ticket has been created with Dell. I entered a lot of explanation and troubleshooting detail in the ticket, so hopefully they will not push back. You have successfully submitted request SR1073334762.

@dcaro I was on vacation last week. If the part ships today, I will hopefully have it ready for you tomorrow, Monday at the latest.

The part shipped, hopefully arrives today

we have a new install script and it's not working for this server. This is the error I am getting

cmjohnson@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye --new -t T289888 cloudcephosd1021
Management Password:
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 219, in run
  runner = self.instance.get_runner(args)
File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 84, in get_runner
  return ReimageRunner(args, self.spicerack)
File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 152, in __init__
  self.dhcp_config = self._get_dhcp_config()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 254, in _get_dhcp_config
  switch_hostname=switch_iface.device.virtual_chassis.name.split('.')[0],

AttributeError: 'NoneType' object has no attribute 'name'

Change 734571 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/cookbooks@master] sre.hosts.reimage: handle switches without virtual chassis

https://gerrit.wikimedia.org/r/734571

Sent a patch to fix the issue (the new script was expecting the switch to have a virtual chassis).
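
For illustration, the kind of guard that avoids this AttributeError might look like the sketch below. It is not the content of change 734571, which is not reproduced in this task: `Device` and `VirtualChassis` are stand-in classes for the Netbox objects, `switch_short_hostname` is a hypothetical helper, and falling back to the device's own name is an assumption about the desired behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VirtualChassis:
    """Stand-in for a Netbox virtual chassis record (hypothetical)."""
    name: str


@dataclass
class Device:
    """Stand-in for a Netbox switch device (hypothetical)."""
    name: str
    virtual_chassis: Optional[VirtualChassis] = None


def switch_short_hostname(device: Device) -> str:
    """Return the short switch name, with or without a virtual chassis."""
    if device.virtual_chassis is not None:
        # Stacked switch: the virtual chassis carries the name.
        fqdn = device.virtual_chassis.name
    else:
        # Standalone switch: use the device's own name (assumed fallback).
        fqdn = device.name
    return fqdn.split(".")[0]


# Example names below are made up for the demonstration.
standalone = Device(name="cloudsw-example.mgmt.example.org")
stacked = Device(
    name="member1.example.org",
    virtual_chassis=VirtualChassis(name="asw-example.mgmt.example.org"),
)
print(switch_short_hostname(standalone))
print(switch_short_hostname(stacked))
```

The original line in the traceback dereferenced `virtual_chassis.name` unconditionally, which fails as soon as the attribute is None.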

Change 734571 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: handle switches without virtual chassis

https://gerrit.wikimedia.org/r/734571

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

From the logs the cookbook was not able to find a reboot after the host was up in the Debian Installer environment. That usually means that the host got stuck in Debian Installer probably asking the user to answer a question or because of invalid/incompatible preseed or partman configurations.
And indeed connecting to the remote console it's on a Debian Installer screen that failed to setup the disks.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Thanks, @Volans, for reminding me that I had to redo the raid configuration with the new controller.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

The BIOS changed to continuously boot from the NIC, failing the installation. Fixed BIOS and attempting the install again

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1021 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110281533_cmjohnson_29391_cloudcephosd1021.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

finally done!

Change 736575 had a related patch set uploaded (by Andrew Bogott; author: David Caro):

[operations/cookbooks@wmcs] sre.hosts.reimage: handle switches without virtual chassis

https://gerrit.wikimedia.org/r/736575

Change 736575 abandoned by Andrew Bogott:

[operations/cookbooks@wmcs] sre.hosts.reimage: handle switches without virtual chassis

Reason:

git-review mishap

https://gerrit.wikimedia.org/r/736575

note: for cloudcephosd1022 in Netbox, manually updated Status from Planned to Staged.