elastic2033 without bootable devices available
Closed, ResolvedPublic

Description

Hi Papaul!

Today I tried to reboot elastic2033 since it was listed as down by icinga, but it gets stuck trying to PXE boot from the network. The boot logs show "No bootable devices were detected", which probably means something broke on the RAID side?

Event Timeline

I left the host in the System Config panel so it will not keep trying to PXE, so it needs a power reset to start investigations :)

Note that elastic2033 is using software RAID. The data should be on RAID0, but the root partition on RAID1.

The other thing that may have happened is that the MBR was installed on only one of the two disks of the RAID1, so now nothing boots. IIRC PXE wasn't able to start either, otherwise I'd have proposed booting a rescue image to inspect the two disks.
All these failures smell like a major host problem..
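
For whoever ends up inspecting the disks from a rescue image: a minimal sketch of how the "MBR only on one disk" theory could be checked, assuming plain MBR-style boot sectors. The helper names (`inspect_mbr`, `inspect_device`) are made up for illustration, and the checks are only heuristics: a 0x55AA signature and non-zero bytes in the first 440 bytes suggest boot code is present, but do not prove the bootloader is complete or working.

```python
SECTOR_SIZE = 512  # size of a classic MBR sector


def inspect_mbr(sector: bytes) -> dict:
    """Run two heuristic checks on the first sector of a disk."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes of the first sector")
    return {
        # Bytes 510-511 hold the boot signature 0x55 0xAA.
        "boot_signature": sector[510:512] == b"\x55\xaa",
        # Bytes 0-439 hold the bootstrap code; all zeros means no boot code.
        "boot_code_present": any(sector[:440]),
    }


def inspect_device(path: str) -> dict:
    """Read the first sector of a block device (needs root) and inspect it."""
    with open(path, "rb") as f:
        return inspect_mbr(f.read(SECTOR_SIZE))
```

Running `inspect_device` on each of the two RAID1 member disks and comparing the results would show whether one of them has an empty boot code area.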

Papaul subscribed.

@elukey the server is back up. All yours

pt1979@elastic2033:~$ cat /proc/mdstat
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks

md0 : active raid1 sdb1[1] sda1[0]
      29279232 blocks super 1.2 [2/2] [UU]
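
The output above shows both arrays assembled, with the RAID1 reporting `[2/2] [UU]`. As a rough sketch (not something that was run on the host), the same check could be scripted as below. `parse_mdstat` is a hypothetical helper, and it only handles the simple line layout shown in this output; inactive arrays or extra state flags would need more parsing.

```python
import re


def parse_mdstat(text: str) -> dict:
    """Parse simple /proc/mdstat output into a dict keyed by array name."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # Array header, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        header = re.match(r"^(md\d+) : (\S+) (\S+) (.*)$", line)
        if header:
            name, state, level, members = header.groups()
            current = {
                "state": state,
                "level": level,
                "members": re.findall(r"(\w+)\[\d+\]", members),
                "degraded": False,
            }
            arrays[name] = current
            continue
        # Status line of redundant arrays, e.g. "... [2/2] [UU]":
        # [expected/active] devices, then one flag per slot (U = up, _ = missing).
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current is not None and status:
            expected, active, flags = status.groups()
            current["degraded"] = int(active) < int(expected) or "_" in flags
    return arrays
```

Feeding it the contents of `/proc/mdstat` gives, for each array, its level, member partitions, and whether a redundant array is running with a missing member. RAID0 arrays have no status line, so they are never reported as degraded by this sketch.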

Mentioned in SAL (#wikimedia-operations) [2021-05-05T23:35:18Z] <ryankemper> T281621 T281327 [Elastic] Banned elastic2033 and elastic2043 from the Cirrussearch Elasticsearch clusters

@Papaul what did you do to fix it?? (curious)

Thanks!

@RKemper I restarted the failed prometheus units on the node to clear icinga, but puppet is still disabled. Can you enable it when you have a moment, if that's ok? (I didn't want to do it in case you were working on it)

Looks like I didn't comment back here but I re-enabled puppet on May 5. The host has been healthy since then.