
rack/setup/install cloudvirt102[34]
Closed, Resolved · Public

Description

This task will track the racking and setup of two new cloudvirt systems, cloudvirt1023 and cloudvirt1024.

Racking Proposal: These have to go in row B along with the other labvirts. If they have 10G capability, place in 10G racks.

cloudvirt1023:

  • - receive in system on procurement task T192119
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (jessie)
  • - puppet accept/initial run
  • - handoff for service implementation

cloudvirt1024:

  • - receive in system on procurement task T192119
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (jessie)
  • - puppet accept/initial run
  • - handoff for service implementation

Event Timeline


So, I've gone ahead and updated the puppet repo for the installation, and they successfully PXE boot into the jessie installer. Unfortunately, this is where we hit an issue.

It seems the network card in these (QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter, Low Profile) doesn't have built-in support in jessie. It does in stretch, which is why we hadn't seen this issue yet on our other Dell 14G server installations.

│ No Ethernet card was detected. If you know the name of the driver       │
│ needed by your Ethernet card, you can select it from the list.          │

Which is amusing, since it is using the network card to boot said image.

Who is familiar with adding this support to our installer? Since cloudvirts use jessie, this won't be the last time we experience this issue (all new cloudvirt servers are going to have this newer network card; the old card isn't available in the new Dell 14G server line).

paste of the lspci output:

P7435

This shows:

af:00.0 Ethernet controller: QLogic Corp. Device 8070 (rev 02)
af:00.1 Ethernet controller: QLogic Corp. Device 8070 (rev 02)
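
For anyone hitting this later, a quick way to confirm the driver gap is to map that PCI ID to a kernel module on a host where the card does work. A minimal sketch, assuming a stretch host with a 4.9 kernel (the device ID 8070 comes from the lspci output above; 1077 is QLogic's PCI vendor ID):

# on a stretch host where the FastLinQ card is supported
lspci -nn | grep -i qlogic
# ... Ethernet controller [0200]: QLogic Corp. ... [1077:8070]
grep -i 'v00001077d00008070' /lib/modules/$(uname -r)/modules.alias
# should map to the qede driver (with its qed core module); jessie's 3.16
# installer kernel ships neither, hence the "No Ethernet card" prompt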

Ok,

@RobH let's assume we won't be using the 2x10G NICs in the short-mid term.
How many 1G NICs do these servers have? are they disabled in BIOS?

And for completeness: could we get the unsupported 2x10G NIC replaced by Dell (or by other spare NIC of our own)?

I've emailed Dell to see what our other 10G network card options are:

Dell Team,

We are experiencing a driver support issue on the 10G Network cards ordered back in May of this year. We placed this order, referencing T192119, via quote 3000024942289.2. This is a high priority for us, as these systems were ordered months ago and this is blocking their deployment.

These included QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter, Low Profile network cards. It seems this particular network card chipset is a bit too new to be supported in Linux distros like Debian Jessie (it only seems to be fully supported in newer versions like stretch). We went with this card, since we were told the older Broadcom cards were unavailable in the R440 line (or that was my understanding at the time).

However, in reviewing our other R440s, we ordered 16 systems off quote 3000025330054.2 with "Broadcom 57412 Dual Port 10Gb, SFP+, PCIe Adapter, Low Profile." Is this card available for replacement/use in our R440 cloudvirts ordered on quote 3000024942289.2?

If so, we would like a quote to order 2, to replace the network cards in service tags DZYYQP2 & DZYZQP2.

So, we need to know all our network card options for DZYYQP2 & DZYZQP2, as we want to replace the QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter. We need to have 2 10Gb sfp+ compatible network ports per system.

Please advise,

Ok,

@RobH let's assume we won't be using the 2x10G NICs in the short-mid term.
How many 1G NICs do these servers have? are they disabled in BIOS?

It has 2 1G network ports, disabled in the BIOS since the NIC ordering doesn't play well with the 10G cards and adds complexity. Please note that if/when we replace the 10G cards, it will require full downtime on the hosts to do so. It may also require a re-image, since we need to ensure the 10G cards work from installation onwards (simply adding them to an already installed host is not a proper test).

And for completeness: could we get the unsupported 2x10G NIC replaced by Dell (or by other spare NIC of our own)?

Email sent (see prior comment just before this one)

Ok,

@RobH let's assume we won't be using the 2x10G NICs in the short-mid term.
How many 1G NICs do these servers have? are they disabled in BIOS?

It has 2 1G network ports, disabled in the BIOS since the NIC ordering doesn't play well with the 10G cards and adds complexity. Please note that if/when we replace the 10G cards, it will require full downtime on the hosts to do so. It may also require a re-image, since we need to ensure the 10G cards work from installation onwards (simply adding them to an already installed host is not a proper test).

Ok, please enable the 2x1G NICs. We will be using them instead of the 2x10G NICs, so please unplug the 10G ones. No problem with the reimage; these servers haven't been put into production yet.
I hope that by the time we actually need 2x10G, we will have other hardware or newer software.

And for completeness: could we get the unsupported 2x10G NIC replaced by Dell (or by other spare NIC of our own)?

Email sent (see prior comment just before this one)

Thanks!

Ok, sub-task T202345 has the 10G NICs, but @aborrero's comment makes me wonder why we're ordering any cloudvirts with 10G NICs if they are not needed? (We were led to think they would be needed, and have gotten 10G whenever possible for them.) Discussion on pricing should go to sub-task T202345.

RobH mentioned this in Unknown Object (Task). Aug 21 2018, 4:53 PM

@Cmjohnson

Please relocate these two machines into racks with 1G (not 10G) connections. (It seems silly to leave them in 10G racks since they won't ever use them.)

The cloud team (per the recommendation of @aborrero & @Andrew) have stated 10G is not needed on cloudvirt machines, and they can simply use their 1G connections. As such, relocate these two machines to 1G racks, plug in their eth0 and eth1, and disable the 10G NICs. Once that is done (and ports updated on the switch stack), assign back to me for install.

@chasemp two questions: 1) was there a reason we requested these with 10G? (Or, did we?) 2) Is it important that these be in a particular rack for neutron purposes?

Moved cloudvirt1023 to B1 ports ge-1/0/8 and ge-1/0/10 and cloudvirt1024 to B8 ports ge-8/0/22 and 8/0/23. BIOS will need updating to enable 1G ports again. Network switches have not been updated yet.

@chasemp two questions: 1) was there a reason we requested these with 10G? (Or, did we?) 2) Is it important that these be in a particular rack for neutron purposes?

10G in cloud[lab]virts is a weird thing that began before I took on procurement things, iirc. Basically, somewhere along the line 10G crept into the specs and we have removed it and re-added it a few times. The cloudNET servers legitimately have 10G, and cloudvirt may want 10G at some point, but it's never been a clear necessity and I'm honestly not sure of the why/when of it getting into the spec pool, other than we continually do "order more like x" and x got 10G at some point.

Rack doesn't matter as long as it's in row B, to my knowledge, other than some racks may have 1G and some may not? IDK

Moved cloudvirt1023 to B1 ports ge-1/0/8 and ge-1/0/10 and cloudvirt1024 to B8 ports ge-8/0/22 and 8/0/23. BIOS will need updating to enable 1G ports again. Network switches have not been updated yet.

ACK

Ok, switch port update diff:

robh@asw2-b-eqiad# show | compare 
[edit interfaces interface-range disabled]
     member ge-3/0/2 { ... }
+    member xe-2/0/5;
+    member xe-7/0/22;
[edit interfaces interface-range vlan-cloud-hosts1-b-eqiad]
     member xe-7/0/13 { ... }
+    member ge-1/0/8;
+    member ge-8/0/22;
-    member xe-2/0/5;
-    member xe-7/0/22;
[edit interfaces interface-range vlan-cloud-instances1-b-eqiad]
     member ge-3/0/19 { ... }
+    member ge-1/0/10;
+    member ge-8/0/23;
-    member ge-2/0/6;
-    member ge-7/0/23;
[edit interfaces]
+   ge-1/0/8 {
+       description "cloudvirt1023 eth0";
+   }
+   ge-1/0/10 {
+       description "cloudvirt1023 eth1";
+   }
-   ge-2/0/6 {
-       description "cloudvirt1023 eth1";
-   }
-   ge-7/0/23 {
-       description "cloudvirt1024 eth1";
-   }
+   ge-8/0/22 {
+       description "cloudvirt1024 eth0";
+   }
[edit interfaces ge-8/0/23]
-   description rdb1009;
+   description "cloudvirt1024 eth1";
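
(For the record, the usual way to apply a candidate config like this safely on Junos is a confirmed commit, which rolls itself back if the change cuts off access; this is a sketch of the pattern rather than the exact commands run here.)

robh@asw2-b-eqiad# commit confirmed 5
# the change auto-rolls back after 5 minutes unless it is confirmed
robh@asw2-b-eqiad# commit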

Moved cloudvirt1023 to B1 ports ge-1/0/8 and ge-1/0/10 and cloudvirt1024 to B8 ports ge-8/0/22 and 8/0/23. BIOS will need updating to enable 1G ports again. Network switches have not been updated yet.

Correction, it is as follows:

cloudvirt1023-eth0:ge-1/0/8
cloudvirt1023-eth1:ge-1/0/10
cloudvirt1024-eth0:ge-8/0/22
cloudvirt1024-eth1:ge-8/0/24

This is after IRC sync up with Chris, since ge-8/0/23 showed rdb1009 there.

Change 454692 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] cloudvirt102[34] mac address update

https://gerrit.wikimedia.org/r/454692

Change 454692 merged by RobH:
[operations/puppet@production] cloudvirt102[34] mac address update

https://gerrit.wikimedia.org/r/454692
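
For context, the "mac address update" above is the standard install_server step from the checklist: each host gets a DHCP entry keyed on the MAC of the NIC it PXE boots from. A minimal sketch of such an entry in ISC dhcpd syntax (the MAC is a placeholder, the FQDN is assumed, and the exact file layout in operations/puppet may differ; this is not the literal content of change 454692):

host cloudvirt1023 {
    hardware ethernet 00:00:00:00:00:00;      # placeholder; the real MAC is read from the NIC/iDRAC
    fixed-address cloudvirt1023.eqiad.wmnet;  # assumed production FQDN
}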

So these successfully boot into the jessie installer, but then seem to lack support for the PERC H740P hardware RAID controller:

  ┌──────────────────────────┤ [!] Detect disks ├───────────────────────────┐
  │                                                                         │
  │ No disk drive was detected. If you know the name of the driver needed   │
  │ by your disk drive, you can select it from the list.                    │
  │                                                                         │
  │ Driver needed for your disk drive:                                      │
  │                                                                         │
  │                     continue with no disk drive                         │
  │                     3w-9xxx                                             │
  │                     3w-sas                       ▒                      │
  │                     3w-xxxx                      ▒                      │
  │                     BusLogic                     ▒                      │
  │                     DAC960                       ▒                      │
  │                     aacraid                      ▒                      │
  │                     advansys                     ▒                      │
  │                     aic79xx                      ▒                      │
  │                     aic7xxx                                             │
  │                                                                         │
  │     <Go Back>                                                           │
  │                                                                         │
  └─────────────────────────────────────────────────────────────────────────┘

<Tab> moves; <Space> selects; <Enter> activates buttons

I've confirmed there is a raid10 of all 10 SSDs presenting in the raid bios. Testing by loading stretch instead, just to see if it sees the disk. (It should, as we have other H740P machines in stretch use, but this will confirm it's a jessie driver support issue.)

To head off potential questions: no, we cannot put an older model raid controller into the new 14G series; the form factor won't quite fit. Also, this is using the H7XX line, which is the line that is preferred and known to work reliably under load. (So I wouldn't recommend attempting a third-party raid controller.)
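
For anyone debugging the same "No disk drive was detected" prompt: switching to the installer's shell (usually Ctrl+Alt+F2) gives a quick check of whether the controller was bound to a driver at all. A sketch; whether lspci is available in the d-i environment varies:

# from the debian-installer busybox shell
dmesg | grep -i megaraid                         # no output => megaraid_sas never claimed the H740P
ls /lib/modules/*/kernel/drivers/scsi/megaraid/  # is the module even shipped in this installer?
lspci -nn | grep -i raid                         # if lspci is present, shows the controller's PCI ID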

Same thing in stretch. This is odd, since we must have installed other R440s with H740P controllers....

RobH closed subtask Unknown Object (Task) as Declined. Aug 22 2018, 9:59 PM

So, these are booting and they have a single raid10 setup of all the disks. However, they get to the point in the installer for disk support, and prompt about not seeing the correct driver.

Installer log: P7475

Same thing in stretch. This is odd, since we must have installed other R440s with H740P controllers....

Can you double-check past orders whether we have bought this controller before?

Same thing in stretch. This is odd, since we must have installed other R440s with H740P controllers....

Can you double-check past orders whether we have bought this controller before?

I've been looking and so far the only systems with them are also not yet installed, I'll keep looking.

I've backported support for that driver to the stretch 4.9 kernel; it's a series of 18 patches, and the kernel is at https://people.wikimedia.org/~jmm/megasas/

@RobH, @Cmjohnson: For simple testing of that driver, could we plug a spare SATA disk into one of the new servers and install a basic stretch with that (it doesn't need a role assigned)? That would allow a generic installation to complete, and then we can test the new kernel/controller properly.

If the controller works fine with the test kernel, I'll open a pull request for review/merge into the stretch kernel; once it's there, it will also trickle down into the 4.9 jessie backport. There are also a number of people who've requested this backport for stretch in the Debian BTS; I'll contact them for testing as well (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890034, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890393, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=891067, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=891713)
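
A sketch of how that test would typically go once a basic stretch OS is on the box (the filename below is a placeholder; the actual package names come from the directory listing at the URL above):

# install the rebuilt kernel package and reboot into it
sudo dpkg -i linux-image-<version>-amd64.deb
sudo reboot
# after reboot
uname -r                        # confirm the test kernel is the one running
dmesg | grep -i megaraid        # the PERC H740P should now be claimed by megaraid_sas
lsblk                           # the hardware RAID virtual disk should show up as a block device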

faidon raised the priority of this task from Medium to High. Aug 24 2018, 2:20 PM
faidon mentioned this in Unknown Object (Task).

@MoritzMuehlenhoff: Just to confirm, you'd like us to take a third SSD or SATA SFF disk, install it into the system, and cable that bay to the onboard controller rather than to the raid controller?

I'm not sure that is possible, since I'm not sure what cables ship within the chassis to accomplish this. If this is what we want to do, we'll need to sync up with @Cmjohnson on Monday to check.

@MoritzMuehlenhoff: Just to confirm, you'd like us to take a third SSD or SATA SFF disk, install into the system, and cable that bay from to the onboard controller rather than to the raid controller?

That, or anything else which allows us to get a basic stretch OS installed on the server with the current d-i, so that I can test the new kernel debs. The current kernels use a different ABI than what is being used in the last release of the debian-installer, so it's not trivial to replace the megaraid driver in our d-i image. Also, there's still the open question of whether the updated megaraid driver will also require an updated version of firmware-qlogic, and these tests are much easier if done on a full-blown OS installation instead of busybox only. When all of this has been tested/sorted out, we can revert the cabling/temp disk and look into building a d-i image which will allow us to install these servers without manual fixups.
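
On the firmware question, one cheap check once the test host is up is to ask the modules which firmware blobs they declare. A sketch, assuming the backported modules are installed on the test host; empty output means the module declares no external firmware:

modinfo megaraid_sas | grep -i firmware
modinfo qed qede 2>/dev/null | grep -i firmware   # the FastLinQ drivers are the ones likely to want blobs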

I took up the raid controller offer to see what I could do to connect a disk to the onboard raid controller. I need SATA cables. These cables look different than what I think I have. It could be an illusion. What do you think? Picture attached: F25467621

I took up the raid controller offer to see what I could do to connect a disk to the onboard raid controller. I need SATA cables. These cables look different than what I think I have. It could be an illusion. What do you think? Picture attached: F25467621

Can you just pick up the right cable at a Best Buy or something? If so, that is easier than going back to Dell. Please advise.

I connected a SATA cable and set the SATA setting in the BIOS to AHCI, with auto for disks... the raid controller is still connected as normal. You will have to change the virtual disk and probably change a disk to non-raid mode. Try that and let me know if it works. I borrowed the cable from wmf7421.

RobH mentioned this in Unknown Object (Task). Sep 5 2018, 4:05 PM

I tried an installation from cloudvirt1023, but the PXELINUX version used with the Broadcom NIC is affected by a bug in syslinux 6.03 and fails to fetch the install image:

Booting from BRCM MBA Slot 0400 v20.6.1

Broadcom UNDI PXE-2.1 v20.6.1
Copyright (C) 2000-2017 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: D0 94 66 61 17 C7  GUID: 4C4C4544-005A-5910-8059-C4C04F515032
CLIENT IP: 10.64.20.42  MASK: 255.255.255.0  DHCP IP: 208.80.154.22
GATEWAY IP: 10.64.20.1

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Failed to load ldlinux.c32

This is confirmed here: https://www.syslinux.org/archives/2015-September/024305.html and fixed in 6.04: https://www.syslinux.org/wiki/index.php?title=Syslinux_6_Changelog

@Cmjohnson : Can you check whether there are firmware updates available which fix this? Server can be shut down anytime if that is needed for the update

Change 458463 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Disable fetching the netboot image via HTTP for cloudvirt1023

https://gerrit.wikimedia.org/r/458463

I also tried to disable an HTTP-based PXE boot via https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458463/, but that didn't work either, same symptoms as above.

I tried an installation from cloudvirt1023, but the PXELINUX version used with the Broadcom NIC is affected by a bug in syslinux 6.03 and fails to fetch the install image:

Booting from BRCM MBA Slot 0400 v20.6.1

Broadcom UNDI PXE-2.1 v20.6.1
Copyright (C) 2000-2017 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: D0 94 66 61 17 C7  GUID: 4C4C4544-005A-5910-8059-C4C04F515032
CLIENT IP: 10.64.20.42  MASK: 255.255.255.0  DHCP IP: 208.80.154.22
GATEWAY IP: 10.64.20.1

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Failed to load ldlinux.c32

This is confirmed here: https://www.syslinux.org/archives/2015-September/024305.html and fixed in 6.04: https://www.syslinux.org/wiki/index.php?title=Syslinux_6_Changelog

@Cmjohnson : Can you check whether there are firmware updates available which fix this? Server can be shut down anytime if that is needed for the update

So I've checked for firmware updates on https://www.dell.com/support/home/us/en/04/product-support/servicetag/dzyyqp2/drivers

Since cloudvirt1023 and cloudvirt1024 are identical, they also have the same firmware revisions. I'm updating them to the newest versions. However, I don't see an update for the Broadcom onboard 1G NIC available for download.

I do see one for the QLogic 10G cards (not used so no update needed anyhow), raid controller (updating) and bios (updating). Will post when they are both fully updated.

raid and bios updated, there are no broadcom updates available.

So this PXE boots into the currently deployed installer, but lacks raid controller support. It also seems there is no firmware fix for:

I tried an installation from cloudvirt1023, but the PXELINUX version used with the Broadcom NIC is affected by a bug in syslinux 6.03 and fails to fetch the install image:

Booting from BRCM MBA Slot 0400 v20.6.1

Broadcom UNDI PXE-2.1 v20.6.1
Copyright (C) 2000-2017 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: D0 94 66 61 17 C7  GUID: 4C4C4544-005A-5910-8059-C4C04F515032
CLIENT IP: 10.64.20.42  MASK: 255.255.255.0  DHCP IP: 208.80.154.22
GATEWAY IP: 10.64.20.1

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Failed to load ldlinux.c32

This is confirmed here: https://www.syslinux.org/archives/2015-September/024305.html and fixed in 6.04: https://www.syslinux.org/wiki/index.php?title=Syslinux_6_Changelog

@Cmjohnson : Can you check whether there are firmware updates available which fix this? Server can be shut down anytime if that is needed for the update

Not sure what the next step is for this; hopefully @MoritzMuehlenhoff can advise.

Change 458556 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] set cloudvirt1023 to install stretch via tftp not http

https://gerrit.wikimedia.org/r/458556

Change 458556 merged by RobH:
[operations/puppet@production] set cloudvirt1023 to install stretch via tftp not http

https://gerrit.wikimedia.org/r/458556
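
As a generic illustration of what a TFTP-vs-HTTP netboot switch amounts to (this is not the literal content of change 458556; the real mechanism lives in the operations/puppet install_server module): in ISC dhcpd the boot loader filename can be overridden per host, so a machine whose PXE stack chokes on the HTTP-capable lpxelinux.0 can be pointed back at the plain TFTP pxelinux.0:

host cloudvirt1023 {
    hardware ethernet 00:00:00:00:00:00;   # placeholder MAC, as in the earlier sketch
    fixed-address cloudvirt1023.eqiad.wmnet;
    filename "pxelinux.0";                 # plain-TFTP loader instead of the lwIP/HTTP variant
}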

Change 458463 abandoned by Muehlenhoff:
Disable fetching the netboot image via HTTP for cloudvirt1023

Reason:
This was solved by Arzhel at the network level

https://gerrit.wikimedia.org/r/458463

As I had made a backport of the megaraid_sas driver for Perc 740/840 to the 4.9 stretch kernel anyway, I ran some tests on backup2001 (which has the new controller) and acamar (which has an older Perc controller running the megaraid_sas driver), which were successful. Submitted to the Debian kernel team in https://salsa.debian.org/kernel-team/linux/merge_requests/61

(To avoid confusion: We'll still swap the RAID controller/NICs for compatible parts, there's no backport of the QLogic 41xx NIC for 4.9)

Can I get an update on this? Are we blocked waiting for Dell to ship us RAID controllers?

I replaced the old raid card in cloudvirt1023 with the new one

@Cmjohnson, is there a card for 1024 as well and you're waiting to hear whether 1023 is a success?

@Cmjohnson, is there a card for 1024 as well and you're waiting to hear whether 1023 is a success?

@Andrew: Dell screwed up and shipped out both cards (according to the packing slip) but only one card arrived, so they are sending another card. Ideally we'd have swapped both at the same time, but I saw no reason to hold up cloudvirt1023 waiting on the controller for cloudvirt1024.

So, cloudvirt1023 is now installed on jessie, with its puppet cert signed and puppet running.

Chris,

Please install the replacement H730P into cloudvirt1024 when it arrives this Friday, then assign this task back to me.

Thanks!

Change 466817 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Make cloudvirt1023 a compute node

https://gerrit.wikimedia.org/r/466817

Change 466817 merged by Andrew Bogott:
[operations/puppet@production] Make cloudvirt1023 a compute node

https://gerrit.wikimedia.org/r/466817

Dell sent me a 10G NIC and not a raid card. They are rushing one out.

Chris installed the new raid controller; I'm taking this for installation.

Both these hosts are now up and running VMs.