
eqiad: Server moves to free up space on 10g racks
Closed, Resolved · Public · Request

Description

Hi John, it looks like we're short 7x 2u positions in 10g racks to complete T260445. This task is to identify 1g servers that can be moved out of their existing 10g rack locations to make this happen. Please provide info showing the old --> new rack locations and switch ports, along with proposed dates for the server moves. A summary of the proposed rack moves is in:
https://docs.google.com/spreadsheets/d/1om4K2iy2yx6dfQ6DGufMSDf2luoyn2zxQHyZTXJEo-4/edit#gid=1862899982

In addition, here's a quick summary of upcoming hardware installs that will require 10g over the next couple of quarters:

Q2:

  • logstash[1020-1022] refresh - 3x 10g ports across 3u

Q3:

  • Backup storage expansion for Swift objects - 4x 10g ports across 18u
  • special slaves, vslow, test host (9 servers) - 9x 10g ports across 9u
  • Bacula expansion - 1x 10g port across 3u
  • kafka-logging100[123] - 3x 10g ports across 3u
  • cloudvirt expansion (3+1 nodes) - 8x 10g ports across 4u (wmcs rack)
  • ceph expansion (6+3 nodes) - 18x 10g ports across 9u (wmcs rack)

Q4:

  • eqiad: mc[1019-1036] refresh - 18x 10g ports across 18u
  • eqiad: SDC/SDAW? - 7x 10g ports across 7u

Thanks,
Willy

Event Timeline

wiki_willy renamed this task from "eqiad: Server Moves to Free up 7x 2u Spaces on 10g Racks" to "eqiad: Server Moves to Free up Space on 10g Racks". Nov 2 2020, 8:43 PM
wiki_willy updated the task description.
Krinkle renamed this task from "eqiad: Server Moves to Free up Space on 10g Racks" to "eqiad: Server moves to free up space on 10g racks". Nov 3 2020, 4:17 AM

These are all 1G servers in 10G racks for row A

A2: db1074, db1075, db1079, db1080, db1081, db1082, es1011 (2U), es1012 (2U)
A4: stat1004, logstash1020, wdqs1003, ganeti1005, contint1001, db1111, maps1001, analytics1070 (2U), snapshot1005, kubestage1001, scb1001, aqs1004, druid1001
A7: mw1269, mw1270, mw1271, mw1272, mw1273, mw1274, mw1275, mw1276, mw1277, mw1278, mw1279, mw1281, mw1282, mw1283

These are all 1G servers in 10G racks for row B

B2: db1099, analytics1072 (2U)
B4: elastic1050, elastic1049, conf1005, kublog1002, maps1002
B7: wtp1031, wtp1032, wtp1033, druid1005, ores1003, cloudcontrol1004, mw1313, mw1314, mw1315, mw1316

Row C

C2: es1016 (2U), db1100, db1101, analytics1064 (2U), analytics1065 (2U), analytics1066 (2U), db1087, db1088, labnet1004 (2U), es1015 (2U), analytics1074 (2U), db1108
C4: ores1006, mwlog1001, snapshot1006, deploy1001, labsdb1010
C7: francium, polonium, scb1003, elastic1051, elastic1052, wtp1040, wtp1041, wtp1042

Racks D2 and D7 are 100% 10G, but they were built that way from the start. D4 was just converted to 10G.

D4
db1114
ores1008
mc1033
mc1034
mc1035
mc1036
aqs1006
druid1003
labweb1002
wtp1046
wtp1047
wtp1048
snapshot1007
restbase1030
elastic1064
conf1006
puppetmaster1002

@wiki_willy I had time to do this today while the Dell tech worked on an-presto1004. I am going to use a 2U space in each of A2 and B2 for the kafka-jumbo 10G upgrades, leaving only 15x 2U spaces, so we will have fewer than I previously reported. I am also pasting what I put in the an-worker ticket here for better tracking.

@wiki_willy I do not have enough 10G rack space to fit 24x 2U servers. Currently, I have 17x 2U spaces in 10G racks; this is all I have left for servers of this size.

A2 - 1
B2 - 4
B4 - 4
C2 - 2
D4 - 6

Consolidated all the info @Cmjohnson provided into a Google doc, so we can add the service owners of the hosts and track future rack locations, etc. below:

https://docs.google.com/spreadsheets/d/1om4K2iy2yx6dfQ6DGufMSDf2luoyn2zxQHyZTXJEo-4/edit#gid=1862899982

@elukey Hey, when you get a chance, can you let me know the best day next week to schedule some moves with you?

I had a chat with John over IRC this past week and we decided to meet tomorrow to move some servers. Here are the constraints I have for racking:

ROW A:

  • stat1004 - fine to move it to any row A rack, except (if possible) A6 where stat1008 is. This one needs ~48h of user notification before it can be moved.
  • aqs1004 - fine to move it to any row A rack, except (if possible) A6 where aqs1007 is.
  • druid1001 - fine to move it to any row A rack, except (if possible) A5 where an-druid1001 is.

ROW B:

  • druid1005 - fine to move it anywhere in row B except B3.

ROW C:

  • db1108 - fine to move it to any row C rack, except (if possible) C7 where an-coord1002 is.

ROW D:

  • aqs1006 - fine to move it to any row D rack, except (if possible) D3 where aqs1009 is.

Going to make a separate list for the Hadoop worker nodes:

  • analytics1070 - A4
  • analytics1072 - B2
  • analytics1064 - C2
  • analytics1065 - C2
  • analytics1066 - C2
  • analytics1074 - C2

Can I have a list of the proposed rack moves before we proceed, so I can check that they are OK first? We have precise racking settings for Hadoop to ensure data is spread across racks in a balanced way, and I'd like to keep it that way if possible. I'd also need to change some settings before starting :)
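
For context, the balancing mentioned above comes from HDFS rack awareness: the NameNode resolves each worker to a rack via a topology script (the kind configured through net.topology.script.file.name) and spreads block replicas across racks accordingly, which is why the rack assignments need to be reviewed before the moves. A minimal sketch of such a script is below; the host-to-rack entries are illustrative only, not the actual production mapping (which is managed in puppet).

```python
#!/usr/bin/env python3
# Minimal sketch of an HDFS rack-awareness topology script, i.e. the kind of
# script referenced by net.topology.script.file.name in core-site.xml.
# The host-to-rack entries below are illustrative only.
import sys

RACK_BY_HOST = {
    # hypothetical entries reflecting the kind of moves discussed in this task
    "analytics1070.eqiad.wmnet": "/eqiad/A/A5",
    "analytics1072.eqiad.wmnet": "/eqiad/B/B3",
    "analytics1064.eqiad.wmnet": "/eqiad/C/C3",
}

DEFAULT_RACK = "/default-rack"

# Hadoop invokes the script with one or more hostnames/IPs as arguments and
# expects one rack path per argument, in the same order.
for host in sys.argv[1:]:
    print(RACK_BY_HOST.get(host, DEFAULT_RACK))
```

Hosts missing from the mapping fall back to the default rack, which defeats the balancing, so the topology data needs to be updated before the moved workers come back up.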

@Jclark-ctr if you can give me the network ports you intend to use I will have them pre-configured as well.

@elukey Please review the racks I have recommended and let me know if anything needs to change.
@Cmjohnson Will wait until Luca gives the OK before configuring ports.

Old rack, host, new rack, unit, switch port

Row A
A4 stat1004 A5 U32 port32
A4 analytics1070 A5 U37 port37
A4 aqs1004 A3 U13 port12
A4 druid1001

Row B
B2 analytics1072 B3 U40 port15
B4 conf1005 B3 U36 port14
B7 druid1005 B6 U26 port25

Row C
C2 analytics1064 C3 U35 port29
C2 analytics1065 C3 U37 port30
C2 analytics1066 C3 U39 port31
C2 analytics1074 C3 U6 port14
C2 db1108 C3 U34 port34

Row D
D4 aqs1006 D6 U34 port34
D4 druid1003 D6 U33 port33
D4 conf1006 D6 U34 port34

@Jclark-ctr I checked and I have only a couple of comments:

  1. B7 druid1003 B6 U26 port25 - this is druid1005, right?
  2. Could we place druid1001 in a rack other than A6? There is another Druid node in there, so I'd prefer to keep them in separate racks if possible.

Also I need to coordinate with @Joe for the conf100x hosts, but the racking looks fine.

Change 640448 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Update network topology for Hadoop worker nodes

https://gerrit.wikimedia.org/r/640448

@wiki_willy Hi! Do we have an estimate of how long it will take to move the hosts and free up space for the new Hadoop worker nodes? I am asking since I'd need them racked this month if possible (I can help with the bootstrap / OS install etc., of course), otherwise I'll make other plans :) Thanks!

@Cmjohnson would you be able to configure the switch ports?
A4 stat1004 A5 U32 port32
A4 analytics1070 A5 U37 port37
A4 aqs1004 A3 U13 port12
A4 druid1001

Row B
B2 analytics1072 B3 U40 port15
B4 conf1005 B3 U36 port14
B7 druid1005 B6 U26 port25

Row C
C2 analytics1064 C3 U35 port29
C2 analytics1065 C3 U37 port30
C2 analytics1066 C3 U39 port31
C2 analytics1074 C3 U6 port14
C2 db1108 C3 U34 port34

Row D
D4 aqs1006 D6 U34 port34
D4 druid1003 D6 U33 port33
D4 conf1006 D6 U34 port34

Hi @elukey - it's pretty close. Once @Cmjohnson and @Jclark-ctr work out the configuration on the switch ports, you should be good to go. Thanks, Willy

Change 640448 merged by Elukey:
[operations/puppet@production] Update network topology for Hadoop worker nodes

https://gerrit.wikimedia.org/r/640448

For what it's worth:

  • conf*, kubestage*, mw*, scb*, wtp*, ores*, restbase* can all be taken offline for extended periods of time.
  • ganeti1005 will need to be emptied of VMs (I'll need 24h advance notice).
  • maps* might be problematic given the current state of the infrastructure, and could lead to an outage that is difficult to recover from. Adding @hnowlan for input/advice.

Mentioned in SAL (#wikimedia-operations) [2020-11-23T16:37:30Z] <elukey> move analytics1070 from rack A7 to rack A5 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-23T17:12:19Z] <elukey> move aqs1004 from rack A4 to A3 - T267065

maps* will be a slight issue - this cluster is underprovisioned at the moment and removing them will cause instability. However, neither host is a master, so moving them will not cause data loss. Depending on when this happens, I could have more capacity in place beforehand to head it off. Do you have an estimate for how long the move will take?

Mentioned in SAL (#wikimedia-operations) [2020-11-24T14:58:48Z] <elukey> move analytics1072 from rack B2 to B3 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-24T15:38:11Z] <elukey> move druid1005 from rack B7 to B6 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-11-24T16:29:15Z] <elukey> move analytics1064 from C2 to C3 eqiad - T267065

@wiki_willy re: upcoming 10g rack space needed, there is also T260445 (24 Hadoop worker nodes) :)

All DBs with hostnames below db1095 will be replaced by new ones, so those will go away (T258361).
es1011 (2U) has been decommissioned (T268100)
es1012 (2U) has been decommissioned (T268101)
es1015 (2U) will be decommissioned in a few days (T268810)
es1016 (2U) will be decommissioned in a few days (T268812)

Most of the other databases with hostnames higher than db1095 (as long as they are not masters) can probably be moved to 1G racks if needed. We'd need to schedule it with DC-Ops, but it can be done.

Hi @elukey, I have the 24x worker nodes covered in the first part of the task description. We're looking pretty good after the recent moves, so @Cmjohnson should be able to start getting those set up next week. Thanks, Willy

Awesome news thanks a lot!

Mentioned in SAL (#wikimedia-operations) [2020-12-03T11:46:06Z] <elukey> move druid1001 to rack A1 - T267065

Mentioned in SAL (#wikimedia-operations) [2020-12-03T15:45:59Z] <elukey> moved conf1005 to rack B3 - T267065

Note for conf1006 - this node is set as the target for PyBal in ulsfo, and after a chat with @akosiaris it is not clear what happens if PyBal gets restarted while conf1006 is down. If possible, let's schedule this move last; if we want to proceed with it sooner, we'd need some puppet changes / PyBal restarts to free conf1006 and allow a safer rack move.
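
One way to sanity-check this dependency before the move is to look for references to conf1006 in the PyBal configuration on the ulsfo load balancers. The sketch below is only illustrative: the /etc/pybal path and the idea that the conf host shows up by hostname in the config files are assumptions to confirm on the actual hosts.

```python
#!/usr/bin/env python3
# Rough sketch: scan PyBal configuration files for references to conf1006
# before the rack move. The /etc/pybal path and the assumption that the conf
# host appears by name in the config are both things to verify on the real
# load balancers.
import pathlib

CONF_HOST = "conf1006"

for cfg in sorted(pathlib.Path("/etc/pybal").rglob("*")):
    if not cfg.is_file():
        continue
    try:
        text = cfg.read_text()
    except (OSError, UnicodeDecodeError):
        continue
    for lineno, line in enumerate(text.splitlines(), start=1):
        if CONF_HOST in line:
            print(f"{cfg}:{lineno}: {line.strip()}")
```

If nothing references conf1006 anymore after the puppet changes, the rack move should be safer from the PyBal side.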

no longer needed for 10g space in eqiad