
DC-Ops (Group)
Active · Public

Description

Tasks handled by the Wikimedia Foundation's datacenter operations team, which is a sub-team of the SRE department.

This project includes the sub-projects procurement and decommission-hardware, as well as every datacenter site-specific project: ops-codfw, ops-drmrs, ops-eqdfw, ops-eqiad, ops-eqord, ops-esams, ops-eqsin, and ops-ulsfo.

This can be linked to via: https://phabricator.wikimedia.org/tag/dc-ops/

Please note that any Wikitech documentation maintained by DC-Ops is linked from https://wikitech.wikimedia.org/wiki/Dc-operations

SLAs

DC-Ops makes every attempt to resolve all tasks and requests in a timely manner. We've implemented the following SLA targets.

Please note that no SLA clock starts until both conditions are met: the start event defined for the task type has occurred, and the task carries the proper project tags. See the section for each type of task request below for details, and please use the templates listed below.

Project | Days to Resolve | SLA start | Template
procurement | 90 | Date of Task filing | Procurement Template
Racking/Installation | 30 | Arrival of Hardware to DC site | (none listed)
Hardware Failure / Repair | 10 | Date of Task filing | Hardware Failure Template
Decommission | 45 | When all sub-team steps are complete and task is assigned to on-site | Server Decommission Template
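
As a rough illustration of how these targets translate into dates, the sketch below adds the number of days for a task type to its SLA start date. It assumes the targets are counted in calendar days, which this page does not state, and the dictionary keys and function names are hypothetical labels chosen for the example, not Phabricator project names or an existing DC-Ops tool.

```python
from datetime import date, timedelta

# SLA targets in days, copied from the SLA targets listed above.
# The keys are hypothetical short labels used only in this sketch.
SLA_DAYS = {
    "procurement": 90,
    "racking": 30,
    "hardware-failure": 10,
    "decommission": 45,
}


def sla_due_date(task_type: str, sla_start: date) -> date:
    """Return the date on which the SLA target is reached.

    Assumption: targets are calendar days counted from the SLA start event.
    """
    return sla_start + timedelta(days=SLA_DAYS[task_type])


def days_remaining(task_type: str, sla_start: date, today: date) -> int:
    """Return days left before the target date (negative once it has passed)."""
    return (sla_due_date(task_type, sla_start) - today).days


if __name__ == "__main__":
    # Example: a hardware failure task filed on 19 April 2024.
    filed = date(2024, 4, 19)
    print(sla_due_date("hardware-failure", filed))
    print(days_remaining("hardware-failure", filed, date(2024, 4, 24)))
```

For a racking task, the start date passed in would be the hardware arrival date rather than the filing date, per the SLA start conditions above.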

Hardware Repair

If you need to file a task requesting hardware troubleshooting, please use the File Hardware Failure Task link here or in the navbar on the left.

Troubleshooting includes hardware failures, RAID reconfiguration, etc.

A full runbook on how to troubleshoot hardware failures can be viewed here: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook

Requesting Hardware

If you have a budget line item and want to request pricing, please file your procurement request via this link. If you do not yet have a budget line for the request in this fiscal year, you can still file via that link; just note in the budget section of the task that there is no budget allocation.

Once hardware has been ordered, a racking task must be entered using the form. This form may also be used if a system has to be moved and re-imaged.

Decommissioning Hardware

This covers all hardware being returned to DC-Ops, whether for processing into spares or for decommissioning and removal from the rack.

Any hardware no longer required for use should have a task filed for decommission via the pre-defined server decommission request form.

Recent Activity

Wed, Apr 24

Maintenance_bot added a project to T363399: Q4:rack/setup/install parsoidtest1001: SRE.
Wed, Apr 24, 8:29 PM · SRE, serviceops, ops-eqiad, DC-Ops
VRiley-WMF added a comment to T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005.

This was a duplicate ticket that was opened for https://phabricator.wikimedia.org/T360687

Wed, Apr 24, 7:58 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops
VRiley-WMF closed T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005 as Resolved.
Wed, Apr 24, 7:57 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops
VRiley-WMF claimed T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005.
Wed, Apr 24, 7:56 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops
RobH moved T363399: Q4:rack/setup/install parsoidtest1001 from Backlog to Racking Tasks on the ops-eqiad board.
Wed, Apr 24, 7:52 PM · SRE, serviceops, ops-eqiad, DC-Ops
RobH created T363402: parsoidtest1001 implementation tracking.
Wed, Apr 24, 7:52 PM · serviceops
RobH added a parent task for T363399: Q4:rack/setup/install parsoidtest1001: Unknown Object (Task).
Wed, Apr 24, 7:50 PM · SRE, serviceops, ops-eqiad, DC-Ops
RobH created T363399: Q4:rack/setup/install parsoidtest1001.
Wed, Apr 24, 7:49 PM · SRE, serviceops, ops-eqiad, DC-Ops
Maintenance_bot added a project to T359049: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet: SRE.
Wed, Apr 24, 6:29 PM · SRE, ops-eqiad, DC-Ops, Cloud-VPS, cloud-services-team
dcaro added a comment to T359049: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet.

\o/ the drive is listed now, will add it to the cluster (will take a bit), and close the task once it's in (tomorrow most probably), thanks!

Wed, Apr 24, 5:30 PM · SRE, ops-eqiad, DC-Ops, Cloud-VPS, cloud-services-team
Jclark-ctr added a project to T359049: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet: ops-eqiad.
Wed, Apr 24, 5:29 PM · SRE, ops-eqiad, DC-Ops, Cloud-VPS, cloud-services-team
Jclark-ctr added a comment to T359049: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet.

cloudcephosd1017 looks like the drive was listed as foreign I cleared the foreign status can you verify it now?

Wed, Apr 24, 5:21 PM · SRE, ops-eqiad, DC-Ops, Cloud-VPS, cloud-services-team
jcrespo claimed T361087: backup1005 crashed.

Will reimage soon.

Wed, Apr 24, 4:51 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
VRiley-WMF changed the status of T361087: backup1005 crashed from Open to In Progress.
Wed, Apr 24, 4:49 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
VRiley-WMF added a comment to T361087: backup1005 crashed.

We have received the PERC from Dell and I have just completed swapping it out. It now looks like the system can now see the PERC (previously, it wasn't). However, it does seem that the system will need to be rebuilt. @jcrespo would you be able to verify this? Thank you!

Wed, Apr 24, 4:08 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
Maintenance_bot added a project to T363341: Q4:rack/setup/install cloudcephosd10[39-41]: SRE.
Wed, Apr 24, 3:29 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Maintenance_bot added a project to T363344: Q4:rack/setup/install cloudcephosd10[35-38]: SRE.
Wed, Apr 24, 3:29 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
RobH moved T363344: Q4:rack/setup/install cloudcephosd10[35-38] from Backlog to Racking Tasks on the ops-eqiad board.
Wed, Apr 24, 3:21 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
RobH added a parent task for T363344: Q4:rack/setup/install cloudcephosd10[35-38]: Unknown Object (Task).
Wed, Apr 24, 3:21 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
RobH created T363344: Q4:rack/setup/install cloudcephosd10[35-38].
Wed, Apr 24, 3:20 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
dcaro reopened T359049: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet as "Open".

@Jclark-ctr the disk does not show up:

root@cloudcephosd1017:~# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                                                                                     8:0    0 223.6G  0 disk  
├─sda1                                                                                                  8:1    0   285M  0 part  
└─sda2                                                                                                  8:2    0 223.3G  0 part  
  └─md0                                                                                                 9:0    0 223.2G  0 raid1 
    ├─vg0-root                                                                                        253:0    0  74.5G  0 lvm   /
    ├─vg0-swap                                                                                        253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv                                                                                         253:9    0 103.1G  0 lvm   /srv
sdb                                                                                                     8:16   0 223.6G  0 disk  
├─sdb1                                                                                                  8:17   0   285M  0 part  
└─sdb2                                                                                                  8:18   0 223.3G  0 part  
  └─md0                                                                                                 9:0    0 223.2G  0 raid1 
    ├─vg0-root                                                                                        253:0    0  74.5G  0 lvm   /
    ├─vg0-swap                                                                                        253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv                                                                                         253:9    0 103.1G  0 lvm   /srv
sdc                                                                                                     8:32   0   1.8T  0 disk  
└─ceph--e2be6aeb--5322--46d1--bfab--4311bd82d700-osd--block--e41e0d5a--6de8--4514--be92--35f556688a21 253:4    0   1.8T  0 lvm   
sdd                                                                                                     8:48   0   1.8T  0 disk  
└─ceph--a4629599--a5e0--46c9--a3b5--629c5fa67cea-osd--block--aded60c5--058e--486f--b681--d193f100cd7d 253:8    0   1.8T  0 lvm   
sde                                                                                                     8:64   0   1.8T  0 disk  
└─ceph--5ac6969a--2e09--4a40--9590--5e6ce5d391a3-osd--block--9185ae65--08ea--4b09--91d5--921e78ce3a48 253:7    0   1.8T  0 lvm   
sdf                                                                                                     8:80   0   1.8T  0 disk  
└─ceph--cef3e76e--29ab--4727--9472--80c01446fdab-osd--block--d473469b--a18f--4641--8ee3--76137839db18 253:6    0   1.8T  0 lvm   
sdg                                                                                                     8:96   0   1.8T  0 disk  
└─ceph--85015f5b--c44b--4ed3--ac58--7d1f41ca887c-osd--block--b0c403f0--1043--4e32--a060--bedd130168aa 253:5    0   1.8T  0 lvm   
sdh                                                                                                     8:112  0   1.8T  0 disk  
└─ceph--9835625e--c798--4681--b167--90bf3c54d84c-osd--block--05917d6e--5a5c--437c--9e4a--8366c9922df0 253:3    0   1.8T  0 lvm   
sdi                                                                                                     8:128  0   1.8T  0 disk  
└─ceph--13635d88--e8cb--403b--97b3--5cb3706525a9-osd--block--0bb8ecb0--d0e4--467d--8c15--cef999192023 253:2    0   1.8T  0 lvm
Wed, Apr 24, 3:20 PM · SRE, ops-eqiad, DC-Ops, Cloud-VPS, cloud-services-team
RobH added a parent task for T363341: Q4:rack/setup/install cloudcephosd10[39-41]: Unknown Object (Task).
Wed, Apr 24, 3:13 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
RobH moved T363341: Q4:rack/setup/install cloudcephosd10[39-41] from Backlog to Racking Tasks on the ops-eqiad board.
Wed, Apr 24, 3:13 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
RobH created T363341: Q4:rack/setup/install cloudcephosd10[39-41].
Wed, Apr 24, 3:12 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Jclark-ctr added a comment to T362871: hw troubleshooting: disk failure for an-worker1087.

server is out of warranty

Wed, Apr 24, 3:01 PM · SRE, ops-eqiad, DC-Ops
Jclark-ctr claimed T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005.

Opened request with Dell
You have successfully submitted request SR189381173.

Wed, Apr 24, 2:45 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops
Jclark-ctr closed T359049: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet as Resolved.
Wed, Apr 24, 2:45 PM · SRE, ops-eqiad, DC-Ops, Cloud-VPS, cloud-services-team
Jclark-ctr closed T358763: hw move: GPU from stat1005 to stat1010 as Resolved.

Installed Gpu into stat1010

Wed, Apr 24, 2:02 PM · SRE, ops-eqiad, DC-Ops

Tue, Apr 23

Maintenance_bot added a project to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010: SRE.
Tue, Apr 23, 8:29 PM · SRE, serviceops, ops-eqiad, DC-Ops
RobH moved T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 from Backlog to Racking Tasks on the ops-eqiad board.
Tue, Apr 23, 7:31 PM · SRE, serviceops, ops-eqiad, DC-Ops
RobH added a parent task for T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010: Unknown Object (Task).
Tue, Apr 23, 7:31 PM · SRE, serviceops, ops-eqiad, DC-Ops
RobH created T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.
Tue, Apr 23, 7:30 PM · SRE, serviceops, ops-eqiad, DC-Ops
Andrew reassigned T348643: cloudcephosd1021-1034: hard drive sector errors increasing from Andrew to dcaro.
Tue, Apr 23, 7:29 PM · cloud-services-team (FY2023/2024-Q3-Q4), SRE, ops-eqiad, DC-Ops, Cloud-VPS
Maintenance_bot added a project to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010: SRE.
Tue, Apr 23, 7:29 PM · SRE, ops-codfw, serviceops, DC-Ops
RobH added a parent task for T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010: Unknown Object (Task).
Tue, Apr 23, 7:22 PM · SRE, ops-codfw, serviceops, DC-Ops
RobH renamed T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 from Q#:rack/setup/install kafka-main200[6789] & kafka-main2010 to Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.
Tue, Apr 23, 7:22 PM · SRE, ops-codfw, serviceops, DC-Ops
RobH moved T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 from Backlog to Racking Tasks on the ops-codfw board.
Tue, Apr 23, 7:22 PM · SRE, ops-codfw, serviceops, DC-Ops
RobH added a project to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010: ops-codfw.
Tue, Apr 23, 7:21 PM · SRE, ops-codfw, serviceops, DC-Ops
RobH renamed T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 from Q#:rack/setup/install X to Q#:rack/setup/install kafka-main200[6789] & kafka-main2010.
Tue, Apr 23, 7:21 PM · SRE, ops-codfw, serviceops, DC-Ops
RobH created T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.
Tue, Apr 23, 7:19 PM · SRE, ops-codfw, serviceops, DC-Ops
gerritbot added a comment to T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005.

Change #1023423 merged by Filippo Giunchedi:

[operations/puppet@production] trafficserver: move prometheus-eqiad to prometheus1006

https://gerrit.wikimedia.org/r/1023423

Tue, Apr 23, 12:59 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops
gerritbot added a project to T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005: Patch-For-Review.
Tue, Apr 23, 12:55 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops
gerritbot added a comment to T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005.

Change #1023423 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] trafficserver: move prometheus-eqiad to prometheus1006

https://gerrit.wikimedia.org/r/1023423

Tue, Apr 23, 12:55 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops

Mon, Apr 22

Jhancock.wm updated the task description for T362729: Q4:rack/setup/install cp70[01-16].
Mon, Apr 22, 5:23 PM · Traffic, ops-magru, DC-Ops
Jhancock.wm updated the task description for T362730: Q4:rack/setup/install magru misc servers.
Mon, Apr 22, 5:22 PM · Traffic, netops, ops-magru, DC-Ops, Infrastructure-Foundations
Jhancock.wm updated the task description for T362730: Q4:rack/setup/install magru misc servers.
Mon, Apr 22, 4:07 PM · Traffic, netops, ops-magru, DC-Ops, Infrastructure-Foundations
Jhancock.wm updated the task description for T362729: Q4:rack/setup/install cp70[01-16].
Mon, Apr 22, 4:06 PM · Traffic, ops-magru, DC-Ops

Fri, Apr 19

ssingh added a comment to T362730: Q4:rack/setup/install magru misc servers.

Thanks @cmooney, looks good! One small update to the above since we will most likely transpose these to hieradata/common/lvs/interfaces.yaml: 10.140.1.3/24 is private1-b4-magru like 10.140.0.2/24 is private1-b3-magru (and not just private-b4-magru).

Fri, Apr 19, 5:55 PM · Traffic, netops, ops-magru, DC-Ops, Infrastructure-Foundations
Maintenance_bot added a project to T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005: SRE.
Fri, Apr 19, 4:29 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops
colewhite updated the task description for T362990: hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005.
Fri, Apr 19, 3:40 PM · Patch-For-Review, SRE, SRE Observability, ops-eqiad, DC-Ops