jcrespo (Jaime Crespo)
Sr Database Administrator

Projects (12)

User Details

User Since
May 11 2015, 8:31 AM (467 w, 3 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Yesterday

jcrespo added a comment to T361087: backup1005 crashed.

In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.

Thu, Apr 25, 3:38 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo updated subscribers of T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al


Debian 12 (bookworm) amd64 (Wikimedia edition)

                                              boot: 
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...

Hmm. Not sure if we've seen this problem before. DHCP clearly worked as did the debian image download, but Linux failed to load for some reason.

@jcrespo the only difference was selecting bullseye rather than bookworm on the second attempt?

Thu, Apr 25, 3:29 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added a comment to T361087: backup1005 crashed.

It booted into bullseye.

Thu, Apr 25, 11:40 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added a comment to T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
Thu, Apr 25, 11:15 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Wed, Apr 24

jcrespo claimed T361087: backup1005 crashed.

Will reimage soon.

Wed, Apr 24, 4:51 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo awarded T363186: Cache mw-mcrouter service ClusterIP in apcu cache a Love token.
Wed, Apr 24, 12:02 PM · MediaWiki-Engineering, serviceops, Sustainability (Incident Followup)

Tue, Apr 23

jcrespo closed T349397: Migrate the matomo host to bookworm as Resolved.

Looking good now:

Tue, Apr 23, 6:11 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
jcrespo closed T349397: Migrate the matomo host to bookworm, a subtask of T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye, as Resolved.
Tue, Apr 23, 6:09 PM · Data-Platform-SRE, Epic
jcrespo added a comment to T349397: Migrate the matomo host to bookworm.

hi, backups of matomo database failed with:

Tue, Apr 23, 3:46 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
jcrespo reopened T349397: Migrate the matomo host to bookworm, a subtask of T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye, as Open.
Tue, Apr 23, 3:45 PM · Data-Platform-SRE, Epic
jcrespo reopened T349397: Migrate the matomo host to bookworm as "Open".
Tue, Apr 23, 3:45 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

update: on both eqiad and codfw we are generating dumps and snapshots in 10.6 for x1, s2, s6, s5, s3.

Tue, Apr 23, 11:20 AM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup

Thu, Apr 18

jcrespo updated the task description for T362509: Setup new dbprov hosts and decommission the old ones.
Thu, Apr 18, 9:08 AM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo added a comment to T362421: magru network setup.

Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056), it is pending. Can it be restarted or should it be kept with the old config for a while, and it should be acked?

Thu, Apr 18, 9:07 AM · Patch-For-Review, netops, SRE, Infrastructure-Foundations

Wed, Apr 17

jcrespo created T362766: 2024-04-17 mw-on-k8s eqiad outage.
Wed, Apr 17, 11:22 AM · serviceops, Sustainability (Incident Followup)
jcrespo created P60740 (An Untitled Masterwork).
Wed, Apr 17, 8:39 AM
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Wed, Apr 17, 8:00 AM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Wed, Apr 17, 7:55 AM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup

Tue, Apr 16

jcrespo updated the task description for T358741: Decommission db2096-db2120.
Tue, Apr 16, 1:54 PM · Patch-For-Review, DBA
jcrespo added a comment to T358936: Kubernetes apiserver probe failures on restart.

Hi, today we had another occurrence of this. We didn't consider it a full-blown incident because there was no (or almost no) direct impact on users while the service was down. After kubemaster1002 was detected as down during its automatic restart (due to a puppet change), it took a long time to come back, with lots of incoming network connections stuck or failing and CPU usage maxed out. https://grafana.wikimedia.org/goto/KbF5zPaIg?orgId=1

Tue, Apr 16, 11:04 AM · Prod-Kubernetes, serviceops, SRE
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Tue, Apr 16, 8:15 AM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup
jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

@Marostegui Update: backups for x1, s2, s6, s5 and s3 are currently generating dumps and snapshots with MariaDB 10.6 on codfw. Doing s5 and s3 on eqiad next. You may see a lot of 10.4 servers, but they are idle and kept only as a fallback; they will eventually be upgraded or discarded.

Tue, Apr 16, 8:11 AM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup
jcrespo closed T362611: Alert in need of triage: SystemdUnitFailed (instance db2200:9100) as Resolved.
[09:44] <jinxer-wm> (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2200:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
Tue, Apr 16, 7:59 AM · DBA, sre-alert-triage

Mon, Apr 15

jcrespo added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Thank you a lot, to everybody!

Mon, Apr 15, 8:34 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Mon, Apr 15, 5:54 PM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup
jcrespo placed T362311: Decommission db2101 (was: db2101 crashed) up for grabs.

CC @ABran-WMF in case I missed something.

Mon, Apr 15, 5:52 PM · SRE, ops-codfw, decommission-hardware, DC-Ops, Patch-For-Review, database-backups, Data-Persistence-Backup, DBA
jcrespo updated the task description for T362311: Decommission db2101 (was: db2101 crashed).
Mon, Apr 15, 5:51 PM · SRE, ops-codfw, decommission-hardware, DC-Ops, Patch-For-Review, database-backups, Data-Persistence-Backup, DBA
jcrespo updated the task description for T358741: Decommission db2096-db2120.
Mon, Apr 15, 9:26 AM · Patch-For-Review, DBA
jcrespo added a subtask for T358741: Decommission db2096-db2120: T362311: Decommission db2101 (was: db2101 crashed).
Mon, Apr 15, 9:25 AM · Patch-For-Review, DBA
jcrespo added a parent task for T362311: Decommission db2101 (was: db2101 crashed): T358741: Decommission db2096-db2120.
Mon, Apr 15, 9:25 AM · SRE, ops-codfw, decommission-hardware, DC-Ops, Patch-For-Review, database-backups, Data-Persistence-Backup, DBA
jcrespo renamed T362311: Decommission db2101 (was: db2101 crashed) from db2101 crashed to Decommission db2101 (was: db2101 crashed).
Mon, Apr 15, 9:21 AM · SRE, ops-codfw, decommission-hardware, DC-Ops, Patch-For-Review, database-backups, Data-Persistence-Backup, DBA
jcrespo created T362509: Setup new dbprov hosts and decommission the old ones.
Mon, Apr 15, 8:03 AM · Patch-For-Review, database-backups, Data-Persistence-Backup

Fri, Apr 12

jcrespo added a comment to T355422: Productionize db2196-db2220.

I think we can resolve this and track that at T358741, as long as everybody is aware.

Fri, Apr 12, 10:51 AM · database-backups, Patch-For-Review, DBA
jcrespo added a project to T355422: Productionize db2196-db2220: database-backups.

This is now done, although it depends on the definition of productionize: some of the backup sources have exactly the same data and config as the original ones but have not yet taken over the service, and some backups still use the old hosts.

Fri, Apr 12, 10:35 AM · database-backups, Patch-For-Review, DBA
jcrespo updated the task description for T355422: Productionize db2196-db2220.
Fri, Apr 12, 10:33 AM · database-backups, Patch-For-Review, DBA
jcrespo reopened T355353: Q3:rack/setup/install dbprov100[56] as "Open".

hi, we cannot ssh into dbprov1006.eqiad.wmnet

Fri, Apr 12, 7:25 AM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops

Thu, Apr 11

jcrespo added a comment to T360149: Create a database for Striker test instance.

No need. I just wanted to warn the DBAs, although you may find it interesting, as the last issue was with wikireplicas. No need to change anything at the moment (actual data and name) except the current grants providing access.

Thu, Apr 11, 4:06 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker
jcrespo reassigned T360149: Create a database for Striker test instance from jcrespo to ABran-WMF.

Please see my last comment. Other than that, my work is done.

Thu, Apr 11, 3:52 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker
jcrespo moved T360149: Create a database for Striker test instance from Done to Refine on the DBA board.
Thu, Apr 11, 3:51 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker
jcrespo added a comment to T360149: Create a database for Striker test instance.

There are some issues with the already provided user grants. I don't think we should create databases with underscores or percentage signs in the name, but if we have to, let's use the proper escaping on grants to avoid things like this:

Thu, Apr 11, 3:42 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker
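
To illustrate the escaping point in the comment above (this is a hedged sketch, not the grants from the task): in MySQL and MariaDB, `_` and `%` in the database part of a database-level GRANT act as wildcards unless they are backslash-escaped, so an unescaped name can match more databases than intended. The helper names, the database `example_db`, and the account `example_user`@`localhost` below are all hypothetical.

```python
def escape_grant_db_name(name: str) -> str:
    """Backslash-escape the GRANT wildcards '_' and '%' in a database name."""
    return name.replace("_", "\\_").replace("%", "\\%")


def grant_statement(db: str, user: str, host: str, privileges: str = "SELECT") -> str:
    """Build a database-level GRANT that matches exactly one database.

    Illustrative only: user and host are interpolated as-is, so this must
    not be fed untrusted input.
    """
    # Escape wildcards, then double any backticks for identifier quoting.
    pattern = escape_grant_db_name(db).replace("`", "``")
    return f"GRANT {privileges} ON `{pattern}`.* TO '{user}'@'{host}';"


if __name__ == "__main__":
    # Hypothetical database and account names.
    print(grant_statement("example_db", "example_user", "localhost"))
```

Without the escaping, a grant on `example_db` would also apply to any database whose name differs only in that one character position, such as `exampleXdb`.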
jcrespo updated the task description for T355422: Productionize db2196-db2220.
Thu, Apr 11, 9:53 AM · database-backups, Patch-For-Review, DBA
jcrespo created T362311: Decommission db2101 (was: db2101 crashed).
Thu, Apr 11, 9:44 AM · SRE, ops-codfw, decommission-hardware, DC-Ops, Patch-For-Review, database-backups, Data-Persistence-Backup, DBA

Wed, Apr 10

jcrespo added a comment to T355422: Productionize db2196-db2220.

Thanks, db2199 and db2200 are almost finished (currently catching up and about to add them to tendril and later reenable notifications).

Wed, Apr 10, 3:40 PM · database-backups, Patch-For-Review, DBA
jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

Thanks, that informs me that either 2 or 3 should be upgraded by the end of the quarter, then. Thank you a lot!

Wed, Apr 10, 3:37 PM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup

Tue, Apr 9

jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

@jcrespo s2 fully done. Next is s3.

Tue, Apr 9, 5:12 PM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Tue, Apr 9, 12:01 PM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup
jcrespo updated the task description for T355422: Productionize db2196-db2220.
Tue, Apr 9, 11:51 AM · database-backups, Patch-For-Review, DBA

Mon, Apr 8

jcrespo awarded T352647: Move Cassandra clusters to PKI a Love token.
Mon, Apr 8, 2:07 PM · Patch-For-Review, Data-Persistence, Cassandra

Fri, Apr 5

jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

Cloning speed for 133 GB / 28K objects:

# rclone copy -P backup2007:mediabackups/commonswiki/fff backup2011:mediabackups/commonswiki/
Transferred:      133.243 GiB / 133.243 GiB, 100%, 125.044 MiB/s, ETA 0s
Transferred:        28850 / 28850, 100%
Elapsed time:     14m24.4s
Fri, Apr 5, 6:17 AM · media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
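
As a rough arithmetic check on the transfer figures above, the sketch below derives whole-run averages from the reported totals only (133.243 GiB, 28850 objects, 14m24.4s). The function name is hypothetical, and the result is a plain total-divided-by-elapsed average, which need not equal the rate shown on the rclone progress line.

```python
def transfer_stats(total_gib: float, objects: int, elapsed_s: float):
    """Return (MiB/s, objects/s, MiB per object) averaged over the whole run."""
    total_mib = total_gib * 1024  # 1 GiB = 1024 MiB
    return total_mib / elapsed_s, objects / elapsed_s, total_mib / objects


if __name__ == "__main__":
    # Totals copied from the rclone summary; elapsed time is 14m24.4s.
    mib_s, obj_s, mib_per_obj = transfer_stats(133.243, 28850, 14 * 60 + 24.4)
    print(f"{mib_s:.1f} MiB/s, {obj_s:.1f} objects/s, {mib_per_obj:.2f} MiB/object")
```

On these totals the run averages roughly 158 MiB/s and about 33 objects per second, with a mean object size a little under 5 MiB.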
jcrespo closed T361718: Resharded files fail to be deleted/recovered as Resolved.
Fri, Apr 5, 5:20 AM · media-backups, Data-Persistence-Backup

Thu, Apr 4

jcrespo updated subscribers of T361851: db2214 crashed.
Thu, Apr 4, 4:30 PM · SRE, ops-codfw, Patch-For-Review, DBA
jcrespo added a comment to T361851: db2214 crashed.

Sadly I was unable to log in using http. This is the generic error I got on command line:

--------------------------------------------------------------------------------                                              
SeqNumber       = 11317                                                                                                       
Message ID      = CTL129                                                                                                      
Category        = Storage                                                                                                     
AgentID         = iDRAC                                                                                                       
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 15:33:05                                                                                         
Message         = The boot media of the Controller RAID Controller in SL 3 is Disk.Virtual.239:RAID.SL.3-1.                   
Message Arg   1 = RAID Controller in SL 3                                                                                     
Message Arg   2 = Disk.Virtual.239:RAID.SL.3-1                                                                                
FQDD            = RAID.SL.3-1                                                                                                 
--------------------------------------------------------------------------------                                              
SeqNumber       = 11316                                                                                                       
Message ID      = SYS1003                                                                                                     
Category        = Audit                                                                                                       
AgentID         = DE                                                                                                          
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:29:32                                                                                         
Message         = System CPU Resetting.                                                                                       
FQDD            = iDRAC.Embedded.1#HostPowerCtrl                                                                              
--------------------------------------------------------------------------------                                              
SeqNumber       = 11315                                                                                                       
Message ID      = SYS1000                                                                                                     
Category        = Audit                                                                                                       
AgentID         = DE                                                                                                          
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:29:06                                                                                         
Message         = System is turning on.                                                                                       
FQDD            = iDRAC.Embedded.1#HostPowerCtrl                                                                              
--------------------------------------------------------------------------------                                              
SeqNumber       = 11314                                                                                                       
Message ID      = SWC5019                                                                                                     
Category        = System                                                                                                      
AgentID         = DE                                                                                                          
Severity        = Warning                                                                                                     
Timestamp       = 2024-04-04 16:29:00                                                                                         
Message         = Unable to authenticate the BIOS image file because:  Internal Errors: Bypassing bios verification and booting the host.                                                                                                                   
Message Arg   1 =  Internal Errors: Bypassing bios verification and booting the host                                          
--------------------------------------------------------------------------------                                              
SeqNumber       = 11313                                                                                                       
Message ID      = RAC0701                                                                                                     
Category        = Audit                                                                                                       
AgentID         = RACLOG                                                                                                      
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:27:58                                                                                         
Message         = Requested system powerup.                                                                                   
FQDD            = iDRAC.Embedded.1                                                                                            
--------------------------------------------------------------------------------                                              
SeqNumber       = 11312                                                                                                       
Message ID      = SYS1001                                                                                                     
Category        = Audit                                                                                                       
AgentID         = DE                                                                                                          
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:26:25                                                                                         
Message         = System is turning off.                                                                                      
FQDD            = iDRAC.Embedded.1#HostPowerCtrl                                                                              
--------------------------------------------------------------------------------                                              
SeqNumber       = 11311                                                                                                       
Message ID      = SYS1003                                                                                                     
Category        = Audit                                                                                                       
AgentID         = DE                                                                                                          
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:26:25                                                                                         
Message         = System CPU Resetting.                                                                                       
FQDD            = iDRAC.Embedded.1#HostPowerCtrl                                                                              
--------------------------------------------------------------------------------                                              
SeqNumber       = 11310                                                                                                       
Message ID      = NIC100                                                                                                      
Category        = System                                                                                                      
AgentID         = iDRAC                                                                                                       
Severity        = Warning                                                                                                     
Timestamp       = 2024-04-04 16:26:25                                                                                         
Message         = The Embedded NIC 1 Port 1 network link is down.                                                             
Message Arg   1 = Embedded NIC 1                                                                                              
Message Arg   2 = 1                                                                                                           
FQDD            = NIC.Embedded.1-1-1
Thu, Apr 4, 4:22 PM · SRE, ops-codfw, Patch-For-Review, DBA
jcrespo added a comment to T361851: db2214 crashed.

Looked like a host server crash.

Thu, Apr 4, 3:39 PM · SRE, ops-codfw, Patch-For-Review, DBA
jcrespo renamed T361851: db2214 crashed from db2214 is down to db2214 crashed.
Thu, Apr 4, 3:38 PM · SRE, ops-codfw, Patch-For-Review, DBA
jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

Thanks for keeping me up to date, will rearrange the backup sources and dbprov accordingly. s6 eqiad and s5 next?

Thu, Apr 4, 8:57 AM · Patch-For-Review, Data-Persistence, Data-Persistence-Backup
ABran-WMF awarded T361718: Resharded files fail to be deleted/recovered a Party Time token.
Thu, Apr 4, 6:25 AM · media-backups, Data-Persistence-Backup

Wed, Apr 3

jcrespo added a comment to T355422: Productionize db2196-db2220.

I've started to provision db2198 now due to T361037.

Wed, Apr 3, 4:54 PM · database-backups, Patch-For-Review, DBA
jcrespo added a comment to T361718: Resharded files fail to be deleted/recovered.

I tested the above patch and it solved the issue:

Wed, Apr 3, 4:39 PM · media-backups, Data-Persistence-Backup
jcrespo claimed T361718: Resharded files fail to be deleted/recovered.
Wed, Apr 3, 4:25 PM · media-backups, Data-Persistence-Backup
jcrespo triaged T361718: Resharded files fail to be deleted/recovered as High priority.
Wed, Apr 3, 4:25 PM · media-backups, Data-Persistence-Backup
jcrespo created T361718: Resharded files fail to be deleted/recovered.
Wed, Apr 3, 4:24 PM · media-backups, Data-Persistence-Backup
jcrespo updated the task description for T361706: 2024-04-03 calico/typha down.
Wed, Apr 3, 2:07 PM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
jcrespo updated the task description for T361706: 2024-04-03 calico/typha down.
Wed, Apr 3, 2:06 PM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
jcrespo closed T361705: ProbeDown - miscweb1003 as Resolved.

This was due to https://www.wikimediastatus.net/incidents/7qq1gwnw71jy Services back up. A new ticket will be created for the incident.

Wed, Apr 3, 1:51 PM · collaboration-services

Tue, Apr 2

jcrespo added a comment to T358741: Decommission db2096-db2120.

db2100 can be scheduled in advance due to T361037.

Tue, Apr 2, 12:28 PM · Patch-For-Review, DBA
jcrespo closed T361037: db2100 crashed (memory error) as Declined.

Yes, we have redundancy for the backups and this will actually simplify things. I will take care of setting up the new hosts as scheduled for this quarter. Thank you.

Tue, Apr 2, 11:34 AM · SRE, ops-codfw, DC-Ops, Patch-For-Review, Data-Persistence-Backup, database-backups, DBA

Wed, Mar 27

jcrespo added a comment to T334069: Evaluate and decide the future of MinIO for media backups given the upgrade requirements and increase the available storage space.

The new shard is looking great:

Wed, Mar 27, 7:05 PM · Patch-For-Review, Data-Persistence-Backup, media-backups
jcrespo added a comment to T353891: https://lists.wikimedia.org is often slow to load.

Thank you Reedy, I trust you, it was just that the title wasn't descriptive enough (exact URL, logged in/logged out, etc.). The 500 is indeed a symptom of the same issue (HTTP timeouts from varnish). Now I have more data to work with :-D. For example, the first one was from hyperkitty, not postorius, so the title was misleading to me.

Wed, Mar 27, 6:46 PM · Upstream, SRE, Performance Issue, Wikimedia-Mailing-lists
jcrespo triaged T361133: replication failure on db2115 and db2215 as High priority.
Wed, Mar 27, 6:29 PM · DBA
jcrespo added a comment to T361133: replication failure on db2115 and db2215.

I will stop replication on both db2115 and db2215 to mitigate the issue on the primary and keep it from extending to the other hosts. That will prevent overwhelming the primary, as the other replicas look OK so far. I think this will require a restart of the primary to get back to a healthy state, but I am not touching it for now.

Wed, Mar 27, 6:26 PM · DBA
jcrespo added a comment to T361133: replication failure on db2115 and db2215.

I found something weird with db2196: the slave host table is full of duplicate entries, so I am quite sure that is the problem, not the replicas. Something is weird with the primary, which is killing the replica threads:

Wed, Mar 27, 6:21 PM · DBA
jcrespo added a comment to T361133: replication failure on db2115 and db2215.

It looks like a transient network error or something else causing a connection error (TLS?). It seems to be working now, did you do something?

Wed, Mar 27, 6:07 PM · DBA
jcrespo updated the task description for T361087: backup1005 crashed.
Wed, Mar 27, 10:15 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added projects to T361087: backup1005 crashed: DC-Ops, ops-eqiad.

It seems the RAID controller has gone haywire, as there is no bootable medium, and it is stuck in an endless network boot. The RAID controller has been mapped out. Could you have a look and request a repair, if still under warranty? The host can be set offline at any point, but we would like to keep the data on its disks when possible.

Wed, Mar 27, 10:14 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo updated the task description for T361087: backup1005 crashed.
Wed, Mar 27, 10:11 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo created T361087: backup1005 crashed.
Wed, Mar 27, 10:06 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo moved T360907: Can we please add our vendor to Google Postmaster Tools from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Wed, Mar 27, 9:02 AM · SRE-Access-Requests, Fundraising-Backlog
jcrespo updated subscribers of T360907: Can we please add our vendor to Google Postmaster Tools.

@DBu-WMF Hi, we are discussing how to proceed, as handling postmaster access is a new process for us.

Wed, Mar 27, 9:01 AM · SRE-Access-Requests, Fundraising-Backlog
jcrespo moved T361046: Requesting access to analytics-privatedata-users for bblack from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Pending approval from Data-Engineering's list of people that can approve that access: @odimitrijevic @Milimetric @WDoranWMF or @Ahoelzl.

Wed, Mar 27, 8:50 AM · Patch-For-Review, SRE, SRE-Access-Requests
jcrespo merged T360641: Requesting access to analytics-privatedata-users for mpham into T270438: LDAP access to the wmf group for Mike Pham.
Wed, Mar 27, 8:43 AM · LDAP-Access-Requests, SRE
jcrespo merged task T360641: Requesting access to analytics-privatedata-users for mpham into T270438: LDAP access to the wmf group for Mike Pham.
Wed, Mar 27, 8:42 AM · Patch-For-Review, SRE, SRE-Access-Requests

Mar 26 2024

jcrespo renamed T361037: db2100 crashed (memory error) from db2100 crashed to db2100 crashed (memory error).
Mar 26 2024, 5:49 PM · SRE, ops-codfw, DC-Ops, Patch-For-Review, Data-Persistence-Backup, database-backups, DBA
jcrespo added projects to T361037: db2100 crashed (memory error): DC-Ops, ops-codfw.

DC Ops, the host crashed and 3 memory banks are mapped out. Can you evaluate the host and either ask for in-warranty replacements or any other alternative?

Mar 26 2024, 5:49 PM · SRE, ops-codfw, DC-Ops, Patch-For-Review, Data-Persistence-Backup, database-backups, DBA
jcrespo updated the task description for T361037: db2100 crashed (memory error).
Mar 26 2024, 5:35 PM · SRE, ops-codfw, DC-Ops, Patch-For-Review, Data-Persistence-Backup, database-backups, DBA
jcrespo added a comment to T361037: db2100 crashed (memory error).

Not the first time it crashed (it rebooted): T283995

Mar 26 2024, 5:27 PM · SRE, ops-codfw, DC-Ops, Patch-For-Review, Data-Persistence-Backup, database-backups, DBA
jcrespo closed T358922: Requesting access to analytics-privatedata-users for GeorgeMikesell as Resolved.

@GMikesell-WMF (or @cchen on his behalf): access has been merged. It may take ~30 minutes to be fully deployed on all servers; after that, please check that you can access the production hosts and the given datasets. A Kerberos principal has also been created, and instructions have been automatically sent by email.

Mar 26 2024, 5:23 PM · Patch-For-Review, SRE, SRE-Access-Requests
jcrespo created T361037: db2100 crashed (memory error).
Mar 26 2024, 5:18 PM · SRE, ops-codfw, DC-Ops, Patch-For-Review, Data-Persistence-Backup, database-backups, DBA
jcrespo moved T358922: Requesting access to analytics-privatedata-users for GeorgeMikesell from Patch in Review to Ready To Go on the SRE-Access-Requests board.
Mar 26 2024, 5:10 PM · Patch-For-Review, SRE, SRE-Access-Requests
jcrespo moved T360641: Requesting access to analytics-privatedata-users for mpham from Patch in Review to Awaiting User Input on the SRE-Access-Requests board.
Mar 26 2024, 5:10 PM · Patch-For-Review, SRE, SRE-Access-Requests
jcrespo added a comment to T360641: Requesting access to analytics-privatedata-users for mpham.

Apologies, but this access was already provided back in 2020 at T270438 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/650298), and it is still active. I only realized when I got a duplicate key error.

Mar 26 2024, 4:56 PM · Patch-For-Review, SRE, SRE-Access-Requests
jcrespo claimed T360641: Requesting access to analytics-privatedata-users for mpham.
Mar 26 2024, 3:58 PM · Patch-For-Review, SRE, SRE-Access-Requests
jcrespo moved T360641: Requesting access to analytics-privatedata-users for mpham from Manager/NDA Approval/Confirmation to Patch in Review on the SRE-Access-Requests board.
Mar 26 2024, 3:58 PM · Patch-For-Review, SRE, SRE-Access-Requests
jcrespo updated the task description for T360641: Requesting access to analytics-privatedata-users for mpham.
Mar 26 2024, 3:58 PM · Patch-For-Review, SRE, SRE-Access-Requests
jcrespo added a comment to T353891: https://lists.wikimedia.org is often slow to load.

@Reedy what did you see as slow back then? Right now doing:

Mar 26 2024, 11:04 AM · Upstream, SRE, Performance Issue, Wikimedia-Mailing-lists
jcrespo removed a project from T360356: Request access to servers Dcops group: SRE-Access-Requests.

I am going to remove the SRE-Access-Requests because, while it is indeed an access request, it is not immediately actionable by people on clinic duty, but has to be discussed with the owners of the workflow (IF) + the rest of the SREs first on how exactly to provide it.

Mar 26 2024, 10:31 AM · SRE, Infrastructure-Foundations
jcrespo removed a project from T18799: Mail sent out by MediaWiki should have the Auto-Submitted header set to 'auto-generated' (RFC 3834): SRE.
Mar 26 2024, 10:25 AM · MediaWiki-Email
jcrespo added a project to T360778: Move maps/karthoterian to PKI/cfssl: Infrastructure-Foundations.
Mar 26 2024, 10:23 AM · serviceops, Maps, SRE
jcrespo added a project to T360636: Phase out cergen for ServiceOps services: serviceops.
Mar 26 2024, 10:23 AM · Patch-For-Review, serviceops, Epic, SRE
jcrespo added projects to T360902: Consolidation and tracking of automated email alerts improvements across services: Infrastructure-Foundations, observability.
Mar 26 2024, 10:22 AM

Mar 25 2024

jcrespo moved T358922: Requesting access to analytics-privatedata-users for GeorgeMikesell from Awaiting User Input to Patch in Review on the SRE-Access-Requests board.
Mar 25 2024, 6:03 PM · Patch-For-Review, SRE, SRE-Access-Requests