User Details
- User Since: May 11 2015, 8:31 AM (467 w, 3 d)
- Availability: Available
- IRC Nick: jynus
- LDAP User: Jcrespo
- MediaWiki User: JCrespo (WMF)
Yesterday
In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.
It booted into bullseye.
Booting failed (PXE):
PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
Wed, Apr 24
Will reimage soon.
Tue, Apr 23
Looking good now:
Hi, backups of the matomo database failed with:
Update: on both eqiad and codfw we are generating dumps and snapshots in 10.6 for x1, s2, s6, s5 and s3.
Thu, Apr 18
Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056) a restart is pending. Can it be restarted, or should it be kept on the old config for a while and the alert acked?
Wed, Apr 17
Tue, Apr 16
Hi, today we had another occurrence of this. We didn't consider it a full-blown incident because there was no (or almost no) direct impact on users while the service was down. After kubemaster1002 was detected as down during its automatic restart (due to a puppet change), it took a long time to come back, with lots of incoming network connections stuck/failing and CPU usage maxed out. https://grafana.wikimedia.org/goto/KbF5zPaIg?orgId=1
@Marostegui Update: backups for x1, s2, s6, s5 and s3 are generating dumps and snapshots with MariaDB 10.6, currently on codfw. Doing s5 and s3 on eqiad next. You may still see a lot of 10.4 servers, but they are idle and only kept just in case; they are not active and will eventually be upgraded or discarded.
[09:44] <jinxer-wm> (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2200:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
Mon, Apr 15
Thank you a lot, everybody!
CC @ABran-WMF in case I missed something.
Fri, Apr 12
I think we can resolve this and track that at T358741, as long as everybody is aware.
This is now done, although it depends on the definition of productionize: some of the backup sources have the exact same data and config as the original ones but have not yet taken over the service, and some backups still use the old hosts.
Hi, we cannot ssh into dbprov1006.eqiad.wmnet
Thu, Apr 11
No need. I just wanted to warn the DBAs, although you may find it interesting, as the last issue was with wikireplicas. Nothing needs to change at the moment (actual data and name), only the current grants providing access.
Please see my last comment. Other than that, my work is done.
There are some issues with the already provided user grants. I don't think we should create databases with underscores or percent signs in the name, but if we have to, let's use proper escaping on the grants to avoid things like this:
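(The original paste is not reproduced here.) As an illustration, a minimal sketch of the escaping in question, assuming a hypothetical database literally named my_db. In MySQL/MariaDB grants, an unescaped `_` or `%` in the database name acts as a pattern wildcard, so the grant silently applies to more databases than intended:

    -- Unescaped: `_` is a single-character wildcard here, so this
    -- grant also applies to my1db, myXdb, etc.
    GRANT SELECT ON `my_db`.* TO 'someuser'@'%';

    -- Escaped: applies only to the database literally named my_db.
    GRANT SELECT ON `my\_db`.* TO 'someuser'@'%';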
Wed, Apr 10
Thanks, db2199 and db2200 are almost finished (currently catching up; about to add them to tendril and later re-enable notifications).
Thanks, that tells me that either 2 or 3 should be upgraded by the end of the quarter, then. Thank you a lot!
Tue, Apr 9
Mon, Apr 8
Fri, Apr 5
Cloning speed for 133 GB / 28K objects:
# rclone copy -P backup2007:mediabackups/commonswiki/fff backup2011:mediabackups/commonswiki/
Transferred:    133.243 GiB / 133.243 GiB, 100%, 125.044 MiB/s, ETA 0s
Transferred:    28850 / 28850, 100%
Elapsed time:   14m24.4s
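That works out to roughly 158 MiB/s sustained (133.243 GiB over 14m24.4s) and about 33 objects per second; the 125.044 MiB/s figure is presumably rclone's recent-rate readout at the moment of completion rather than the overall average.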
Thu, Apr 4
Sadly I was unable to log in over HTTP. This is the generic error I got on the command line:
--------------------------------------------------------------------------------
SeqNumber = 11317
Message ID = CTL129
Category = Storage
AgentID = iDRAC
Severity = Information
Timestamp = 2024-04-04 15:33:05
Message = The boot media of the Controller RAID Controller in SL 3 is Disk.Virtual.239:RAID.SL.3-1.
Message Arg 1 = RAID Controller in SL 3
Message Arg 2 = Disk.Virtual.239:RAID.SL.3-1
FQDD = RAID.SL.3-1
--------------------------------------------------------------------------------
SeqNumber = 11316
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2024-04-04 16:29:32
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 11315
Message ID = SYS1000
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2024-04-04 16:29:06
Message = System is turning on.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 11314
Message ID = SWC5019
Category = System
AgentID = DE
Severity = Warning
Timestamp = 2024-04-04 16:29:00
Message = Unable to authenticate the BIOS image file because: Internal Errors: Bypassing bios verification and booting the host.
Message Arg 1 = Internal Errors: Bypassing bios verification and booting the host
--------------------------------------------------------------------------------
SeqNumber = 11313
Message ID = RAC0701
Category = Audit
AgentID = RACLOG
Severity = Information
Timestamp = 2024-04-04 16:27:58
Message = Requested system powerup.
FQDD = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber = 11312
Message ID = SYS1001
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2024-04-04 16:26:25
Message = System is turning off.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 11311
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2024-04-04 16:26:25
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 11310
Message ID = NIC100
Category = System
AgentID = iDRAC
Severity = Warning
Timestamp = 2024-04-04 16:26:25
Message = The Embedded NIC 1 Port 1 network link is down.
Message Arg 1 = Embedded NIC 1
Message Arg 2 = 1
FQDD = NIC.Embedded.1-1-1
Looked like a host server crash.
Thanks for keeping me up to date; I will rearrange the backup sources and dbprov accordingly. s6 eqiad and s5 next?
Wed, Apr 3
I've started to provision db2198 now due to T361037.
I tested the above patch and it solved the issue:
This was due to https://www.wikimediastatus.net/incidents/7qq1gwnw71jy. Services are back up. A new ticket will be created for the incident.
Tue, Apr 2
db2100 can be scheduled in advance due to T361037.
Yes, we have redundancy for the backups and this will actually simplify things. I will take care of setting up the new hosts as scheduled for this quarter. Thank you.
Wed, Mar 27
The new shard is looking great:
Thank you Reedy, I trust you; it was just that the title wasn't descriptive enough (exact URL, logged in/logged out, etc.). The 500 is indeed a symptom of the same issue (HTTP timeouts from varnish). Now I have more data to work with :-D. For example, the first one was from HyperKitty, not Postorius, so the title was misleading to me.
I will stop replication on both db2115 and db2215 to mitigate the issue on the primary and keep it from extending to the other hosts. That will prevent overwhelming the primary, as the other replicas look ok so far. I think this will require a restart of the primary to get back to a healthy state, but I am not touching it for now.
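For reference, a minimal sketch of that mitigation on each affected replica (hostnames as above; these are the standard MariaDB statements, not the exact commands run):

    -- On db2115 and db2215: stop both replication threads so the
    -- replicas stop reconnecting to the struggling primary.
    STOP SLAVE;
    -- Confirm Slave_IO_Running and Slave_SQL_Running both show 'No':
    SHOW SLAVE STATUS\G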
I found something weird with db2196: the slave host table is full of duplicate entries, so I am quite sure the problem is there, not in the replicas. Something is weird with the primary, which is killing the replica threads:
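(The original output is omitted.) As a hedged illustration, the kind of checks this refers to on a MariaDB primary:

    -- List the replicas registered with the primary; duplicate rows
    -- here point at the primary's bookkeeping, not the replicas.
    SHOW SLAVE HOSTS;
    -- Inspect the binlog dump threads serving the replicas:
    SELECT ID, USER, HOST, COMMAND, TIME, STATE
      FROM information_schema.PROCESSLIST
     WHERE COMMAND LIKE 'Binlog Dump%';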
It looks like a transient network error or something else causing a connection error (TLS?). It seems to be working now, did you do something?
It seems the RAID controller has gone haywire: there is no bootable medium, and the host is stuck in an endless network boot. The RAID controller has been mapped out. Could you have a look and request a repair, if it is still under warranty? The host can be set offline at any point, but we would like to keep the data on its disks if possible.
@DBu-WMF Hi, we are discussing how to proceed, as handling postmaster access is a new process for us.
Pending approval from Data-Engineering's list of people that can approve that access: @odimitrijevic @Milimetric @WDoranWMF or @Ahoelzl.
Mar 26 2024
DC Ops, the host crashed and 3 memory banks are mapped out. Can you evaluate the host and either request in-warranty replacements or propose another alternative?
Not the first time it crashed (it rebooted): T283995
@GMikesell-WMF (or @cchen on his behalf): access has been merged; it may take ~30 minutes to be fully deployed on all servers. After that, please check you can access the production hosts and the given datasets. A Kerberos principal has also been created, and instructions have been sent automatically by email.
Apologies, but this access was already provided back in 2020 at T270438 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/650298), and it is still active. I only realized when I got a duplicate key error.
@Reedy what did you see as slow back then? Right now doing:
I am going to remove SRE-Access-Requests because, while this is indeed an access request, it is not immediately actionable by the people on clinic duty; it first has to be discussed with the owners of the workflow (IF) and the rest of the SREs to work out how exactly to provide it.