User Details
- User Since: Oct 3 2014, 8:40 AM (499 w, 1 h)
- Roles: Administrator
- Availability: Available
- IRC Nick: akosiaris
- LDAP User: Alexandros Kosiaris
- MediaWiki User: AKosiaris (WMF)
Yesterday
Adding @subbu for their information.
Wed, Apr 24
Since mesh.configuration 1.7, envoy on WikiKube and other kubernetes clusters listens on both IPv6 and IPv4 for the TLS terminator and the service mesh listeners. Charts are slowly being updated. On the kubernetes side, once all charts are updated, we'll be done.
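As a sanity check (my own sketch, not from the task), the dual-stack listeners can be confirmed on a node with ss; the port layout described in the comments is a placeholder here:

```
# List envoy's listening sockets on a node. Expect both an IPv4 (0.0.0.0:PORT)
# and an IPv6 ([::]:PORT) entry per listener, or a single [::] socket with
# v6only disabled. PORT stands in for the TLS terminator / mesh listener ports.
sudo ss -tlnp | grep envoy
```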
I've also just run kubelet 1.23 in standalone mode talking to containerd, and indeed processes in containers run with the cri-containerd.apparmor.d apparmor profile.
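A quick way to verify this (a sketch; the container ID is a placeholder, and crictl is assumed to be pointed at containerd's CRI socket):

```
# From the node, read the effective AppArmor profile of a container's PID 1.
# With containerd's default profile applied, this prints:
#   cri-containerd.apparmor.d (enforce)
crictl exec <container-id> cat /proc/1/attr/current
```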
Adding for bookworm.
Adding as info since it was requested in T362408#9712356
Tue, Apr 23
I am resolving; hopefully we won't see a recurrence.
I've just uncordoned it; it should receive MediaWiki payloads in the next deployment. I've also checked that it's again a scap target for the kubernetes-workers group.
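For reference, the uncordon amounts to the following; the node name is hypothetical:

```
# Make the node schedulable again
kubectl uncordon wikikube-worker1001.eqiad.wmnet
# Verify: STATUS should show "Ready" without "SchedulingDisabled"
kubectl get node wikikube-worker1001.eqiad.wmnet
```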
Thu, Apr 18
nodejs20 isn't even in trixie/sid right now (https://packages.debian.org/trixie/nodejs, https://packages.debian.org/sid/nodejs), only in experimental.
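A quick way to cross-check which suites carry which version (rmadison ships in Debian's devscripts package and queries the archive):

```
# Shows the nodejs version in every suite (stable, testing, sid, experimental);
# per the comment above, a 20.x version should show up only under experimental.
rmadison nodejs
```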
Commenting here as well at the request of @Ottomata in T249745#9725953
Wed, Apr 17
- We have staging base images and staging service images updated daily based on what is in Debian and apt.wikimedia.org
Mon, Apr 15
The immediate issue blocking the train has been resolved and new images have been pushed; hence, lowering to High. There's still a tail of images being rebuilt and it's going to take a while longer, but this is no longer a UBN.
Fri, Apr 12
Thanks for tackling this!
Thu, Apr 11
I don't think SRE has ever administered Google Postmaster Tools at all. In fact, a quick cross-check within the team shows almost complete unfamiliarity with the product, although we'll ask around internally a bit more. May I suggest reaching out to ITS too?
Tue, Apr 9
Mon, Apr 8
I'll finish parsoid and testreduce in T359387.
Thu, Apr 4
In T361483, I've been poking into selectively killing parts of changeprop that are no longer used. I am still in the "hopefully easy pickings" phase, attacking things we KNOW aren't used any more. I am now targeting the removal of the changeprop functionality that refreshes the mobile-sections parts of RESTBase, meaning RESTBase will no longer have up-to-date content for those endpoints.
I guess it's about time I ask if it is ok to remove those exceptions now and return 403 to everyone for these endpoints.
Moving it to our radar too, as we intend to revisit various parts of all of this (e.g. how we do MultiVersion once we are no longer constrained by the legacy infra), but we don't have anything concrete right now.
Tue, Apr 2
- Use qemu to run x86_64 containers on an aarch64 VM (sketch below)
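A minimal sketch of that approach, assuming Docker on the aarch64 VM and qemu user-mode emulation registered via binfmt_misc; the image is just an example:

```
# Install user-mode qemu; the Debian package registers binfmt_misc handlers
# so foreign-architecture binaries run transparently under emulation.
sudo apt install qemu-user-static binfmt-support

# Pull and run an x86_64 image on the aarch64 host.
docker run --rm --platform linux/amd64 debian:bookworm uname -m
# Prints "x86_64" even though the host is aarch64.
```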
Mon, Apr 1
Let's start with the "easy" ones. I see feature flags in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/changeprop/templates/_config.yaml for
LWN has an article titled "The race to replace Redis". I am not going to link it directly, as it is LWN subscriber-only content, but I can summarize the "Forks and alternatives" section (note: I am pasting links in their entirety on purpose):
Mar 22 2024
I've already left various comments on the two docs. I am still going through the Miro board, but I can summarize the following:
Hi @MShilova_WMF. This is on my list for today, though it might spill into early next week. I've started the review, but I don't seem to have access to T358115 (linked from the description); could you please grant me access?
Mar 21 2024
Alerts are gone; I'll resolve this.
So, since I've never done this before (that I remember), please double-check me on this. Is it enough to just issue
Adding @brouberol, as they probably have way more experience refreshing kafka certificates than anyone in serviceops.
Related alerts in alerts.wikimedia.org have been silenced for 30 days (chosen arbitrarily), with a comment pointing to this task.
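For reference, a silence like that can be created with amtool; the matcher, URL, and comment below are illustrative, not the exact ones used:

```
# 30 days = 720h; the alertname matcher and task reference are placeholders.
amtool silence add \
  --alertmanager.url=http://alertmanager.example:9093 \
  --duration=720h \
  --comment='Known issue, see <task>' \
  alertname="SomeAlertName"
```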
Mar 20 2024
Mar 19 2024
We had to repool kartotherian in codfw, as we had a CPU exhaustion event in eqiad right after the services switchover. Since some kartotherian endpoints create an amplification effect on kartotherian itself, we opted to restart kartotherian in eqiad to fix that.
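For context, (re)pooling a service in a datacenter is typically a conftool operation; a sketch under the assumption that kartotherian is a DNS discovery service (the exact object names and values may differ):

```
# Sketch only: mark the codfw side of the kartotherian discovery record as pooled.
confctl --object-type discovery select 'dnsdisc=kartotherian,name=codfw' set/pooled=true
```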
I wanted to point out that as the migration progresses and the size of MediaWiki deployments in WikiKube increases, it is inevitable that the deployment times for MW-on-K8s will increase too. Right now, we upgrade to each new version in chunks of 3% (16d6e717a7a) of the total. This is a relatively recent development; in the past we upgraded in larger chunks, since the overall size of each deployment was smaller. I expect those numbers to increase further, but I also expect the numbers for scap deploying to "legacy" infrastructure to decrease. Not proportionally, of course.
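To make the chunking concrete with a purely hypothetical fleet size: at 1,000 MediaWiki pods, 3% chunks mean roughly 30 pods per chunk across ~34 chunks; doubling the fleet keeps the chunk count the same but doubles each chunk's size, so total rollout time grows roughly linearly with fleet size for a fixed chunk percentage.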
Mar 17 2024
Mar 10 2024
Mar 8 2024
Zotero is using the url downloader to access the internet. Its logs end up in logstash, e.g.
@JMeybohm is there anything left to do here? I think we can resolve.
Mar 6 2024
Almost all parsoid hosts have been reimaged as kubernetes nodes, with scandium, testreduce1002, parse1001 and parse1002 being the exceptions: the former two because it was requested in T357392#9546852, the latter two because we don't want to mess with the state of parsoid-php right before the SRE summit and DC switchover. I'll reword this task a bit, then resolve it and file a follow-up cleanup task for the last two nodes to reimage and related cleanups.
Mar 5 2024
So this is a difficult one to tackle. From what I gather, images (and layers) can end up being really large, close to 10GB. I have questions about how a pip install ends up consuming 10GB of disk space, of course, but the main issue is that images this large are going to cause problems down the road anyway, so this is probably unsustainable long term.
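To see where the bulk of an image's size comes from, the per-layer breakdown is usually the first stop; the image name below is a placeholder:

```
# Per-layer sizes for an image; a huge "pip install" layer will stand out here.
docker history --no-trunc some-registry/some-image:latest
```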
We are at ~50% for mw-parsoid right now.
I've added another 220 CPUs for codfw and 300 for eqiad; we should be good on this front. I'll resolve this in the interest of sparing someone else from doing so; feel free to reopen.
I've accounted for the cordoned nodes and indeed...
Looking at logstash in the Kubernetes events dashboard and fiddling a bit with the filtering, I finally see
Feb 29 2024
Feb 27 2024
Migration has started; we are batch 1 for the next few days.
The LVS traffic approach was doomed to fail, since scap utilizes the same data structure to figure out which hosts to deploy to. I've re-run the numbers on services_proxy and the parsoid cluster to make sure I ain't missing anything, and it appears that indeed the only direct clients are RESTBase and monitoring/healthchecks. So the services_proxy approach should work fine. I've updated the plan in the task and I'll start executing it.