
akosiaris (Alexandros Kosiaris)
Site Reliability Engineer · Administrator

Projects (25)

User Details

User Since
Oct 3 2014, 8:40 AM (499 w, 1 h)
Roles
Administrator
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Recent Activity

Yesterday

akosiaris added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Will parsoidtest1001 be installed with Bullseye? scandium is currently running buster, but all the mediawiki manifests are compatible with bullseye (cloudweb already runs it), and so is the component/php74.

Thu, Apr 25, 4:01 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
akosiaris updated the task description for T363399: Q4:rack/setup/install parsoidtest1001.
Thu, Apr 25, 4:00 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
akosiaris updated the task description for T363399: Q4:rack/setup/install parsoidtest1001.
Thu, Apr 25, 4:00 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
akosiaris added a project to T363402: parsoidtest1001 implementation tracking: Parsoid.

Adding @subbu for their information.

Thu, Apr 25, 1:17 PM · Parsoid, Patch-For-Review, serviceops

Wed, Apr 24

akosiaris added a member for Catalyst: akosiaris.
Wed, Apr 24, 2:43 PM
akosiaris added a comment to T255568: Envoy should listen on ipv6 and ipv4.

Since mesh.configuration 1.7, envoy on WikiKube and other kubernetes clusters listens on IPv6 and IPv4 for both the TLS terminator and the service mesh listeners. Charts are slowly being updated. On the kubernetes side, once all charts are updated, we'll be done.

Wed, Apr 24, 2:41 PM · Patch-For-Review, envoy, observability, serviceops
akosiaris added a comment to T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

I've also just run kubelet 1.23 in standalone mode talking to containerd, and indeed processes in containers run with the cri-containerd.apparmor.d apparmor profile.

Wed, Apr 24, 2:38 PM · Patch-For-Review, serviceops, Prod-Kubernetes
akosiaris updated the task description for T362408: Migration to containerd and away from docker.
Wed, Apr 24, 2:00 PM · Prod-Kubernetes, Kubernetes, serviceops
akosiaris updated the task description for T362408: Migration to containerd and away from docker.
Wed, Apr 24, 1:59 PM · Prod-Kubernetes, Kubernetes, serviceops
akosiaris added a comment to T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

Adding for bookwork

Wed, Apr 24, 11:44 AM · Patch-For-Review, serviceops, Prod-Kubernetes
akosiaris added a comment to T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

Adding as info since it was requested in T362408#9712356

Wed, Apr 24, 11:23 AM · Patch-For-Review, serviceops, Prod-Kubernetes
akosiaris added a comment to T362408: Migration to containerd and away from docker.

@akosiaris could you please double check in your test environment that containerd will still enforce the default apparmor profile (see Remove apparmor.security.beta.kubernetes.io/defaultProfileName in T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21) like docker currently does?

Wed, Apr 24, 11:22 AM · Prod-Kubernetes, Kubernetes, serviceops

Tue, Apr 23

akosiaris closed T363086: ManagementSSHDown parse1002.eqiad.wmnet as Resolved.

I am resolving, hopefully we won't see a recurrence.

Tue, Apr 23, 3:05 PM · SRE, ops-eqiad
akosiaris closed T363086: ManagementSSHDown parse1002.eqiad.wmnet, a subtask of T361396: 1.43.0-wmf.2 deployment blockers, as Resolved.
Tue, Apr 23, 3:04 PM · Patch-For-Review, User-brennen, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
akosiaris added a comment to T363086: ManagementSSHDown parse1002.eqiad.wmnet.

I've just uncordoned it; it should receive mediawiki payloads in the next deployment. I've also checked and it's again a scap target for the kubernetes-workers group.

Tue, Apr 23, 3:03 PM · SRE, ops-eqiad
akosiaris added a comment to T363086: ManagementSSHDown parse1002.eqiad.wmnet.

@akosiaris @hashar I reset the iDRAC with no change; I will need to reboot the server and hook a crash cart up to it. Please advise if I am able to reboot.

Tue, Apr 23, 2:12 PM · SRE, ops-eqiad
akosiaris added a comment to T362681: Provide nodejs20 base images for production.

That's not a problem. We should just use the nodesource packages for this; we've been doing the same before for "intermediate LTSes" (e.g. node 16 or node 14) not covered by an in-tree Debian nodejs version. I'll work on this next week.

Tue, Apr 23, 9:29 AM · serviceops
akosiaris added a comment to T363086: ManagementSSHDown parse1002.eqiad.wmnet.

parse1002.eqiad.wmnet is down / unreachable but is still in the pool of hosts to deploy to. That has caused the MediaWiki train to fail overnight and is causing every MediaWiki deployment to error out due to a timeout when trying to reach that host.

Can someone please remove the host from the pool of MediaWiki target hosts? Thanks!

Tue, Apr 23, 9:08 AM · SRE, ops-eqiad

Thu, Apr 18

akosiaris updated subscribers of T362681: Provide nodejs20 base images for production.

nodejs20 isn't even on trixie/sid right now https://packages.debian.org/trixie/nodejs, https://packages.debian.org/sid/nodejs but only in experimental.

Thu, Apr 18, 3:43 PM · serviceops
akosiaris added a comment to T120242: Eventually-Consistent MediaWiki state change events | MediaWiki events as source of truth.

Commenting here as well at the request of @Ottomata in T249745#9725953

Thu, Apr 18, 3:37 PM · Data-Engineering, Analytics, DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Event-Platform, Services (later)
akosiaris added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

For replicating state changes (T120242) [...]

Why though? Why is 99.9999% (or 99.999999% or 99.99%) not enough?

There is a "Why do we need this?" section in T120242's description. Let's keep this discussion there?

Thu, Apr 18, 3:37 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error

Wed, Apr 17

akosiaris added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

see the CAP theorem

C != eventual-C. Eventual Consistency + AP is feasible and done often.

Wed, Apr 17, 4:39 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
akosiaris added a comment to T362628: Find a way to stage updated OS packages on wikikube.
  • We have staging base images and staging service images updated daily based on what is in Debian and apt.wikimedia.org
Wed, Apr 17, 3:07 PM · Release-Engineering-Team, serviceops, MW-on-K8s, Scap

Mon, Apr 15

akosiaris created T362568: Drop backports from base images.
Mon, Apr 15, 5:47 PM · Infrastructure-Foundations, serviceops
akosiaris created T362567: Add docker production images repo to codesearch.
Mon, Apr 15, 5:40 PM · Patch-For-Review, VPS-project-Codesearch, serviceops
akosiaris lowered the priority of T362518: Deprecate buster-backports from Unbreak Now! to High.

The immediate issue blocking the train has been resolved and new images have been pushed. Hence, lowering to High. There's a tail of images being rebuilt still and it's going to take a while longer, but this is no longer a UBN

Mon, Apr 15, 2:34 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops

Fri, Apr 12

akosiaris created T362408: Migration to containerd and away from docker.
Fri, Apr 12, 1:44 PM · Prod-Kubernetes, Kubernetes, serviceops
akosiaris added a comment to T362239: Reformat IRC alerts to be more useful.

Thanks for tackling this!

Fri, Apr 12, 11:12 AM · Patch-For-Review, Observability-Alerting
akosiaris awarded T362239: Reformat IRC alerts to be more useful a Love token.
Fri, Apr 12, 11:03 AM · Patch-For-Review, Observability-Alerting

Thu, Apr 11

akosiaris added a comment to T360907: Can we please add our vendor to Google Postmaster Tools.

I don't think SRE has ever administered Google Postmaster Tools at all. In fact, a quick cross-check in the team showcases almost utter ignorance of the product, although we'll ask internally a bit more. May I suggest reaching out to ITS too?

Thu, Apr 11, 1:26 PM · SRE-Access-Requests, Fundraising-Backlog

Tue, Apr 9

akosiaris added a comment to T360636: Phase out cergen for ServiceOps services.

I'll finish parsoid and testreduce in T359387

If I'm not mistaken testreduce is still unrelated, it's for the round trip tests that have been split off to a separate Ganeti VM some time ago (and was moved to Bookworm due to nodejs requirements last year)?

Tue, Apr 9, 9:49 AM · Patch-For-Review, serviceops, Epic, SRE

Mon, Apr 8

akosiaris added a comment to T360636: Phase out cergen for ServiceOps services.

I'll finish parsoid and testreduce in T359387

Mon, Apr 8, 4:00 PM · Patch-For-Review, serviceops, Epic, SRE
akosiaris added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

This doesn't mean that MediaWiki shouldn't try to improve the situation by handling the failure to submit a job by saving it somewhere (a specific db table?) so that we can replay it later. At the current failure rate, this would guarantee the jobs would be executed with an irrelevant cost in terms of resources.
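
The fallback described in this quote (persist a job whose submission failed, replay it later) can be sketched roughly as follows. This is a minimal illustration and not MediaWiki's implementation: the `failed_jobs` table, the `submit` callable, the function names, and the use of SQLite are all assumptions made for the example.

```python
import json
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) table that holds jobs we failed to submit."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failed_jobs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, job: dict,
            submit: Callable[[dict], None]) -> bool:
    """Try to submit a job; if submission raises, store it for later replay."""
    try:
        submit(job)
        return True
    except Exception:  # broad on purpose: any submission failure is recorded
        conn.execute("INSERT INTO failed_jobs (payload) VALUES (?)",
                     (json.dumps(job),))
        conn.commit()
        return False


def replay(conn: sqlite3.Connection, submit: Callable[[dict], None]) -> int:
    """Resubmit stored jobs in insertion order, stopping at the first failure."""
    replayed = 0
    rows = conn.execute(
        "SELECT id, payload FROM failed_jobs ORDER BY id").fetchall()
    for row_id, payload in rows:
        try:
            submit(json.loads(payload))
        except Exception:
            break  # backend still unavailable; keep the rest for next time
        conn.execute("DELETE FROM failed_jobs WHERE id = ?", (row_id,))
        conn.commit()
        replayed += 1
    return replayed
```

Only failed submissions touch the table here, so the extra cost scales with the failure rate rather than with total job volume.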

@Joe this sounds sort of similar to the Outbox solution described in T120242, albeit only for failed submissions instead of all of them. Functionally this sounds like a nice solution to the eventual consistency problem described there, but I'd expect it would add some latency to the user response (waiting for ACK from EventGate+Kafka). Actually it sounds more like this (discarded?) solution, except:

Mon, Apr 8, 2:40 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error

Thu, Apr 4

akosiaris added a comment to T328036: MCS decommission (2023).

In T361483, I've been poking into selectively killing parts of changeprop that are no longer used. I am still in the /hopefully easy pickings/ phase, attacking things we KNOW aren't used any more. I am now targeting the removal of the changeprop functionality that refreshes all the mobile-sections parts of RESTBase, meaning RESTBase will no longer have up-to-date content for these endpoints.

Thu, Apr 4, 1:09 PM · Essential-Work, Content-Transform-Team-WIP, Mobile-Content-Service
akosiaris added a comment to T361483: Selectively disable changeprop functionality that is no longer used.

Hello!

The ores_cache job should be defined but disabled in the running config, we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).

You are correct. I'll post a patch then to remove it. Thanks!

For Lift Wing, we just use CP to call inference.discovery.wmnet, no restbase involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".

This is probably something we want to move away from Changeprop then and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think that there is no code that is specific to LiftWing, just standard reaction to events on kafka.

No problem for me! I can only see one issue, and this is something not specific to our topics: if we start another job in cp-jobqueue, the kafka consumer offset will be reset to whatever is the last element in the topic, and we'll potentially lose events in the stream. It is not a huge deal since at the moment nothing incredibly critical relies on them, but IIRC Search uses one of the running topics to update Elastic Search. If we move everything over we'd need to sync with them and figure out if a "hole" in the stream is acceptable, otherwise the only thing that I can think of is:

  • stop the changeprop rule for the lift wing topic that Search uses.
  • write down the offset of the related consumer group using the kafka api (IIRC it should be possible)
  • create another consumer group in cp-jobqueue with the same initial offset (this is not super difficult but I have never done it).
  • add the rule to cp-jobqueue and check if it works.
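
The concern and the four steps above can be illustrated with a toy in-memory model of a topic and its consumer-group offsets. It deliberately uses no Kafka client library; the `Topic` class, the group names, and the event names are all hypothetical and only mimic the offset bookkeeping being discussed.

```python
class Topic:
    """Toy single-partition topic; this is not a Kafka client."""

    def __init__(self):
        self.events = []    # append-only log
        self.offsets = {}   # consumer group -> next offset to read

    def produce(self, event):
        self.events.append(event)

    def join_latest(self, group):
        # A new group that starts at the end of the log never sees older events.
        self.offsets[group] = len(self.events)

    def join_at(self, group, offset):
        # A new group seeded with an explicit starting offset.
        self.offsets[group] = offset

    def poll(self, group):
        start = self.offsets[group]
        self.offsets[group] = len(self.events)
        return self.events[start:]


def demo():
    topic = Topic()
    topic.join_latest("changeprop")
    topic.produce("rev-1")
    topic.poll("changeprop")              # the old rule consumes rev-1

    # Step 1: the old rule is stopped, but events keep arriving.
    topic.produce("rev-2")
    topic.produce("rev-3")

    # Step 2: write down the old group's offset.
    saved = topic.offsets["changeprop"]

    # Naive move: a fresh group that starts at the end of the topic.
    topic.join_latest("cp-jobqueue-naive")
    # Step 3: a fresh group seeded with the saved offset instead.
    topic.join_at("cp-jobqueue", saved)

    # Step 4: the new rule starts consuming.
    topic.produce("rev-4")
    return topic.poll("cp-jobqueue-naive"), topic.poll("cp-jobqueue")
```

In this model the naive group only ever sees `rev-4`, which is the "hole" in the stream, while the group seeded with the saved offset also receives `rev-2` and `rev-3`.
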
Thu, Apr 4, 1:07 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTBase Deprecation Roadmap)
akosiaris added a comment to T361483: Selectively disable changeprop functionality that is no longer used.

Next up: mobile-sections. It has been deprecated per T328036 for a long time now. I'll remove the rules updating mobile-sections endpoints. That should be fine for external users; we have been returning 403 to almost everyone for many months now (exceptions are still around for Kiwix and Wikiwand, T340036).

Thu, Apr 4, 1:01 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTBase Deprecation Roadmap)
akosiaris added a comment to T340036: Setup allowed list for MCS decom.

I guess it's about time I ask if it is ok to remove those exceptions now and return 403 to everyone for these endpoints.

Thu, Apr 4, 12:57 PM · affects-Kiwix-and-openZIM, Content-Transform-Team-WIP, RESTBase Sunsetting, SRE, serviceops, Traffic, Mobile-Content-Service
akosiaris edited projects for T360403: Helm deployment of MediaWiki now takes 6 minutes, added: serviceops-radar; removed serviceops.

Moving it to our radar too as we intend to revisit various parts of all of this (e.g. how we do MultiVersion once we are no longer constrained by the legacy infra), but we don't have something concrete right now.

Thu, Apr 4, 10:06 AM · serviceops-radar, Release-Engineering-Team (Radar), MW-on-K8s
akosiaris updated the task description for T361483: Selectively disable changeprop functionality that is no longer used.
Thu, Apr 4, 8:12 AM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTBase Deprecation Roadmap)

Tue, Apr 2

akosiaris updated subscribers of T309772: npm audit reports several security issues with Service runner.

The last remaining original Services member left in 2022.

Tue, Apr 2, 3:45 PM · MediaWiki-Engineering, CX-cxserver, Security, service-runner
akosiaris added a comment to T360804: macOS aarch64 support.
  • Use qemu to run x86_64 containers on an aarch64 VM
Tue, Apr 2, 1:39 PM · ARM support, Infrastructure-Foundations
akosiaris added a comment to T361483: Selectively disable changeprop functionality that is no longer used.

Hello!

The ores_cache job should be defined but disabled in the running config, we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).

Tue, Apr 2, 1:11 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTBase Deprecation Roadmap)

Mon, Apr 1

akosiaris added projects to T361483: Selectively disable changeprop functionality that is no longer used: ORES, Lift-Wing.

Let's start with the "easy" ones. I see feature flags in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/changeprop/templates/_config.yaml for

Mon, Apr 1, 4:42 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTBase Deprecation Roadmap)
akosiaris added a subtask for T262315: <CORE TECHNOLOGY> API Migration & RESTbase Sunset: T361483: Selectively disable changeprop functionality that is no longer used.
Mon, Apr 1, 4:27 PM · API Platform (RESTBase Deprecation Roadmap), Epic, Foundational Technology Requests, Code-Health, Platform Engineering Roadmap, Platform Engineering Roadmap Decision Making
akosiaris added a parent task for T361483: Selectively disable changeprop functionality that is no longer used: T262315: <CORE TECHNOLOGY> API Migration & RESTbase Sunset.
Mon, Apr 1, 4:27 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTBase Deprecation Roadmap)
akosiaris created T361483: Selectively disable changeprop functionality that is no longer used.
Mon, Apr 1, 4:27 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTBase Deprecation Roadmap)
akosiaris added a comment to T360596: Figure out a plan to move forward with regarding Redis License changes.

LWN has an article titled "The race to replace Redis". I am not going to link directly as it is LWN subscriber only content but I can summarize (note I am pasting links in their entirety on purpose) the "Forks and alternatives" section:

Mon, Apr 1, 12:57 PM · GitLab (Infrastructure), Patch-For-Review, User-aborrero, serviceops, MediaWiki-Platform-Team (Radar), collaboration-services, Release-Engineering-Team (Radar), Quarry, Toolforge, Software-Licensing, Infrastructure-Foundations, netbox, Platform Team Initiatives (API Gateway), ChangeProp, MediaWiki-File-management, SRE

Mar 22 2024

akosiaris added a comment to T358577: Service Ops Review of Metrics Platform Configuration Management UI.

I've already left various comments on the 2 docs. I am still going through the Miro board, but I can summarize the following:

Mar 22 2024, 12:36 PM · Data Products (Data Products Sprint 12), serviceops
akosiaris added a comment to T358577: Service Ops Review of Metrics Platform Configuration Management UI.

Hi @MShilova_WMF. This is on my list for today; it might spill into early next week though. I've started the review but I don't seem to have access to T358115 (linked from the description). Could you please grant me access?

Mar 22 2024, 7:16 AM · Data Products (Data Products Sprint 12), serviceops

Mar 21 2024

akosiaris updated the task description for T360637: Bump memory for registry[12]00[34] VMs.
Mar 21 2024, 2:49 PM · Patch-For-Review, serviceops, Machine-Learning-Team
akosiaris added a comment to T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images.

@akosiaris thanks a lot for all the details, really appreciated, now I have a better understanding of the problem :)

I have a proposal to unblock my team, let me know what you think about it. On the ML side, we are doing the following:

  • Try to reduce pytorch's size, understanding if we can drop something (for example, supporting fewer GPUs, etc.). We are logging work in T359569, but I am not super confident that we'll be able to get a significant reduction without coming up with a very complicated and difficult-to-maintain custom build process (like a custom Python wheel to store somewhere, long build times to recreate pytorch when needed in CI, etc).
Mar 21 2024, 1:06 PM · Machine-Learning-Team
akosiaris closed T360598: kafka-main certificates expiring on 2024-04-04 as Resolved.

Alerts are gone, I'll resolve this.

Mar 21 2024, 12:22 PM · Data-Platform-SRE, Data-Engineering, serviceops
akosiaris renamed T360594: an-worker1168 in a weird statue, possibly due to I/O errors from an-worker1168 in a weird statue, possiblye due to I/O errors to an-worker1168 in a weird statue, possibly due to I/O errors.
Mar 21 2024, 11:12 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
akosiaris added a comment to T360598: kafka-main certificates expiring on 2024-04-04.

So, since I've never done this before (that I remember), please double-check me on this. Is it just enough to issue

Mar 21 2024, 9:33 AM · Data-Platform-SRE, Data-Engineering, serviceops
akosiaris added a comment to T360598: kafka-main certificates expiring on 2024-04-04.
brouberol@kafka-main2001:~$ echo y | openssl s_client -connect $(hostname -f):9093  | openssl x509 -issuer -nout
x509: Unrecognized flag nout
x509: Use -help for summary.
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka
verify return:1
depth=0 CN = kafka-main2001.codfw.wmnet
verify return:1
DONE

The runbook mentions

If the CA mentioned is:

  • the Puppet one, then you'll need to follow Cergen#Update_a_certificate and deploy the new certificate to all nodes.
  • the Kafka PKI Intermediate one, then in theory a new certificate should be issued a few days before the expiry and puppet should replace the Kafka keystore automatically (under /etc/kafka/ssl).

@akosiaris do you happen to know which one it is in that case? It's not obvious to me. Thanks!

I'd tend to say Kafka PKI Intermediate due to depth=1 CN=kafka but a confirmation would be perfect.
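
As a side note on the transcript, the `openssl x509` flag that suppresses the encoded certificate is `-noout`; the `-nout` spelling is what triggers the "Unrecognized flag" error. Independently of that, the `depth=N` lines that `s_client` already printed carry enough information to read the chain. The sketch below parses those lines (copied from the transcript) and extracts the CN at each depth; the helper name `chain_common_names` is made up for this example, and the result only restates what the chain shows rather than confirming which CA the runbook means.

```python
import re

# Verification lines copied from the s_client output in the transcript.
SAMPLE = """\
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka
verify return:1
depth=0 CN = kafka-main2001.codfw.wmnet
verify return:1
"""

# Matches "depth=<n> ... CN = <name>" where the CN is the last field on the line.
DEPTH_RE = re.compile(r"^depth=(\d+)\s+.*?\bCN\s*=\s*(\S+)\s*$")


def chain_common_names(output):
    """Map each chain depth to the CN of the certificate at that depth."""
    names = {}
    for line in output.splitlines():
        match = DEPTH_RE.match(line.strip())
        if match:
            names[int(match.group(1))] = match.group(2)
    return names


names = chain_common_names(SAMPLE)
# depth=0 is the server (leaf) certificate; depth=1 is the certificate that signed it.
leaf_issuer_cn = names[1]
```

With the sample above, `leaf_issuer_cn` is `kafka`, which is the observation the reply is based on.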

Mar 21 2024, 9:31 AM · Data-Platform-SRE, Data-Engineering, serviceops
akosiaris triaged T360598: kafka-main certificates expiring on 2024-04-04 as High priority.

Adding @brouberol as they probably have way more experience refreshing kafka certificates than anyone in serviceops.

Mar 21 2024, 9:16 AM · Data-Platform-SRE, Data-Engineering, serviceops
akosiaris created T360598: kafka-main certificates expiring on 2024-04-04.
Mar 21 2024, 9:14 AM · Data-Platform-SRE, Data-Engineering, serviceops
akosiaris added a project to T360596: Figure out a plan to move forward with regarding Redis License changes: netbox.
Mar 21 2024, 8:59 AM · GitLab (Infrastructure), Patch-For-Review, User-aborrero, serviceops, MediaWiki-Platform-Team (Radar), collaboration-services, Release-Engineering-Team (Radar), Quarry, Toolforge, Software-Licensing, Infrastructure-Foundations, netbox, Platform Team Initiatives (API Gateway), ChangeProp, MediaWiki-File-management, SRE
akosiaris created T360596: Figure out a plan to move forward with regarding Redis License changes.
Mar 21 2024, 8:58 AM · GitLab (Infrastructure), Patch-For-Review, User-aborrero, serviceops, MediaWiki-Platform-Team (Radar), collaboration-services, Release-Engineering-Team (Radar), Quarry, Toolforge, Software-Licensing, Infrastructure-Foundations, netbox, Platform Team Initiatives (API Gateway), ChangeProp, MediaWiki-File-management, SRE
akosiaris added a comment to T360594: an-worker1168 in a weird statue, possibly due to I/O errors.

Related alerts in alerts.wikimedia.org have been silenced for 30 days (chosen arbitrarily) with a comment pointing to this task.

Mar 21 2024, 8:02 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
akosiaris updated the task description for T360594: an-worker1168 in a weird statue, possibly due to I/O errors.
Mar 21 2024, 7:54 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
akosiaris created T360594: an-worker1168 in a weird statue, possibly due to I/O errors.
Mar 21 2024, 7:51 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)

Mar 20 2024

akosiaris added a comment to T358738: Commons thumbnails are broken for certain large sizes of thumbnail images.

I thought there was no cross-DC replication of thumbnails. T299125#8221206 seems to support that. So it's expected that a bad file created by T344233 would only affect one swift DC.

Mar 20 2024, 9:00 AM · SRE-swift-storage, serviceops, Commons

Mar 19 2024

akosiaris added a comment to T357547: ☂️ Northward Datacentre Switchover (March 2024) .

We had to repool kartotherian in codfw as we had a CPU exhaustion event in eqiad right after the services switchover. Since some kartotherian endpoints create an amplification effect to kartotherian itself, we opted for restarting kartotherian in eqiad to fix that.

Mar 19 2024, 3:56 PM · Patch-For-Review, Datacenter-Switchover, Data-Persistence, SRE Observability (FY2023/2024-Q3), collaboration-services, observability, serviceops, DC-Ops, Traffic
akosiaris added a comment to T358738: Commons thumbnails are broken for certain large sizes of thumbnail images.

ping @akosiaris Ideas on why codfw is out of date and won't correct ? Is it out of rotation or something ?

Mar 19 2024, 9:29 AM · SRE-swift-storage, serviceops, Commons
akosiaris added a comment to T360403: Helm deployment of MediaWiki now takes 6 minutes.

I wanted to point out that as the migration progresses and the size of MediaWiki deployments in WikiKube increases, it is inevitable that the deployment times for MW-on-K8s will increase too. Right now, we upgrade to each new version in chunks of 3% (16d6e717a7a) of the total. This is a relatively recent development; in the past we upgraded in larger chunks, since the overall size of each deployment was smaller. I expect those numbers to increase more, but I also expect the numbers for scap deploying to "legacy" infrastructure to decrease. Not proportionally of course.
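
To make the chunking arithmetic concrete, here is a small sketch of how a fixed-percentage rollout splits a deployment into batches. It assumes each batch is the chunk percentage of the total rounded up, with a minimum of one replica; the function name and the replica counts are made up for the example and do not come from the actual charts.

```python
import math


def rollout_batches(total_replicas, chunk_percent):
    """Split a rollout into batches of chunk_percent of the total (rounded up)."""
    batch_size = max(1, math.ceil(total_replicas * chunk_percent / 100))
    batches = []
    remaining = total_replicas
    while remaining > 0:
        step = min(batch_size, remaining)
        batches.append(step)
        remaining -= step
    return batches


# Hypothetical deployment of 200 replicas upgraded in 3% chunks.
batches = rollout_batches(200, 3)
```

Under these assumptions a 200-replica deployment is upgraded in 34 batches (33 of 6 replicas and a final one of 2), so a smaller chunk percentage means more batches for the same deployment.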

Mar 19 2024, 9:23 AM · serviceops-radar, Release-Engineering-Team (Radar), MW-on-K8s

Mar 17 2024

MSantos awarded T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it a Love token.
Mar 17 2024, 1:51 PM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Mar 10 2024

akosiaris added a comment to T323169: Internal Server Errors from Zotero with nytimes.com.

Is the HTTP response body for those 403s saved anywhere?

Mar 10 2024, 8:11 AM · Citoid, Cite, VisualEditor

Mar 8 2024

akosiaris added a comment to T323169: Internal Server Errors from Zotero with nytimes.com.

Zotero is using url downloader to access the internet. Its logs end up in logstash e.g.

Mar 8 2024, 2:32 PM · Citoid, Cite, VisualEditor
akosiaris added a comment to T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images.
Mar 8 2024, 1:18 PM · Machine-Learning-Team
akosiaris added a comment to T256762: Fix nginx config and caching for docker registry .

@JMeybohm is there anything left to do here? I think we can resolve.

Mar 8 2024, 12:53 PM · serviceops, Kubernetes, SRE
akosiaris merged task T307797: Clean-up / delete old versions of service pipeline created docker images from the public docker registry? into T242604: Remove obsoleted docker images.
Mar 8 2024, 12:51 PM · User-MoritzMuehlenhoff, Release Pipeline, serviceops
akosiaris merged T307797: Clean-up / delete old versions of service pipeline created docker images from the public docker registry? into T242604: Remove obsoleted docker images.
Mar 8 2024, 12:51 PM · Release-Engineering-Team (Radar), Upstream, User-brennen, SRE, Release Pipeline, serviceops

Mar 6 2024

akosiaris triaged T359387: Cleanup parsoid-php service as Low priority.
Mar 6 2024, 2:54 PM · Parsoid (Tracking), Patch-For-Review, serviceops
akosiaris closed T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Mar 6 2024, 2:34 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris closed T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it as Resolved.
Mar 6 2024, 2:34 PM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris created T359387: Cleanup parsoid-php service.
Mar 6 2024, 2:33 PM · Parsoid (Tracking), Patch-For-Review, serviceops
akosiaris closed T358752: Reimage parse* hosts as kubernetes nodes as Resolved.
Mar 6 2024, 2:26 PM · Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris closed T358752: Reimage parse* hosts as kubernetes nodes, a subtask of T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it, as Resolved.
Mar 6 2024, 2:24 PM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris added a comment to T358752: Reimage parse* hosts as kubernetes nodes.

Almost all parsoid hosts have been reimaged as kubernetes nodes, with scandium, testreduce1002, parse1001 and parse1002 being the exceptions. The former 2 because it was requested in T357392#9546852, the other 2 because we don't want to mess with the state of parsoid-php right before the SRE summit and DC switchover. I'll reword this task a bit and then resolve it and file a cleanup follow-up task for the last 2 nodes to reimage and related cleanups.

Mar 6 2024, 2:23 PM · Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris updated the task description for T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.
Mar 6 2024, 2:20 PM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris updated the task description for T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.
Mar 6 2024, 7:37 AM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Mar 5 2024

akosiaris added a comment to T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images.

So this is a difficult one to tackle. From what I gather, images (and layers) can end up being really large, close to 10GB. I have questions regarding how a pip install ends up consuming 10GB of disk space, of course, but the main issue here is probably that this is going to cause issues down the road anyway. So that is probably unsustainable long term.

Mar 5 2024, 4:40 PM · Machine-Learning-Team
akosiaris added a comment to T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.

We are at ~50% mw-parsoid right now.

Mar 5 2024, 2:41 PM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris updated the task description for T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.
Mar 5 2024, 2:38 PM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris closed T359114: Slow and failed deployments as Resolved.

I've added another 220 CPUs for codfw and 300 for eqiad; we should be good on this front. I'll resolve in the interest of sparing someone else from doing so; feel free to reopen.

Mar 5 2024, 10:33 AM · serviceops, MW-on-K8s
akosiaris closed T359114: Slow and failed deployments, a subtask of T354439: 1.42.0-wmf.21 deployment blockers, as Resolved.
Mar 5 2024, 10:32 AM · Release-Engineering-Team (Now this 🫠), Release, Train Deployments
akosiaris added a comment to T359114: Slow and failed deployments.

I've accounted for the cordoned nodes and indeed...

Mar 5 2024, 9:54 AM · serviceops, MW-on-K8s
akosiaris added a comment to T359114: Slow and failed deployments.

I've crafted this dashboard https://grafana-rw.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=codfw&var-prometheus=k8s&from=now-24h&to=now

Mar 5 2024, 8:33 AM · serviceops, MW-on-K8s
akosiaris added a comment to T359114: Slow and failed deployments.

Looking at logstash in the Kubernetes events dashboard and fiddling a bit with the filtering I finally see

Mar 5 2024, 7:37 AM · serviceops, MW-on-K8s

Feb 29 2024

akosiaris created T358752: Reimage parse* hosts as kubernetes nodes.
Feb 29 2024, 10:59 AM · Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris updated the task description for T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.
Feb 29 2024, 8:13 AM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Feb 27 2024

akosiaris triaged T358588: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle as High priority.
Feb 27 2024, 2:24 PM · Essential-Work, Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris created T358588: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle.
Feb 27 2024, 2:24 PM · Essential-Work, Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris added a comment to T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.

Migration started, we are at batch 1 for the next few days.

Feb 27 2024, 11:52 AM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris updated the task description for T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.
Feb 27 2024, 11:52 AM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris added a comment to T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.

The LVS traffic approach was doomed to fail, since scap utilizes the same data structure to figure out which hosts to deploy to. I've re-run the numbers on services_proxy and the parsoid cluster to make sure I am not missing anything, and it appears that indeed the only direct clients are RESTBase and monitoring/healthchecks. So, the services_proxy approach should work fine. I've updated the plan in the task and I'll start executing it.

Feb 27 2024, 11:16 AM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
akosiaris updated the task description for T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.
Feb 27 2024, 11:14 AM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Feb 26 2024

akosiaris awarded T345274: Remove similar-users service from k8s a Love token.
Feb 26 2024, 12:57 PM · Patch-For-Review, Similarusers, serviceops

Feb 22 2024

akosiaris renamed T357392: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it from Create parsoid mediawiki deployment to Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it.
Feb 22 2024, 3:11 PM · Patch-For-Review, Content-Transform-Team, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s