
SRE (Group · Active · Public)

Recent Activity

Today

dcausse lowered the priority of T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw from Unbreak Now! to Medium.

Completion traffic is now served from codfw, which has proper indices, so lowering the priority.

Fri, Apr 26, 10:03 AM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
hashar placed T363086: ManagementSSHDown parse1002.eqiad.wmnet up for grabs.

Removing assignee that was automatically set by Phabricator when the task got marked as resolved.

Fri, Apr 26, 9:56 AM · SRE, ops-eqiad
hashar merged T363551: ManagementSSHDown into T363086: ManagementSSHDown parse1002.eqiad.wmnet.
Fri, Apr 26, 9:55 AM · SRE, ops-eqiad
hashar reopened T363086: ManagementSSHDown parse1002.eqiad.wmnet as "Open".

scap does the docker pull on every k8s worker defined in the kubernetes-workers group, and parse1002 is in that group:

deploy1002$ grep -R parse1002 /etc/dsh/group
/etc/dsh/group/kubernetes-workers:parse1002.eqiad.wmnet
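
The same membership check can be sketched in Python. This is a minimal illustration, not part of scap: the function name `groups_containing` is hypothetical, and it assumes each dsh group file lists one hostname per line with optional `#` comments.

```python
from pathlib import Path


def groups_containing(host: str, group_dir: str = "/etc/dsh/group") -> list[str]:
    """Return the names of dsh group files that list `host` exactly.

    Assumption: one hostname per line, '#' starts a comment.
    """
    matches = []
    for path in sorted(Path(group_dir).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            # Drop any trailing comment and surrounding whitespace.
            entry = line.split("#", 1)[0].strip()
            if entry == host:
                matches.append(path.name)
                break  # one match per group file is enough
    return matches
```

Unlike `grep -R`, this matches whole hostnames only, so a query for `parse1002.eqiad.wmnet` would not also match a longer name that merely contains it.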
Fri, Apr 26, 9:55 AM · SRE, ops-eqiad
hashar merged task T363551: ManagementSSHDown into T363086: ManagementSSHDown parse1002.eqiad.wmnet.
Fri, Apr 26, 9:54 AM · SRE, ops-eqiad
Stashbot added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:54:21Z] <dcausse@deploy1002> Finished scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] (duration: 17m 57s)

Fri, Apr 26, 9:54 AM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
Stashbot added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:41:30Z] <dcausse@deploy1002> dcausse and ebernhardson: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Fri, Apr 26, 9:41 AM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
Stashbot added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:36:24Z] <dcausse@deploy1002> Started scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]]

Fri, Apr 26, 9:36 AM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
gerritbot added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Change #1024478 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

Fri, Apr 26, 9:36 AM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
dcausse triaged T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw as Unbreak Now! priority.

This is still happening, raising to UBN

Fri, Apr 26, 9:16 AM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
gerritbot added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Change #1024478 restored by DCausse:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

Fri, Apr 26, 9:06 AM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
ops-monitoring-bot added a comment to T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2003.codfw.wmnet with OS bullseye completed:

  • kubestagemaster2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404260748_jayme_2095943_kubestagemaster2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Fri, Apr 26, 8:33 AM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2003.codfw.wmnet with OS bullseye

Fri, Apr 26, 7:30 AM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes
Maintenance_bot added a project to T363551: ManagementSSHDown: SRE.
Fri, Apr 26, 7:29 AM · SRE, ops-eqiad
gerritbot added a comment to T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver.

Change #1024543 merged by JMeybohm:

[operations/puppet@production] kubestagemaster2003: Add as insetup::serviceops

https://gerrit.wikimedia.org/r/1024543

Fri, Apr 26, 7:24 AM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes
gerritbot added a project to T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver: Patch-For-Review.
Fri, Apr 26, 7:21 AM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes
gerritbot added a comment to T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver.

Change #1024543 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubestagemaster2003: Add as insetup::serviceops

https://gerrit.wikimedia.org/r/1024543

Fri, Apr 26, 7:21 AM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes

Yesterday

matmarex added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Never mind, they just said it's fixed :)

Thu, Apr 25, 9:58 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
matmarex added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

There's someone reporting that they're still not seeing the expected results for some queries, although I can't reproduce: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#c-2804:F14:8092:9F01:3468:323E:5807:DBA8-20240425214000-Matma_Rex-20240425212500

Thu, Apr 25, 9:58 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
gerritbot added a comment to T362421: magru network setup.

Change #1024516 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] magru: update edgeuno transit IP

https://gerrit.wikimedia.org/r/1024516

Thu, Apr 25, 9:43 PM · Patch-For-Review, netops, SRE, Infrastructure-Foundations
ops-monitoring-bot created T363522: Degraded RAID on aqs1014.
Thu, Apr 25, 8:43 PM · SRE, ops-eqiad
gerritbot added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Change #1024478 abandoned by Ebernhardson:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

Reason:

rebuild only took 45 minutes, decided not to shuffle while it was in progress

https://gerrit.wikimedia.org/r/1024478

Thu, Apr 25, 8:28 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
EBernhardson added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Decided against shuffling traffic; the rebuild is almost complete already for enwiki. I can see in the logs where the enwiki eqiad build jumped from 44% to complete, but no reason why. Nothing in logstash for that period either. I've created T363521 to put something in place to prevent this in the future.

Thu, Apr 25, 8:20 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
gerritbot added a comment to T360439: Phase out cergen for Search Platform services.

Change #1024481 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: Configure alerts for short-lived certs

https://gerrit.wikimedia.org/r/1024481

Thu, Apr 25, 7:44 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
Stashbot added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Mentioned in SAL (#wikimedia-operations) [2024-04-25T19:33:12Z] <ebernhardson> T363516 started manual rebuild of enwiki titlesuggest indices in eqiad

Thu, Apr 25, 7:33 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
Gehel moved T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw from Incoming to In Progress on the Discovery-Search (Current work) board.
Thu, Apr 25, 7:33 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
Gehel added a project to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw: Discovery-Search (Current work).
Thu, Apr 25, 7:32 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
gerritbot added a project to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw: Patch-For-Review.
Thu, Apr 25, 7:31 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
gerritbot added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Change #1024478 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

Thu, Apr 25, 7:31 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
EBernhardson added a comment to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw.

Hmm, I can confirm this is happening. The completion index is rebuilt every day in each datacenter. Usually the two are the same, but somehow the eqiad index is about half the size of the codfw index (6.7g vs 14.5g). Autocomplete is fairly high traffic, so we should probably shift the autocomplete traffic to codfw until this can be fixed, which probably requires a rebuild and a couple of hours.
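
A size comparison like the one described above could be sketched as follows. This is an illustrative check only, not the CirrusSearch implementation: the function name `index_size_mismatch` and the 0.8 threshold are assumptions, and the sizes are the two figures quoted in the comment.

```python
def index_size_mismatch(size_a_gb: float, size_b_gb: float,
                        min_ratio: float = 0.8) -> bool:
    """Return True if the smaller index is below `min_ratio` of the larger.

    `min_ratio` is a hypothetical threshold chosen for illustration.
    """
    larger = max(size_a_gb, size_b_gb)
    if larger <= 0:
        return False  # nothing to compare
    return min(size_a_gb, size_b_gb) / larger < min_ratio


# eqiad (6.7g) vs codfw (14.5g): ratio is about 0.46, well under 0.8
print(index_size_mismatch(6.7, 14.5))  # prints True
```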

Thu, Apr 25, 7:25 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
matmarex added a project to T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw: SRE.
Thu, Apr 25, 7:01 PM · CirrusSearch, Discovery-Search (Current work), Patch-For-Review, SRE
Isaac updated subscribers of T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access).

@YLiou_WMF here's the task -- please sign L3

Thu, Apr 25, 6:54 PM · SRE, SRE-Access-Requests
Isaac created T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access).
Thu, Apr 25, 6:51 PM · SRE, SRE-Access-Requests
gerritbot added a comment to T360414: Phase out cergen for Observability services.

Change #1023917 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Ensure TLS certificates are provided by CFSSL

https://gerrit.wikimedia.org/r/1023917

Thu, Apr 25, 6:49 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE
Dzahn added a comment to T362959: Grant Access to NDA for lina.farid.

Thanks! I added Lina to WMF-NDA in Phabricator for access to private tickets.

Thu, Apr 25, 5:50 PM · Patch-For-Review, WMF-NDA-Requests, SRE, LDAP-Access-Requests
KFrancis added a comment to T362959: Grant Access to NDA for lina.farid.

@Dzahn Done!

Thu, Apr 25, 5:42 PM · Patch-For-Review, WMF-NDA-Requests, SRE, LDAP-Access-Requests
Dzahn added a comment to T362959: Grant Access to NDA for lina.farid.

@KFrancis Could you add Lina to the 'NDA and MOU' spreadsheet please? That way Brett can see the email address and we don't run into sync issues like with T358578. Thanks

Thu, Apr 25, 5:35 PM · Patch-For-Review, WMF-NDA-Requests, SRE, LDAP-Access-Requests
gerritbot added a project to T362959: Grant Access to NDA for lina.farid: Patch-For-Review.
Thu, Apr 25, 4:59 PM · Patch-For-Review, WMF-NDA-Requests, SRE, LDAP-Access-Requests
gerritbot added a comment to T362959: Grant Access to NDA for lina.farid.

Change #1024449 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] admin: add Linda Farid to LDAP_only (nda)

https://gerrit.wikimedia.org/r/1024449

Thu, Apr 25, 4:59 PM · Patch-For-Review, WMF-NDA-Requests, SRE, LDAP-Access-Requests
gerritbot added a comment to T363415: upgrade deployment servers to bullseye / add bullseye support to puppet role.

Change #1024447 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] deployment_server: stop including redis::client::python

https://gerrit.wikimedia.org/r/1024447

Thu, Apr 25, 4:42 PM · Patch-For-Review, serviceops, SRE
gerritbot added a comment to T135991: Automated service restarts for common low-level system services.

Change #1024336 merged by Dzahn:

[operations/puppet@production] releases: Enable profile::auto_restarts::service for docker/containerd

https://gerrit.wikimedia.org/r/1024336

Thu, Apr 25, 4:29 PM · Patch-For-Review, Performance-Team (Radar), SRE
BCornwall moved T362959: Grant Access to NDA for lina.farid from Backlog to Awaiting User Input on the LDAP-Access-Requests board.

Hi, @Lina_Farid_WMDE, thanks for signing that. Could you share your email address so I can get a patch in?

Thu, Apr 25, 4:21 PM · Patch-For-Review, WMF-NDA-Requests, SRE, LDAP-Access-Requests
gerritbot added a comment to T363415: upgrade deployment servers to bullseye / add bullseye support to puppet role.

Change #1023954 merged by Dzahn:

[operations/puppet@production] redis: use python3-redis to support bullseye

https://gerrit.wikimedia.org/r/1023954

Thu, Apr 25, 4:12 PM · Patch-For-Review, serviceops, SRE
akosiaris added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Will parsoidtest1001 be installed with Bullseye? scandium is currently running buster, but all the mediawiki manifests are compatible with bullseye (cloudweb already runs it), and so is the component/php74.

Thu, Apr 25, 4:01 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
LSobanski added a comment to T331706: Migrate Mailman/lists to Bullseye/Bookworm.

Updating the host ownership in the Puppet role should also be part of this task.

Thu, Apr 25, 4:00 PM · collaboration-services, Wikimedia-Mailing-lists, SRE
akosiaris updated the task description for T363399: Q4:rack/setup/install parsoidtest1001.
Thu, Apr 25, 4:00 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
akosiaris updated the task description for T363399: Q4:rack/setup/install parsoidtest1001.
Thu, Apr 25, 4:00 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
gerritbot added a comment to T325398: Postfix MTA Profile.

Change #1019131 merged by JHathaway:

[operations/puppet@production] Postfix profile

https://gerrit.wikimedia.org/r/1019131

Thu, Apr 25, 3:51 PM · Patch-For-Review, Infrastructure-Foundations, Mail, SRE
jcrespo added a comment to T361087: backup1005 crashed.

In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.

Thu, Apr 25, 3:38 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
MoritzMuehlenhoff added a comment to T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al


Debian 12 (bookworm) amd64 (Wikimedia edition)

                                              boot: 
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...

Hmm. Not sure if we've seen this problem before. DHCP clearly worked as did the debian image download, but Linux failed to load for some reason.

@jcrespo the only difference was selecting bullseye rather than bookworm on the second attempt?

Yes. Check with @MoritzMuehlenhoff; he did something to fix something, but I'm not sure what, or whether it applies here.

Thu, Apr 25, 3:36 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups