
CDanis (Chris Danis)
SRE @ WMF

User Details

User Since
Nov 5 2018, 2:54 PM (285 w, 3 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF) [ Global Accounts ]

Recent Activity

Yesterday

CDanis added a comment to T363407: Proper service names in trace data.

BTW in case it was not clear, my intentions here are basically:

  • deploy something ASAP (like next week) that everyone is reasonably happy with for the interim
  • don't do anything to get in the way of the badly-needed Envoy upgrade
  • don't break anything else
Thu, Apr 25, 4:23 PM · Observability-Tracing
CDanis added a comment to T363407: Proper service names in trace data.

Thanks for the write-up!
What is not very clear to me is which part of the work would need to be done anyway (in case we had an Envoy version >= 1.24). I'm asking because Envoy 1.23 has been EOL for a year or so, so we need to look at an upgrade anyway.

Thu, Apr 25, 2:19 PM · Observability-Tracing

Wed, Apr 24

Ladsgroup awarded T363407: Proper service names in trace data a Love token.
Wed, Apr 24, 8:06 PM · Observability-Tracing
CDanis added a subtask for T320549: distributed tracing v0 [minimum viable]: T363407: Proper service names in trace data.
Wed, Apr 24, 8:03 PM · Epic, Observability-Tracing
CDanis added a parent task for T363407: Proper service names in trace data: T320549: distributed tracing v0 [minimum viable].
Wed, Apr 24, 8:03 PM · Observability-Tracing
CDanis created T363407: Proper service names in trace data.
Wed, Apr 24, 8:03 PM · Observability-Tracing

Thu, Apr 18

CDanis reopened T360029: Integrate dbctl IP changes as part of VLAN changes. as "Open".
Thu, Apr 18, 2:38 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis reopened T360029: Integrate dbctl IP changes as part of VLAN changes. , a subtask of T354878: Re-IP db servers in codfw row A/B moving to per-rack subnets, as Open.
Thu, Apr 18, 2:37 PM · Data-Persistence, SRE, Infrastructure-Foundations
CDanis closed T360029: Integrate dbctl IP changes as part of VLAN changes. as Resolved.

Anyway I think that all that is needed to unblock VLAN migrations has been done or documented on this ticket? Optimistically closing but please re-open if you disagree.

Thu, Apr 18, 2:20 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis closed T360029: Integrate dbctl IP changes as part of VLAN changes. , a subtask of T354878: Re-IP db servers in codfw row A/B moving to per-rack subnets, as Resolved.
Thu, Apr 18, 2:19 PM · Data-Persistence, SRE, Infrastructure-Foundations
CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

As for the commit, I'd advocate adding dbctl support to Spicerack, but IIRC that requires changes in dbctl, as most of its logic is in its CLI part and not exposed as a library. To be checked.

Thu, Apr 18, 2:19 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis triaged T362893: Spicerack support for dbctl as Low priority.
Thu, Apr 18, 2:17 PM · Infrastructure-Foundations, conftool, Spicerack, SRE-tools
CDanis created T362893: Spicerack support for dbctl.
Thu, Apr 18, 2:16 PM · Infrastructure-Foundations, conftool, Spicerack, SRE-tools

Wed, Apr 17

CDanis added a comment to T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps.

I largely agree with Arzhel's assessment. At a cursory glance, Uruguay or Paraguay look ideal as first candidates.

Wed, Apr 17, 7:12 PM · Infrastructure-Foundations, SRE, Traffic
CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

I think you should be able to use the existing spicerack interface to confctl to do the set/host_ip=... action -- that should be equivalent to a ConftoolEntity.update call.

Wed, Apr 17, 5:01 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

@Marostegui As it turns out, plain old confctl can be used to do this already.

Wed, Apr 17, 4:07 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

Actually, the idea is that dbctl should not contain the IPs at all. It should look up the IP via DNS; we should store the FQDN instead.

Wed, Apr 17, 3:59 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
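
The idea in the comment above (store the FQDN and resolve it through DNS when the address is needed) can be sketched as follows. This is a minimal illustration and not dbctl's actual code: the function name `resolve_fqdn`, the default port 3306, and the example hostname are all assumptions made for this sketch.

```python
import socket
from typing import Callable, List


def resolve_fqdn(
    fqdn: str,
    port: int = 3306,  # assumption: MySQL's default port, used only for illustration
    getaddrinfo: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the sorted, de-duplicated IP addresses DNS reports for fqdn.

    Hypothetical helper: the resolver is injectable so the lookup can be
    exercised without real DNS.
    """
    infos = getaddrinfo(fqdn, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "db.example.org" is a placeholder hostname, not a real instance.
    try:
        print(resolve_fqdn("db.example.org"))
    except socket.gaierror as exc:
        print(f"lookup failed: {exc}")
```

With this shape, a re-IP only requires a DNS change; nothing stored alongside the instance record has to be edited.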

Tue, Apr 16

CDanis updated the task description for T362719: Upgrade Jaeger to 1.56.0 (latest stable).
Tue, Apr 16, 9:16 PM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
CDanis removed a project from T362719: Upgrade Jaeger to 1.56.0 (latest stable): Epic.
Tue, Apr 16, 9:16 PM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
CDanis created T362719: Upgrade Jaeger to 1.56.0 (latest stable).
Tue, Apr 16, 9:16 PM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
CDanis committed rOSCT2474362169a1: force enable etcd v2 proto.
Tue, Apr 16, 3:05 PM

Mon, Apr 15

CDanis committed rOSCTd2ad7ee548ae: add python 3.11.
Mon, Apr 15, 10:40 PM
CDanis committed rOSCTf1dd336c7537: Fix nuisance black diffs.
Mon, Apr 15, 10:40 PM

Thu, Apr 11

CDanis updated the title for P60444 tzdump.py from untitled to tzdump.py.
Thu, Apr 11, 7:07 PM
CDanis created P60444 tzdump.py.
Thu, Apr 11, 6:54 PM

Wed, Apr 10

CDanis created P60266 (An Untitled Masterwork).
Wed, Apr 10, 4:15 PM

Wed, Mar 27

CDanis closed T359413: Miniature images from og:image not loading in social media links as Resolved.
Wed, Mar 27, 7:26 PM · Traffic, PageImages, Regression, WMF-General-or-Unknown
CDanis added a comment to T359413: Miniature images from og:image not loading in social media links.

This has been fixed with this patch, which I forgot to associate with this bug.

Wed, Mar 27, 7:26 PM · Traffic, PageImages, Regression, WMF-General-or-Unknown

Mar 26 2024

CDanis added a member for WMF-NDA: fkaelin.
Mar 26 2024, 2:48 PM
CDanis added a member for WMF-NDA: Pablo.
Mar 26 2024, 2:48 PM

Mar 25 2024

CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

Just to make sure I understand: the request here is for an easy-to-automate way to change the instance IP address in dbctl?

Mar 25 2024, 3:37 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis added a project to T360029: Integrate dbctl IP changes as part of VLAN changes. : conftool.
Mar 25 2024, 3:34 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations

Mar 1 2024

DAlangi_WMF awarded T357050: editResponseTime's port to statslib is not actually backwards-compatible a Barnstar token.
Mar 1 2024, 5:47 PM · MediaWiki-libs-Stats, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13)

Feb 26 2024

CDanis added a comment to T357750: Phase out cergen.

Should this ticket really be "deprecate cergen"? :)

Feb 26 2024, 4:00 PM · Patch-For-Review, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
CDanis added a comment to T358455: Primary outbound port utilisation over 80% alert muted.

This would best be fixed by extending the haproxy bwlim work done in T317799 -- we've talked about having per-ASN limits in addition to the existing and partially-deployed per-file-URI limits.

Feb 26 2024, 3:45 PM · Traffic, Sustainability (Incident Followup), Infrastructure-Foundations, netops
CDanis claimed T358189: aux-k8s cluster prometheus setup is incomplete.
Feb 26 2024, 3:28 PM · Infrastructure-Foundations, Observability-Tracing

Feb 22 2024

CDanis added a comment to T358111: oauth2-proxy config changes don't cause any change in the helm Deployment.

Sent upstream as https://github.com/jaegertracing/helm-charts/pull/541

Feb 22 2024, 10:15 PM · Observability-Tracing, Patch-For-Review
CDanis closed T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow , a subtask of T320549: distributed tracing v0 [minimum viable], as Resolved.
Feb 22 2024, 9:23 PM · Epic, Observability-Tracing
CDanis closed T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow as Resolved.
Feb 22 2024, 9:23 PM · Observability-Tracing
CDanis added a comment to T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow .

As it turns out, this required a change to the upstream chart:

Feb 22 2024, 9:23 PM · Observability-Tracing
CDanis added a parent task for T358111: oauth2-proxy config changes don't cause any change in the helm Deployment: T321211: distributed tracing v1: tech debt blockers.
Feb 22 2024, 12:46 PM · Observability-Tracing, Patch-For-Review
CDanis added a subtask for T321211: distributed tracing v1: tech debt blockers: T358111: oauth2-proxy config changes don't cause any change in the helm Deployment.
Feb 22 2024, 12:46 PM · Observability-Tracing, Epic
CDanis added a subtask for T320549: distributed tracing v0 [minimum viable]: T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow .
Feb 22 2024, 12:45 PM · Epic, Observability-Tracing
CDanis added a parent task for T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow : T320549: distributed tracing v0 [minimum viable].
Feb 22 2024, 12:45 PM · Observability-Tracing

Feb 21 2024

CDanis created T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow .
Feb 21 2024, 9:53 PM · Observability-Tracing
CDanis updated the task description for T358111: oauth2-proxy config changes don't cause any change in the helm Deployment.
Feb 21 2024, 3:01 PM · Observability-Tracing, Patch-For-Review
CDanis created T358111: oauth2-proxy config changes don't cause any change in the helm Deployment.
Feb 21 2024, 2:56 PM · Observability-Tracing, Patch-For-Review

Feb 16 2024

CDanis added a comment to T320555: cas-sso idp for jaeger-ui on k8s.

I've verified that oauth2-proxy will silently just serve plain HTTP if you specify https_address but don't provide it with TLS key material. So I think I've provided it with such in this patch?

Feb 16 2024, 7:49 PM · User-fgiunchedi, Observability-Tracing
CDanis committed rLPRI6635d0265938: Add faux secret for jaeger in idp.
Feb 16 2024, 4:09 PM

Feb 9 2024

CDanis awarded T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) a Love token.
Feb 9 2024, 7:24 PM · Traffic, SRE
CDanis added a comment to T356661: Cross fleet runc upgrades.

All pods on k8s-aux-eqiad restarted, thanks @akosiaris for the script.

Feb 9 2024, 6:10 PM · serviceops

Feb 8 2024

CDanis added a subtask for T354435: 1.42.0-wmf.17 deployment blockers: T357050: editResponseTime's port to statslib is not actually backwards-compatible.
Feb 8 2024, 7:03 PM · User-brennen, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
CDanis added a parent task for T357050: editResponseTime's port to statslib is not actually backwards-compatible: T354435: 1.42.0-wmf.17 deployment blockers.
Feb 8 2024, 7:03 PM · MediaWiki-libs-Stats, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13)
CDanis triaged T357050: editResponseTime's port to statslib is not actually backwards-compatible as High priority.
Feb 8 2024, 7:03 PM · MediaWiki-libs-Stats, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13)

Feb 7 2024

CDanis added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

Per docs, Thanos supports logging when a query is received but before it begins execution:

Feb 7 2024, 4:19 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Sustainability (Incident Followup), SRE, observability

Feb 6 2024

Lens0021 awarded T276486: gerrit's sshd is incompatible with RSA pubkeys + Fedora 33 clients (and future versions of OpenSSH proper) a Party Time token.
Feb 6 2024, 9:49 AM · Gerrit (Gerrit 3.6), Upstream

Feb 5 2024

CDanis created P56252 (An Untitled Masterwork).
Feb 5 2024, 5:43 PM

Jan 29 2024

CDanis claimed T332024: GeoIP mapping experiments.
Jan 29 2024, 4:20 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic
CDanis claimed T342624: NetworkProbeLimit cookie should set samesite attribute.
Jan 29 2024, 4:19 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic
CDanis triaged T349807: NEL: don't alert on domains we don't control as Medium priority.
Jan 29 2024, 3:41 PM · SRE, Infrastructure-Foundations, Traffic
CDanis updated subscribers of T329331: create a puppetized abstraction for haproxy blocklist hysteresis.

@Fabfur just wanted to make sure you've seen this task; it is decent documentation of the existing mechanism and probably helpful for doing T353910.

Jan 29 2024, 3:40 PM · SRE, Traffic

Jan 24 2024

CDanis closed T266783: move tunnelencabulator's repo to a Wikimedia-owned space as Resolved.

The script was added to the wmf-sre-laptop package in May 2023 with this commit

Jan 24 2024, 5:26 PM · Infrastructure-Foundations
CDanis closed T266783: move tunnelencabulator's repo to a Wikimedia-owned space, a subtask of T244761: Script to point SRE local machine traffic to another LB, as Resolved.
Jan 24 2024, 5:26 PM · SRE
CDanis claimed T355750: CFSSL gencert "remote error: tls: certificate require".
Jan 24 2024, 4:13 PM · CFSSL-PKI, Infrastructure-Foundations

Jan 22 2024

CDanis closed T337318: decide on an aggregation function to combine multiple probes into a single measurement, a subtask of T332024: GeoIP mapping experiments, as Resolved.
Jan 22 2024, 4:11 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic
CDanis closed T337318: decide on an aggregation function to combine multiple probes into a single measurement as Resolved.
Jan 22 2024, 4:11 PM · SRE, Traffic, Infrastructure-Foundations
CDanis closed T336947: Mapping Client IPs to Resolver IPs as Declined.

Probably the "best" way to solve this is via the Alt-Svc mechanism, which Traffic means to experiment with at some point (T208242: Investigate using RFC 7838 Alternate Services to better optimize edge connections). That's something we want to do anyway, independent of this issue, and it will also require a lot less new infrastructure work to support than the other alternatives here.

Jan 22 2024, 3:56 PM · SRE, Infrastructure-Foundations, Traffic
CDanis closed T336947: Mapping Client IPs to Resolver IPs, a subtask of T332024: GeoIP mapping experiments, as Declined.
Jan 22 2024, 3:56 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic
CDanis claimed T266783: move tunnelencabulator's repo to a Wikimedia-owned space.
Jan 22 2024, 3:50 PM · Infrastructure-Foundations

Dec 15 2023

CDanis added a comment to T351566: enable tracing on mwdebug hosts.

Hey,

So, a couple of questions:

  • profile::opentelemetry::collector has two optional parameters, $otel_gateway_fqdn and $otel_gateway_otlp_port. Looking at the puppet code, if we don't supply these, we won't be configuring an otlp exporter (the otlp receiver will be enabled regardless, and it is orthogonal to the exporter anyway). Are there plans to enable it in the near future?
Dec 15 2023, 9:23 PM · Patch-For-Review, Observability-Tracing

Dec 11 2023

CDanis created P54332 (An Untitled Masterwork).
Dec 11 2023, 7:53 PM
CDanis updated subscribers of T324020: Load IP ranges in reverse-proxy.php from Netbox/Puppet network module.

hi serviceops, any plans to work on this soon? I/F would be happy to help with an implementation but we kind of want serviceops to figure out the right approach. cc @Kappakayala

Dec 11 2023, 3:52 PM · serviceops

Dec 7 2023

CDanis closed T335637: Set cookie in Varnish to start a probe, a subtask of T332024: GeoIP mapping experiments, as Resolved.
Dec 7 2023, 3:58 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic
CDanis closed T335637: Set cookie in Varnish to start a probe as Resolved.
Dec 7 2023, 3:58 PM · Infrastructure-Foundations, Traffic

Nov 30 2023

CDanis added a comment to T352444: CirrusSearch generates a massive amount of "poolcounter-connection-error" messages.

Likely offending patch identified and is being reverted https://gerrit.wikimedia.org/r/c/mediawiki/core/+/979079

Nov 30 2023, 4:04 PM · Beta-Cluster-reproducible, MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), Discovery-Search, CirrusSearch, Wikimedia-production-error

Nov 29 2023

CDanis added a comment to T347565: Switch rsyslog to use the new PKI infrastructure.

Conclusion at end of meeting was that o11y would migrate the base profile to use the new cfssl support ~next week

Nov 29 2023, 4:31 PM · Observability-Logging, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE

Nov 28 2023

CDanis added a comment to T352182: 503 Service Unavailable (all Wikimedia sites).

Users are no longer impacted.

Nov 28 2023, 4:05 PM · Wikimedia-Incident, SRE
CDanis added a comment to T349244: Q1:Install cp11[00-15] and rotate into production.

Looping in @CDanis as the original author for the cp1075 hiera overrides.
Do you think we can safely remove them, or do we need to apply the same hiera data to other (new) cp hosts?

Nov 28 2023, 2:47 PM · ops-eqiad, DC-Ops, Traffic, SRE

Nov 17 2023

CDanis created T351567: enable tracing on mw-on-k8s debug pods.
Nov 17 2023, 8:05 PM · Observability-Tracing
CDanis added a subtask for T320549: distributed tracing v0 [minimum viable]: T351566: enable tracing on mwdebug hosts.
Nov 17 2023, 8:03 PM · Epic, Observability-Tracing
CDanis added a parent task for T351566: enable tracing on mwdebug hosts: T320549: distributed tracing v0 [minimum viable].
Nov 17 2023, 8:02 PM · Patch-For-Review, Observability-Tracing
CDanis created T351566: enable tracing on mwdebug hosts.
Nov 17 2023, 8:01 PM · Patch-For-Review, Observability-Tracing

Oct 30 2023

CDanis reopened T285347: Improve automation of CirrusSearch caches during database switchover as "Open".
Oct 30 2023, 4:18 PM · CirrusSearch, Discovery-Search, Datacenter-Switchover

Oct 27 2023

CDanis added a comment to T340573: Add support for request tracing to WikimediaDebug browser extension.

Thanks so much @pmiazga ! This is great to see and the spec chosen for the header sounds good.

Oct 27 2023, 7:22 PM · MW-1.41-notes (1.41.0-wmf.27; 2023-09-19), MediaWiki-Platform-Team, WikimediaDebug, Observability-Tracing

Sep 27 2023

CDanis edited projects for T344171: Reverse DNS for k8s pods IPs, added: serviceops; removed SRE.
Sep 27 2023, 3:46 PM · serviceops, Prod-Kubernetes, Kubernetes

Sep 26 2023

CDanis created T347430: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query.
Sep 26 2023, 9:00 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Data-Platform-SRE, SRE Observability
CDanis triaged T344171: Reverse DNS for k8s pods IPs as Low priority.
Sep 26 2023, 6:39 PM · serviceops, Prod-Kubernetes, Kubernetes
CDanis added a comment to T347416: File uploads to Commons mostly not working.

This was very very likely a result of https://www.wikimediastatus.net/incidents/jrtsgd7jcpbl

Sep 26 2023, 5:49 PM
CDanis closed T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team., a subtask of T340648: [Airflow] Setup Airflow instance for WMDE, as Resolved.
Sep 26 2023, 2:33 PM · Patch-For-Review, Data-Platform-SRE (2024.01.01 - 2024.01.21)
CDanis closed T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team., a subtask of T342331: [EPIC] Set up a sustainable tech stack for Wikidata Analytics, as Resolved.
Sep 26 2023, 2:33 PM · Wikidata Analytics (Kanban), Wikidata, Epic
CDanis closed T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. as Resolved.

Will be live in half an hour.

Sep 26 2023, 2:32 PM · Data-Platform-SRE, SRE, SRE-Access-Requests
CDanis added a comment to T345726: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team..

We discussed this in our Monday I/F meeting and approved it.

Sep 26 2023, 2:31 PM · Data-Platform-SRE, SRE, SRE-Access-Requests

Sep 25 2023

CDanis reassigned T342968: Requesting access to releasers-wikibase for darthmon_wmde from Eevans to darthmon_wmde.
Sep 25 2023, 5:42 PM · SRE, SRE-Access-Requests
CDanis added a project to T347318: db2109 crashed: ops-codfw.

Nothing recent in the SAL.

Sep 25 2023, 4:22 PM · ops-codfw, DBA
CDanis closed T342588: Requesting access to analytics-privatedata-users for Nat Hillard as Resolved.

Hi Issac, sorry this slipped through SRE's process as well -- this should have been taken care of last week.

Sep 25 2023, 3:58 PM · SRE, SRE-Access-Requests
CDanis changed the status of T342588: Requesting access to analytics-privatedata-users for Nat Hillard from Stalled to In Progress.
Sep 25 2023, 2:37 PM · SRE, SRE-Access-Requests

Sep 18 2023

CDanis closed T252890: scrape ripe atlas data for a few anchors at other large networks as Declined.

@CDanis Is that still needed now that we have NEL?

Sep 18 2023, 2:00 PM · Infrastructure-Foundations, netops, SRE

Sep 12 2023

DAlangi_WMF awarded T344926: Propagate x-request-id header from MultiHttpClient in MediaWiki (e.g. SessionStore) a Party Time token.
Sep 12 2023, 9:31 PM · MW-1.41-notes (1.41.0-wmf.27; 2023-09-19), MediaWiki-libs-BagOStuff, MediaWiki-libs-HTTP, MediaWiki-Platform-Team

Aug 25 2023

CDanis updated subscribers of T320563: our various Envoys are configured to report traces to local OpenTelemetry Collector.

The good news: OpenTelemetry tracing support exists as of our currently-deployed version of Envoy (v1.23.10): https://www.envoyproxy.io/docs/envoy/v1.23.10/api-v3/config/trace/v3/opentelemetry.proto.html

Aug 25 2023, 6:30 PM · User-fgiunchedi, Observability-Tracing
CDanis updated the title for P51432 bulk querying Thanos for cpu frequency <= 200 MHz in the past month from untitled to bulk querying Thanos for cpu frequency <= 200 MHz in the past month.
Aug 25 2023, 5:00 PM