
faidon (Faidon Liambotis)
User

Projects (12)

User Details

User Since
Oct 7 2014, 10:21 AM (497 w, 3 d)
Availability
Available
IRC Nick
paravoid
LDAP User
Faidon Liambotis
MediaWiki User
Unknown

Recent Activity

Nov 28 2022

Volker_E awarded T111653: Encrypt all the things a Love token.
Nov 28 2022, 3:07 PM · Epic, SRE

Nov 2 2022

faidon updated faidon.
Nov 2 2022, 12:09 PM

Aug 2 2022

faidon renamed T309029: eqiad/codfw: 1xVM per site for Netbox from Eqiad and codfw 1xVM per site for netboix to eqiad/codfw: 1xVM per site for Netbox.
Aug 2 2022, 10:03 AM · vm-requests, Infrastructure-Foundations, SRE

Jun 18 2022

Aklapper awarded T122144: Move most (all?) exim personal aliases to WMF ITS a Yellow Medal token.
Jun 18 2022, 10:31 AM · Infrastructure-Foundations, Epic, Mail, SRE

May 6 2022

faidon added a comment to T307026: decommission atlas-esams.

So first of all, when I look into this (#6671) Atlas' probe page, I see this:

The LIR us.wmf has shared administration of this probe.

The management of this probe is allowed for the following individuals:

  • Arzhel Younsi
  • Cathal Mooney

Which I believe means that we have two separate access-rights delegations: an org-based one, and an "Other users" one (with you two being explicitly assigned rights to it). So, the hope is that you should have all the access rights that I have :)

May 6 2022, 1:17 PM · DC-Ops, SRE, ops-esams, decommission-hardware, Infrastructure-Foundations

May 1 2022

Krinkle awarded T117508: Make ops-l a list for humans again (no cheating) an Orange Medal token.
May 1 2022, 10:22 PM · SRE

Mar 1 2022

faidon added a comment to T302617: Domain Ownership Verification on Various Search Properties.

You're absolutely right to be concerned about traffic from search engines. That said, I'm familiar enough with how this works to be comfortable owning it, and my PM counterpart and I (and a half dozen or so other people at the Foundation including @AndyRussG) gaze at this data often enough that we'd know if something were amiss, and we'll obviously be watchful if we decide to make any changes at all.

Mar 1 2022, 11:31 AM · SRE-Unowned, User-AKlapper, SRE
faidon added a comment to T298166: Work out a strategy on Yandex's Turbo Pages.

Seek to opt-out via Yandex' webmaster tools. I have no idea how to get access to this but presumably we could work it out.

Mar 1 2022, 11:22 AM · Privacy Engineering, Performance-Team (Radar), Privacy, Product-Analytics

Feb 14 2022

faidon added a comment to T301110: Ingest webrequest sampled 1000 into logstash.

My (perhaps dated or incorrect) understanding is that:

  1. We currently have no RBAC in Logstash;
  2. Everyone in the "NDA" group has access to all data stored in Logstash;
  3. Access to access logs in general is more restricted, to a subset of NDA users, to the analytics-privatedata group (membership managed by the D/E team);
  4. sampled-1000 is a subset of access logs, available on the centrallog hosts, to which only ops/roots have access (so even more restricted)
Feb 14 2022, 2:41 PM · SRE, Observability-Logging

Jan 25 2022

faidon added a comment to T298723: Bing Webmaster Tools access request for Andrew Green.

Hi @AndyRussG - you mentioned that "[Bing] has an option to import domain verifications from Google Search Console"; is there another option, such as doing the Bing domain verification separately from anything Google-related? That would be preferable I think. Otherwise, it sounds like this may have the potential to share non-public data that Google has for our properties with Microsoft, and therefore I think the most prudent course would be to ask the Legal/Privacy team to evaluate and clear this ask. Hope that makes sense - thanks!

Jan 25 2022, 6:30 PM · Search-Console-access-request, SRE

Jan 3 2022

faidon added a comment to T286898: Setup new mirror server (mirror1001.wikimedia.org).

Not sure if this has been flagged by anyone else or considered but note that our mirror is an official mirror for Debian, Ubuntu and Tails. For at least Debian, sodium's IPs are in the ftp.us.debian.org rotation (and thus has to be an A/AAAA rather than a CNAME). I still see sodium's IPs there. I think Debian has some automated machinery to update these IPs but I'm not sure what triggers it - so be careful when turning off sodium. We're also a push mirror, which means that Debian's infrastructure triggers an update through SSH; not sure if this works yet?

Jan 3 2022, 9:00 PM · Infrastructure-Foundations, SRE

Dec 3 2021

faidon added a comment to T297017: MX record issue on mx2001.wikimedia.org.

@eliza we're looking into this - next update in 15mins.

Dec 3 2021, 11:59 PM · Infrastructure-Foundations, Mail, SRE

Nov 14 2021

faidon added a comment to T295650: cr1-eqiad -> Charter/AS7843 connectivity is broken.

There is definitely a noticeable difference in traffic patterns from Nov 4th or so:

Screenshot 2021-11-14 at 13-55-31 Turnilo (1 29 0).png (684×992 px, 57 KB)

Nov 14 2021, 12:05 PM · SRE, Infrastructure-Foundations, netops
faidon updated subscribers of T295650: cr1-eqiad -> Charter/AS7843 connectivity is broken.

I disabled the Equinix IXP port on cr1-eqiad, xe-3/0/6, just a few moments ago, in order to mitigate this issue. Checked with @ayounsi on IRC first, who is now aware of this task.

Nov 14 2021, 11:54 AM · SRE, Infrastructure-Foundations, netops
faidon triaged T295650: cr1-eqiad -> Charter/AS7843 connectivity is broken as High priority.
Nov 14 2021, 11:35 AM · SRE, Infrastructure-Foundations, netops

Aug 27 2021

faidon changed the status of T187929: Cloud IPv6 subnets from Open to Stalled.

There are some ongoing conversations with the WMCS team regarding the placement of their infrastructure in our network/infrastructure, and I think it would be good to resolve that first, before moving forward on implementing this. Setting this to Stalled - hope that makes sense!

Aug 27 2021, 7:04 PM · Infrastructure-Foundations, SRE, netops
faidon changed the status of T187929: Cloud IPv6 subnets, a subtask of T245495: CloudVPS: IPv6 early PoC, from Open to Stalled.
Aug 27 2021, 7:03 PM · cloud-services-team, Infrastructure-Foundations, SRE, netops

Jul 2 2021

faidon added a comment to T232343: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab).

Lists: Lists/mailman has an internet-facing exim instance, separate from the mx cluster. We could front this with postfix near-term, using much the same approach as described for inbound mail, possibly the same hosts even. Longer-term we could migrate the inbound/outbound lists routing configuration into postfix, and optionally integrate lists into the future production mx-(in|out) clusters described above.

Jul 2 2021, 5:23 PM · Infrastructure-Foundations, User-MoritzMuehlenhoff, Mail, SRE

Jun 30 2021

faidon committed rOSNE08464857083e: Fix the wording on some of the reports output.
Jun 30 2021, 7:55 AM
faidon committed rOSNE0ea3c441233d: Use allowlist/blocklist instead of whitelist/blacklist.
Jun 30 2021, 7:53 AM

Jun 29 2021

faidon added a comment to T244849: Add SSO support to netbox.

Thank you @jbond for picking this up and shepherding it - appreciate it!

Jun 29 2021, 7:25 PM · Infrastructure-Foundations, User-jbond, Patch-For-Review, netbox, SRE

Jun 26 2021

faidon updated subscribers of T285539: Easing pain points caused by divergence between cloudservices and production puppet usecases.

Thank you @jbond for raising this topic!

Jun 26 2021, 9:18 PM · Puppet-Core, cloud-services-team, Patch-For-Review, User-jbond, Infrastructure-Foundations, Cloud-VPS, Cloud Services Proposals

Jun 24 2021

faidon added a member for Infrastructure-Foundations: faidon.
Jun 24 2021, 12:27 PM
faidon added a comment to T187929: Cloud IPv6 subnets.

Prioritization-wise, is there a reason why we're going for an IPv6 allocation while our IPv4 segmentation is still in flux or in progress? I fear that we're adding more features/problems to the mix without having set and implemented clear boundaries first, and making an already complex situation more complex (e.g. more filters to maintain) so I'd like to hear more about those trade offs and perhaps wait.

Jun 24 2021, 12:12 PM · Infrastructure-Foundations, SRE, netops

Jun 23 2021

faidon removed a member for WMF-NDA-Requests: crusnov.
Jun 23 2021, 1:49 PM
faidon removed a member for netbox: crusnov.
Jun 23 2021, 1:49 PM
faidon removed a member for Trusted-Contributors: crusnov.
Jun 23 2021, 1:49 PM
faidon removed a watcher for SRE-tools: crusnov.
Jun 23 2021, 1:48 PM
faidon removed a member for SRE-tools: crusnov.
Jun 23 2021, 1:48 PM

Jun 11 2021

faidon added a comment to T284614: Netbox: define strategy to track standard server configurations.

++ @faidon, who might be able to provide more feedback on this.

Jun 11 2021, 4:05 PM · Infrastructure-Foundations, netbox

Jun 1 2021

faidon added a comment to T275852: Investigate potential issues with the sudoeres env_keep values.

If you're talking about my 2014 commit… if I recall correctly¹ this was in order to minimize changes between different distributions and enforce a unified policy (this was part of a larger patch series to put some structure around sudoers). I opted into env_keep because that's what folks were most used to and it was "secure enough". I don't have an opinion these days on whether it should be removed or not :)

Jun 1 2021, 5:55 PM · User-jbond, Security, SRE
faidon updated the task description for T275852: Investigate potential issues with the sudoeres env_keep values.
Jun 1 2021, 5:52 PM · User-jbond, Security, SRE

May 26 2021

faidon placed T200277: OSPF metrics up for grabs.
May 26 2021, 12:37 PM · Infrastructure-Foundations, SRE, netops
faidon placed T189522: Detect IP address collisions up for grabs.
May 26 2021, 12:37 PM · Infrastructure-Foundations, SRE, netops
faidon updated the task description for T283230: Move SRE-related IRC channels to Libera.
May 26 2021, 12:19 PM · wikimedia-irc-libera, SRE

May 20 2021

faidon triaged T274234: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 as High priority.

Given a) this was linked during budgeting in the context of our cross-DC bandwidth and for a substantial amount of cost b) off{site,line} backups are one of our priorities, I'm setting the priority of this task to High and asking our netops folks to have a look, Cc @joanna_borun.

May 20 2021, 3:24 PM · Infrastructure-Foundations, bacula, netops, SRE, Data-Persistence-Backup

Apr 19 2021

faidon added a comment to T280473: mail.wikimedia.org doesn't redirect to lists.wikimedia.org.

I killed that domain in 2014 (operations/dns 3a7f472cb3e9bcd03f0492cfdd8c0a2156f448d3). No one has complained since to my knowledge, and I'd recommend not reintroducing this redirect at this point. It was confusing to begin with: before that transition the main mail exchangers and the mailing list service were all on the same box; these days they are (thankfully) separate, but the side-effect is that "mail" as a label is much more ambiguous. HTH!

Apr 19 2021, 1:26 PM · SRE, Wikimedia-Mailing-lists
faidon added a comment to T273114: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org.

@CDanis could you look at this soon? Thanks!

Apr 19 2021, 10:53 AM · netops, Infrastructure-Foundations, SRE, DC-Ops

Apr 16 2021

faidon added a comment to T276473: Proposed changes to the SRE Access request (Phabricator form 8).

SGTM :)

Apr 16 2021, 11:36 AM · Phabricator

Mar 17 2021

faidon assigned T277705: Netbox: convert validation reports into field validation to crusnov.

@crusnov maybe you can have a look?

Mar 17 2021, 9:14 PM · netbox

Mar 16 2021

faidon added a comment to T275711: Grant wmopbot +o permissions in #wikimedia-operations IRC channel.

I think I've implemented this -- it's been a while :)

Mar 16 2021, 12:02 PM · SRE, wikimedia-irc-libera, SRE-Access-Requests

Mar 5 2021

faidon added a comment to T276443: Formalize and share the spicerack/cumin release process.

Thanks for the explanation, but there are several points here that still need discussion.

Mar 5 2021, 1:18 PM · Cumin, SRE-tools, Infrastructure-Foundations, Spicerack

Mar 4 2021

faidon added a comment to T276440: wmcs.spicerack: Setup a host to run cookbooks from prod network.

(I'd suggest to focus on the nitty-gritty like SSH keys later -- I'm not the right person to ask for these either :)

Mar 4 2021, 1:45 PM · cloud-services-team, Infrastructure-Foundations, Spicerack, SRE-tools
faidon lowered the priority of T276443: Formalize and share the spicerack/cumin release process from Medium to Low.

Judging from the last two lines of that transcript, I've been summoned :)

Mar 4 2021, 12:47 PM · Cumin, SRE-tools, Infrastructure-Foundations, Spicerack
faidon added a comment to T276440: wmcs.spicerack: Setup a host to run cookbooks from prod network.

Could you clarify the scope between:

  1. production hosts that currently have WMCS as the service team (cloudvirt, cloudcephosd, etc.)
  2. Cloud VPSes that the WMCS team currently semi-manages (i.e. that have other roots, possibly custom puppetmasters etc.)
  3. Cloud VPSes that the WMCS team is currently managing fully (operates config mgmt such as the puppetmaster), not necessarily exclusively (e.g. I think Toolforge has additional admins)
Mar 4 2021, 12:14 PM · cloud-services-team, Infrastructure-Foundations, Spicerack, SRE-tools

Mar 3 2021

faidon added a comment to T267714: ripe-atlas-codfw is down.

I believe the Atlas is a PCEngines APU, so you'll need a null modem cable or adapter (RXD->TXD, TXD->RXD, etc.). If this is a Cisco rollover cable, it would do the trick, but your DB9<->RJ45 adapter should not be a crossover adapter, as that would cross the signals over twice end-to-end and the two crossovers would cancel each other out :)
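
The "two crossovers cancel out" reasoning can be illustrated with a tiny sketch. It models only the TXD/RXD pair and treats every crossing element (a rollover cable or a crossover adapter) as one swap; the function and segment names are hypothetical and purely illustrative.

```python
# Hypothetical model: each crossing element swaps TXD and RXD once.
SWAP = {"TXD": "RXD", "RXD": "TXD"}


def far_end_pin(pin, segments):
    """Follow `pin` through the chain; return the pin it lands on at the far end."""
    for segment in segments:
        if segment == "crossover":
            pin = SWAP[pin]
        # a "straight" segment leaves the pin unchanged
    return pin


def is_null_modem(segments):
    """A working null modem link delivers our TXD to the peer's RXD."""
    return far_end_pin("TXD", segments) == "RXD"
```

With a single crossing element the link behaves as a null modem; with two, TXD ends up wired to TXD again, which is the situation the comment warns about.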

Mar 3 2021, 11:43 AM · Infrastructure-Foundations, SRE, netops

Mar 1 2021

faidon closed T256628: use CAS-SSO for icinga.wikimedia.org authentication as Resolved.
Mar 1 2021, 8:27 PM · Icinga, CAS-SSO

Feb 13 2021

faidon added a comment to T273734: consider storing information on cloud NAT mappings.

To clarify the task's scope here, and the need from a network operations angle: as a service provider, providing effectively unrestricted IPv4 connectivity from our public cloud to the rest of the internet we need, for various reasons, the ability to identify and/or block the source of traffic in e.g. an incoming third-party report or request, and to be able to do so retroactively with timestamps into the past as well. (This is not a new requirement, nor the result of recent changes in cloud networking -- just something we're overdue for).

Feb 13 2021, 9:49 AM · cloud-services-team, Cloud-VPS

Jan 19 2021

faidon changed the status of Unknown Object (Task), a subtask of T270704: cloud: introduce new edge network architecture for eqiad1 and codfw1dev, from Open to Stalled.
Jan 19 2021, 11:00 AM · Patch-For-Review, cloud-services-team (Kanban)

Dec 7 2020

faidon added a comment to T267376: Set up IP addresses for the new wiki replicas setup.

It feels like there are multiple issues being discussed here, so perhaps it's worth breaking this down and talking about some of these issues separately? The last few comments seem to be about the IP numbering and assignment issue, so I'll focus on that below.

Dec 7 2020, 10:55 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)

Dec 5 2020

faidon reopened T269467: Upgrade nic firmware on cloudvirt1023 as "Open".

It turns out this has a weird firmware version because it's weird hardware

robh: this is a qlogic 41112 sfp adapter which is not quite the same as others

Given that it isn't broken, let's leave it as it is.

Dec 5 2020, 12:09 AM · cloud-services-team (Kanban)

Dec 4 2020

faidon added a comment to T269313: cloudvirt10[25-30] connection issues on primary nic.

OK, to add a little more color:

  • The VLAN configuration is not important. brctl addif brq7425e328-56 eno2np1 is enough to reproduce this behavior.
  • I was wondering why the bridge would matter (thinking hwmode/EVB etc. originally). I had tried setting promisc mode to no effect, but with a clearer mind this morning, I tried promisc + down/up and managed to reproduce it, without a bridge being involved. ip link set promisc on dev eno2np1; ip link set down dev eno2np1; ip link set up eno2np1 reproduces it, and ip link set promisc off dev eno2np1 restores connectivity.
Dec 4 2020, 9:28 AM · ops-eqiad, DC-Ops, SRE, Epic, cloud-services-team (Kanban)

Dec 3 2020

faidon added a comment to T269313: cloudvirt10[25-30] connection issues on primary nic.

Arzhel nerd-sniped me with this.

Dec 3 2020, 11:01 PM · ops-eqiad, DC-Ops, SRE, Epic, cloud-services-team (Kanban)

Dec 2 2020

faidon assigned T222931: Netbox Reports Ideas and Requests to crusnov.
Dec 2 2020, 6:22 PM · Infrastructure-Foundations, netbox, User-crusnov, SRE-tools

Nov 25 2020

faidon updated the task description for T205897: Netbox: fill network topology.
Nov 25 2020, 2:27 PM · Infrastructure-Foundations, netbox, SRE

Nov 23 2020

faidon added a comment to T267714: ripe-atlas-codfw is down.

Thanks - can you file a procurement request to that effect (& then resolve this task)?

Nov 23 2020, 4:23 PM · Infrastructure-Foundations, SRE, netops
faidon reopened T175876: document all scs connections as "Open".

Per @ayounsi above, "Last missing info is cable IDs". I don't see that as having taken place yet, right? The Cables report is even emitting soft-warnings about it (warnings that we should convert to errors once this work completes). Reopening the task, as it was probably resolved by mistake.

Nov 23 2020, 7:47 AM · ops-eqiad, DC-Ops, SRE
faidon reopened T175876: document all scs connections, a subtask of T175625: scs-c1-eqiad unresponsive, as Open.
Nov 23 2020, 7:47 AM · ops-eqiad, DC-Ops, SRE

Oct 22 2020

faidon edited P13050 TCP flags combinations for Turnilo's map.
Oct 22 2020, 1:13 PM · netops, SRE
faidon created P13050 TCP flags combinations for Turnilo's map.
Oct 22 2020, 11:06 AM · netops, SRE

Oct 19 2020

faidon added a comment to T263290: Turnilo: per-second rates for wmf_netflow bytes + packets.

Yay, that's awesome! You can't imagine how much time this would save!

Oct 19 2020, 9:27 AM · Analytics-Kanban, Analytics, netops, Traffic, SRE

Oct 16 2020

faidon updated subscribers of T265393: eqiad: Netbox Error for asw2-d4-eqiad.

From the Netbox changelog ("Changelog" tab on the device) it looks like some changes were made on September 28th by @Cmjohnson and later one change on Oct 6th by @wiki_willy. Specifically:

Oct 16 2020, 1:11 PM · SRE, ops-eqiad, DC-Ops

Sep 24 2020

faidon added a comment to T263277: Collect netflow data for internal traffic.

I wonder what kind of ASN these flows would show up as (esp. with confederations!), as well as whether we could have a dimension to be able to differentiate between internet traffic and backhaul traffic. We'd also need a dimension of "site" to be able to filter or slice for traffic from esams to eqiad like the parent task required, right? Also see T254332, which also makes me wonder whether adding all of these different dimensions is going to start being a problem :)

Sep 24 2020, 3:27 PM · Data-Engineering-Kanban, Patch-For-Review, Data-Engineering, Traffic-Icebox, Infrastructure-Foundations, netops, SRE

Sep 21 2020

faidon added a comment to T263212: Consider balancing VRRP primaries to cr1/cr2.

BTW, one dangerous impact of this (as with all ECMP!) is that it would be harder to notice a situation where we don't have enough capacity to carry regular amounts of traffic when one of the paths is down for whatever reason. We could perhaps mitigate this by tuning our monitoring to alert on 40-50% utilization, at least for the common cases of link redundancy (codfw/eqdfw, eqiad/codfw). So this will still get us extra capacity for "abnormal" conditions (like edge in eqiad but MW & Swift in codfw etc.) but still alert us to the situation where we don't have enough capacity for normal levels of traffic.
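
The arithmetic behind the 40-50% figure can be sketched as follows, assuming equal-capacity paths and perfectly even load sharing (real ECMP hashing is rarely that even, which is one reason to alert a bit below the theoretical limit). The function names are hypothetical.

```python
from typing import Sequence


def alert_threshold(paths: int, failures: int = 1) -> float:
    """Highest per-path utilization (0..1) at which the total load still
    fits on the surviving paths after `failures` paths go down."""
    if paths <= failures:
        raise ValueError("need at least one surviving path")
    return (paths - failures) / paths


def survives_failure(utilizations: Sequence[float], failures: int = 1) -> bool:
    """True if the combined load fits on the remaining equal-capacity paths."""
    total_load = sum(utilizations)           # in units of one path's capacity
    surviving = len(utilizations) - failures
    return total_load <= surviving
```

For two paths the limit is 50% per path: two links at 45% each still fit on one link, while two links at 60% each do not.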

Sep 21 2020, 1:15 PM · SRE, netops

Sep 17 2020

faidon triaged T263212: Consider balancing VRRP primaries to cr1/cr2 as Medium priority.
Sep 17 2020, 11:15 PM · SRE, netops
faidon added a comment to T260363: Standardize VRRP group IDs.

SGTM!

Sep 17 2020, 11:00 PM · Infrastructure-Foundations, SRE, netops

Sep 16 2020

faidon added a comment to T261145: Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts.

Hey - this was brought to my attention, and we discussed it today at the I/F meeting. The outcome of our conversation was that @Volans and @jbond will do a final review pass and merge r621343 ~by the end of this week.

Sep 16 2020, 5:53 PM · SRE-Access-Requests, Data-Services, SRE, cloud-services-team (Kanban)

Sep 14 2020

faidon added a comment to T250053: Netbox report accounting icinga alert.

In general, I haven't been a big fan of how the Netbox errors are reported. An onsite engineer could install a bunch of new hardware one day, not have enough time to check the Netbox reports before they leave, and a week goes by before their next trip onsite...where they end up prioritizing other new tasks over fixing the error. Or if they're already at home updating the Netbox entries, but have to be onsite to verify a mismatch, it also gets pushed to the backburner, as other priorities pop up during their next site visit.

Sep 14 2020, 1:45 PM · ops-eqiad, DC-Ops, SRE

Sep 11 2020

faidon added a comment to T250053: Netbox report accounting icinga alert.

Broadly speaking:

  • We shouldn't have outstanding alerts open (or even acknowledged) for more than a few days. If there is an alert, it means there is an abnormal condition that requires fixing. If the issues require a significant amount of work to address, then a task should be created and the alert acknowledged with the task in the comment while it's getting fixed. I'd expect the DC Ops teams to be primary for such alerts and act on them, but also everyone in SRE is expected to triage alerts and reach out to owners and file tasks about them (like @ayounsi did here).
  • If there are false positives often, then this is something that we should fix. We probably need one or more separate tasks for this, describing the conditions under which an alert is triggered erroneously, so that we can fix this. I'd expect the DC Ops team to be filing this task, and I/F to change the report to meet the adjusted needs.
  • The test_missing_assets_from_accounting report is already (and has always been) ignoring discrepancies for items where the purchase date is in the last 90 days. This is configurable and we can tune it further to some other value, but it was picked as long enough for accounting to process invoices, yet not so long that the purchase has fallen out of memory (or the vendor engagement is over, the team has changed, etc.). If there is a persistent backlog in Finance >90d it'd be good to know and adjust.
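
The grace-period behavior described in the last bullet could look roughly like the sketch below. This is not the actual Netbox report code; the data shapes (a mapping of asset ID to purchase date, a set of IDs known to accounting) and all names are assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumed default, matching the 90 days mentioned in the comment; tunable.
GRACE_PERIOD = timedelta(days=90)


def missing_from_accounting(assets, accounted_ids, today, grace=GRACE_PERIOD):
    """Return IDs of assets absent from accounting, ignoring recent purchases.

    assets: dict mapping asset ID -> purchase date (hypothetical shape)
    accounted_ids: set of asset IDs present in the accounting data
    """
    cutoff = today - grace
    return [
        asset_id
        for asset_id, purchased in assets.items()
        if asset_id not in accounted_ids and purchased <= cutoff
    ]
```

An asset bought within the grace period is skipped even if accounting has not recorded it yet; once it is older than the cutoff, the discrepancy is reported.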
Sep 11 2020, 11:19 AM · ops-eqiad, DC-Ops, SRE

Sep 7 2020

faidon added a comment to T237492: Create a second text-lb IP address for test purposes.

@BBlack @ayounsi I think this is done and can be resolved, right? Anything left here?

Sep 7 2020, 12:04 PM · Traffic-Icebox, SRE
faidon added a comment to T245161: Track down and replace very old HW.

@jcrespo & @akosiaris may I ask you to figure this out in a different task? This is a generic task about dozens of servers, so by discussing details about a couple of them we're going to lose the bigger picture :)

Sep 7 2020, 10:15 AM · Patch-For-Review, DC-Ops

Aug 18 2020

faidon added a comment to T225121: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300.

Ping? Besides the issues identified by @ayounsi just above, I see that in another comment above @ayounsi mentioned "wipe the switch" but then I saw the switch was removed. @Cmjohnson, can you confirm the switch was wiped before (or after) its removal? (Any reason we didn't go the decom task route here like we normally do?)

Aug 18 2020, 11:39 AM · netops, ops-eqiad, SRE

Aug 17 2020

faidon added a comment to T245161: Track down and replace very old HW.

@wiki_willy, what's the latest here? What's blocking us from having decom tasks for all of the items above?

Aug 17 2020, 10:05 PM · Patch-For-Review, DC-Ops
faidon added a member for acl*security_sre: faidon.
Aug 17 2020, 12:17 PM

Aug 4 2020

faidon added a comment to T245161: Track down and replace very old HW.

Bump! What's the latest here?

Aug 4 2020, 10:57 PM · Patch-For-Review, DC-Ops

Jul 22 2020

faidon reopened T257573: Remove multicast as "Open".

We still seem to have remnants of PIM-RP:

faidon@re0.cr2-codfw> show configuration | display set | match 208.80.153.194             
set interfaces lo0 unit 0 family inet address 208.80.153.194/32
Jul 22 2020, 6:12 PM · Infrastructure-Foundations, Patch-For-Review, netops, SRE, Traffic

Jul 21 2020

faidon closed T258309: Update cloudservices@wikimedia.org list permissions to allow Foundation staff to post to it as Resolved.

It looks like both of these issues are resolved now! Boldly resolving :)

Jul 21 2020, 1:24 PM · cloud-services-team (Kanban)

Jul 16 2020

faidon added a comment to T258018: ripe-atlas-eqiad IPv6 unreachable.

To give a little more context: in response to us requesting an extension for the v2 anchors, the RIPE NCC team reached out to ask if they can run a test upgrade on one of our anchors (which I of course said OK to!).

Jul 16 2020, 3:47 PM · SRE, netops

Jul 4 2020

faidon added a comment to T257062: Lilypond seemingly not subject to restrictions (CVE-2020-29007).

With a cursory look from yesterday, the following issues apply or would need further investigation:

  • We did not run lilypond in a firejail due to a mediawiki-config configuration bug.
  • We did not run lilypond in safe mode, on purpose, as safe mode breaks a number of common features. With a very small sample, about 50% of our existing files break. @Platonides may have better numbers. Some of these may be intentional (for e.g. resource use), but some may be unintentional (e.g. color definitions are not defined as symbols). It does not feel like the safe codepaths are well used or tested, which is a problem on its own.
  • Lilypond's code does not seem to be safe-by-default. -dsafe is not the default and is only buried in the documentation. Variables/methods are unsafe by default, e.g. define-public is unsafe and define-safe-public is the safe version, rather than vice-versa. In many places the mode is not called "safe", but "safer", which is... scary. Lilypond has a --jail option that it recommends instead of safe mode, but which is nothing but a setgid/setuid/chroot/chdir; hardly secure.
  • "The Guile interpreter is part of LilyPond, which means that Scheme can be included in LilyPond input files. There are several methods for including Scheme in LilyPond". Guile is a powerful language, with POSIX in its stdlib, as well as Dynamic FFI, essentially allowing arbitrary code execution by design. Guile has a sandboxed evaluation mode (h/t @CDanis), but Lilypond does not seem to employ it. Effectively, this is a "Microsoft Excel runs macros by default" blast from the past situation :)
  • Besides the use of Scheme per se, Lilypond also uses PostScript as an intermediate format, relying on Ghostscript to convert to PNG. It does not call Ghostscript with -dSAFER, or in some cases calls it with -dNOSAFER. This is explicit, present in the version we run in production, but also in commits as recent as 2 weeks ago (Revert adding .setsafe for Ghostscript command). It also allows users to embed arbitrary postscript using \postscript, effectively allowing arbitrary code execution, even in safe mode. This is perhaps also indicative of upstream's attitude towards considering all input as trusted.
  • Similar injection code paths could be present in other backends, including e.g. its SVG output; it's unclear whether it allows arbitrary SVG elements to be included (including maybe <script>?). I also don't think we use SVG in production right now? But one could imagine an otherwise innocuous change being deployed to enable it in the future, so we should at least evaluate this or add a bunch of warnings for our future selves.
  • All in all, I think this needs to be discussed with upstream, to hopefully result in a mindset shift with regards to whether input is considered trusted or untrusted by default. In its current state, I don't think it's reasonable for users to even run this on their desktops with anything but scores they've personally handcrafted, or for distributors like Debian to ship this without warnings to that effect.
Jul 4 2020, 11:42 AM · MW-1.36-notes (1.36.0-wmf.1; 2020-07-21), MW-1.35-notes (1.35.0-wmf.41; 2020-07-14), Wikimedia-Incident, WMF-General-or-Unknown, MediaWiki-extensions-Score, Security, Security-Team
faidon added a comment to T257091: Re-enable the Score extension in safe mode.

(see T257092 for more about this)

Jul 4 2020, 1:42 AM · MediaWiki-extensions-Score, Security, Security-Team

Jul 2 2020

faidon added a comment to T254332: Add more dimensions in the netflow/pmacct/Druid pipeline.

So - how do we make progress here? Any thoughts on who/how? :) Some of these features could really make a tremendous amount of difference to our network operations and future planning, so I'm super excited about seeing these come to fruition!

Jul 2 2020, 5:23 PM · Patch-For-Review, Analytics-Kanban, Analytics, netops, SRE
faidon updated the task description for T254332: Add more dimensions in the netflow/pmacct/Druid pipeline.
Jul 2 2020, 5:22 PM · Patch-For-Review, Analytics-Kanban, Analytics, netops, SRE

Jul 1 2020

faidon added a comment to T252577: Maxmind data update issues for DNS (and others?).

I was bitten by this again today - ping!

Jul 1 2020, 5:29 PM · SRE, Traffic

Jun 26 2020

faidon triaged T256498: Return asw-c8-codfw to spares as Low priority.
Jun 26 2020, 6:07 PM · ops-codfw, SRE

Jun 25 2020

faidon added a comment to T254332: Add more dimensions in the netflow/pmacct/Druid pipeline.

To add to the above, I'm also wondering how difficult it would be to also include AS *names*, e.g. coming from the MaxMind GeoIP ASN database. I think we've used that database before, maybe for pageview data? Could we perhaps use Druid lookups for this to avoid adding another (identical) dimension to the data set?

Jun 25 2020, 12:09 AM · Patch-For-Review, Analytics-Kanban, Analytics, SRE, netops

Jun 24 2020

faidon closed T219486: Send peering requests to AS with the worst TTFB as Resolved.

I took a look at that list above. It's really not very actionable -- most of these are very large networks that have a restrictive settlement-free peering policy. For the few that remain, we have either established peerings already or have sent unanswered peering requests, which mostly means that they are not actively peering or we are too small for them to care about.

Jun 24 2020, 11:30 PM · AS-Report, Traffic, Performance-Team, SRE
faidon updated subscribers of T254332: Add more dimensions in the netflow/pmacct/Druid pipeline.
Jun 24 2020, 10:15 PM · Patch-For-Review, Analytics-Kanban, Analytics, SRE, netops

Jun 18 2020

faidon updated the task description for T245161: Track down and replace very old HW.
Jun 18 2020, 10:31 AM · Patch-For-Review, DC-Ops
faidon updated the task description for T245161: Track down and replace very old HW.
Jun 18 2020, 10:25 AM · Patch-For-Review, DC-Ops

Jun 11 2020

faidon added a comment to T254818: Requesting access to PROD for lmata (SRE).

Approved.

Jun 11 2020, 10:53 AM · SRE, SRE-Access-Requests

Jun 4 2020

faidon added a comment to T251536: Peer with SFMIX at ulsfo (May 2020).

This is now set up on SFMIX's end and up:

On your side please plumb 206.197.187.82/24 and 2001:504:30::ba01:4907:1/64. Usual sane BGP peering rules apply - no broadcast traffic (DHCP, CDP, etc), see https://sfmix.org/connect/guide.

We request at least one required BGP session (to our looking glass) and optional sessions for the route servers
The looking glass is AS12276 at 206.197.187.1 and 2001:504:30::ba01:2276:1. You should announce all your routes to the looking glass, but expect no routes to be announced to you.

We'll push out configs to support these peers this evening.
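
As a quick sanity check of the addressing quoted above, the assigned interface addresses and the looking glass addresses can be confirmed to share the exchange subnets using Python's standard ipaddress module. The addresses are taken from the quoted text; the helper name is hypothetical.

```python
import ipaddress

# Addresses as quoted in the comment above.
our_interfaces = [
    ipaddress.ip_interface("206.197.187.82/24"),
    ipaddress.ip_interface("2001:504:30::ba01:4907:1/64"),
]
looking_glass = [
    ipaddress.ip_address("206.197.187.1"),
    ipaddress.ip_address("2001:504:30::ba01:2276:1"),
]


def on_link(peer, interfaces):
    """True if `peer` falls inside the subnet of any same-family interface."""
    return any(
        peer.version == iface.version and peer in iface.network
        for iface in interfaces
    )


# True for both looking glass addresses: they sit in the same /24 and /64.
reachable = [on_link(peer, our_interfaces) for peer in looking_glass]
```

This only checks subnet membership; it says nothing about whether the BGP sessions themselves are established.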

Jun 4 2020, 7:53 AM · netops, SRE

Jun 3 2020

faidon created T254332: Add more dimensions in the netflow/pmacct/Druid pipeline.
Jun 3 2020, 9:40 AM · Patch-For-Review, Analytics-Kanban, Analytics, SRE, netops

May 19 2020

faidon added a comment to T225121: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300.

Are there any updates to this task and any particular reasons it's been held up? While this was never super urgent, we're now at the ~one year mark since this was ordered and delivered to the data center. Plus I think because at the time the upgrade was imminent, we only bought support for the new switch and not the old, so we're operating with unsupported HW right now. It'd be great if this were to be completed soon. Thanks!

May 19 2020, 9:22 AM · netops, ops-eqiad, SRE

May 15 2020

faidon added a comment to T247881: Three ports on asw2-d-eqiad are not working as expected.

If three ports have permanently failed, I'm not sure how we could ever trust that switch again. Perhaps it's better to do a painful but planned replacement rather than have it fail at some inconvenient time and have to rush a replacement then?

May 15 2020, 12:16 PM · Infrastructure-Foundations, ops-eqiad, SRE, netops

May 12 2020

faidon added a comment to T252577: Maxmind data update issues for DNS (and others?).

I know that historically MaxMind has claimed they update the data roughly on a weekly basis, and maybe in this case it was a normal weekly update and we're just misaligned with their weeks? In any case, the current geoipdate seems to be smart enough to checksum the existing databases and not re-download pointless duplicates, so we could probably run it more often on the puppetmasters.
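
The "checksum first, download only on change" idea mentioned above can be sketched generically as below. This is not the updater's actual implementation: the choice of SHA-256, the in-memory byte strings, and the function names are all assumptions for illustration.

```python
import hashlib
from typing import Optional


def checksum(data: bytes) -> str:
    """Hex digest of a database blob (SHA-256 is an assumed choice here)."""
    return hashlib.sha256(data).hexdigest()


def needs_download(local_db: Optional[bytes], remote_checksum: str) -> bool:
    """Decide whether to fetch a new copy.

    local_db is the current database contents, or None if we have no copy.
    remote_checksum is the digest advertised for the latest upstream version.
    """
    if local_db is None:
        return True
    return checksum(local_db) != remote_checksum.lower()
```

Because an unchanged database costs only a checksum comparison, running the check more often than the upstream release cadence is cheap, which is the point made in the comment.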

May 12 2020, 6:45 PM · SRE, Traffic

May 8 2020

faidon added a subtask for T251536: Peer with SFMIX at ulsfo (May 2020): Unknown Object (Task).
May 8 2020, 12:10 PM · netops, SRE
faidon removed a subtask for T251536: Peer with SFMIX at ulsfo (May 2020): Unknown Object (Task).
May 8 2020, 12:10 PM · netops, SRE
faidon added a comment to T251536: Peer with SFMIX at ulsfo (May 2020).

LoA received and cross-connect task created.

May 8 2020, 12:10 PM · netops, SRE
faidon renamed T251536: Peer with SFMIX at ulsfo (May 2020) from Peer with SFMIX at ulsfo to Peer with SFMIX at ulsfo (May 2020).
May 8 2020, 12:09 PM · netops, SRE
faidon added a subtask for T251536: Peer with SFMIX at ulsfo (May 2020): Unknown Object (Task).
May 8 2020, 12:09 PM · netops, SRE