mark
mark (Mark Bergsma)
Lead Operations Architect & Director of Technical Operations
Projects (20)
CopyPatrol
Component
DC-Ops
Group
Dumps-Rewrite
Component
Hash-Checking
Component
netops
Component
User Details
User Since
Oct 8 2014, 1:57 PM (362 w, 3 d)
Availability
Available
IRC Nick
mark
LDAP User
Mark Bergsma
MediaWiki User
Unknown
Recent Activity
Jul 26 2021
mark raised the priority of T263220: Limit concurrency of DPL queries from Medium to High.
Given that the underlying problem that this change might help with has already caused multiple full outages (all wikis affected) in the past year alone, and the extension is deployed on quite a few wikis, I'd like to ask for this to be looked into again in the near term. Raising priority to 'high'. Would this be in scope for PET's Clinic Duty? How can SRE help?
Jul 26 2021, 12:31 PM · Slow-DB-Query, SecTeam-Processed, Security, Vuln-DoS, Sustainability (Incident Followup), SRE, serviceops, PoolCounter, Platform Team Workboards (Clinic Duty Team), MW-1.36-notes (1.36.0-wmf.18; 2020-11-17), Performance Issue, Patch-For-Review, DynamicPageList (Wikimedia)
Feb 19 2021
mark added a comment to T274459: Eqiad: 2 VM request for GitLab.
In T274459#6841122, @thcipriani wrote:
Whoa, catching up on scrollback overnight. My question is: is this the first anyone in SRE has heard about any of this?
Feb 19 2021, 10:18 AM · GitLab (Initialization), Patch-For-Review, User-brennen, vm-requests, SRE
Jan 22 2021
mark added a comment to T272686: print a list of backed up directories in the MOTD of production servers.
It's purely an idea I've had for a long time, to make it immediately obvious to anyone logging in what is backed up, and what isn't. That should help to:
Jan 22 2021, 11:43 AM · Data-Persistence-Backup, SRE
Oct 8 2020
mark added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).
Hi all,
Oct 8 2020, 11:52 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
Sep 4 2020
mark added a comment to T262042: Security Issue Access Request for LSobanski.
Approved.
Sep 4 2020, 1:44 PM · Security-Team, Security
Sep 1 2020
mark added a comment to T261760: Requesting access to Production for lsobanski.
Approved.
Sep 1 2020, 3:24 PM · SRE, SRE-Access-Requests
mark added a comment to T261626: Requesting access to Production for klausman.
Approved.
Sep 1 2020, 10:34 AM · SRE, SRE-Access-Requests
Aug 20 2020
mark updated subscribers of T260764: backup2001 RAID controller failure, unable to post 2020-08-19.
@wiki_willy @Papaul It seems we've had an ongoing pattern of crashes with this (rather important) backup host, which means we are not yet able to trust it. Until we are able to resolve this, we also cannot decommission the older hosts that this one replaces. At the moment the system doesn't even boot. Are there any steps we can take soon to debug this issue? Anything we can help with? Thanks!
Aug 20 2020, 10:35 AM · SRE, ops-codfw
Jul 10 2020
mark added a comment to T256451: Security Issue Access Request for Kormat.
Approved.
Jul 10 2020, 9:17 AM · User-Kormat, Security-Team, Security
May 27 2020
mark added a project to T247028: Database 'INSERT' query rate doubled (module_deps regression?): Platform Team Workboards (Clinic Duty Team).
May 27 2020, 10:43 AM · MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Sustainability (Incident Followup), Performance Issue, Performance-Team, MediaWiki-ResourceLoader
Apr 7 2020
mark added a project to T157651: sql.php must not run LoadExtensionSchemaUpdates: DBA.
Apr 7 2020, 12:11 PM · Sustainability (Incident Followup), MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Wikidata, Growth-Team, StructuredDiscussions, Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Performance-Team, MediaWiki-Maintenance-system
Feb 21 2020
mark added a comment to T245520: 2*10G optics down on cr2-esams.
I am pretty sure there are a bunch of optics (of various kinds) in the "spare" switches, in the bottom of rack OE15. Unfortunately those switches are not powered up, and certainly not configured or remotely manageable - something we should probably fix on the next visit.
Feb 21 2020, 12:17 PM · ops-esams, netops, SRE
Feb 18 2020
mark added a comment to T245520: 2*10G optics down on cr2-esams.
There are multiple 10G LR optics on-site for sure. Longer distance ones, less so.
Feb 18 2020, 2:59 PM · ops-esams, netops, SRE
Feb 13 2020
mark added a comment to T245060: Pybal should reject a confctl configuration that indicates only one cp-text is pooled.
Personally I don't think Pybal should be rejecting that; it's a valid configuration from a technical standpoint, and there can be valid reasons to have it, at least temporarily. But we may decide that in our specific environment that should be avoided at all cost, so perhaps that logic should be implemented elsewhere - in the code that manages pooling state.
Feb 13 2020, 11:49 AM · Sustainability (Incident Followup), Pybal, SRE
Feb 12 2020
mark added a comment to T236437: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet.
@wiki_willy With Chris having been ill the past few days, what's a realistic new ETA for this?
Feb 12 2020, 4:47 PM · serviceops, SRE
Dec 10 2019
mark added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.
In T238909#5727693, @akosiaris wrote:
Dec 10 2019, 12:02 PM · SRE, Prod-Kubernetes, Pybal, Traffic, serviceops
mark added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.
I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balancers. Why couldn't our caching layer do that directly, and know about all the k8s proxies/nodes directly and do health checks for them?
Dec 10 2019, 11:43 AM · SRE, Prod-Kubernetes, Pybal, Traffic, serviceops
Nov 28 2019
mark moved T237041: wipe backup-array1 from Backlog to Blocked on the ops-esams board.
Nov 28 2019, 11:34 AM · SRE, ops-esams
mark moved T174637: Setup esams atlas anchor from Racking Tasks to Blocked on the ops-esams board.
Nov 28 2019, 11:34 AM · SRE, netops, ops-esams
Nov 27 2019
mark added a comment to T184066: rack/setup/install ps[12]-oe1[456]-esams.
In T184066#5695891, @RobH wrote:
In T184066#5694288, @Papaul wrote:
qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17
qfx5100-spare2, psu 0 {#20157} to ps2-oe15-esams:16
qfx5100-spare1, psu 1 {#20159} to ps1-oe15-esams:2
qfx5100-spare2, psu 1 {#20158} to ps1-oe15-esams:3
asw2-oe16-esams:psu0 {#20162} to ps2-oe16-esams:26
asw2-oe16-esams:psu1 {#20164} to ps1-oe16-esams:26
All the above are done, but NOT
scs1-oe15-esams:psu1 {#20163} to ps2-oe15-esams:34
scs1-oe15-esams:psu2 {#20164} to ps1-oe15-esams:34
as there is no scs-oe15-esams, not sure what that is. Mark's comment T184066#5694430 covers scs-oe16-esams.
Nov 27 2019, 11:05 AM · SRE, ops-esams
Nov 26 2019
mark added a comment to T184066: rack/setup/install ps[12]-oe1[456]-esams.
scs1-oe16-esams:psu1 {#20163} to ps2-oe16-esams:34
scs1-oe16-esams:psu2 {#20164} to ps1-oe16-esams:34
Nov 26 2019, 5:54 PM · SRE, ops-esams
mark closed T238835: apply asset tags to cable managers as Resolved.
Nov 26 2019, 4:39 PM · SRE, ops-esams
mark moved T237009: Add missing labels for equipment and cables from Procurement to Blocked on the ops-esams board.
Nov 26 2019, 4:38 PM · DC-Ops, SRE, ops-esams
mark updated the task description for T237009: Add missing labels for equipment and cables.
Nov 26 2019, 4:12 PM · DC-Ops, SRE, ops-esams
mark added a comment to T237009: Add missing labels for equipment and cables.
cr3-esams now has its power cables labeled:
Nov 26 2019, 4:11 PM · DC-Ops, SRE, ops-esams
mark added a comment to T237009: Add missing labels for equipment and cables.
cr2-esams now has its power cables labeled:
Nov 26 2019, 3:58 PM · DC-Ops, SRE, ops-esams
mark closed T237006: Relabel cables with duplicate IDs as Resolved.
All duplicate ids have been fixed, labels replaced for one pair and updated in netbox.
Nov 26 2019, 3:00 PM · SRE, ops-esams
mark updated the task description for T237009: Add missing labels for equipment and cables.
Nov 26 2019, 2:36 PM · DC-Ops, SRE, ops-esams
mark added a comment to T237009: Add missing labels for equipment and cables.
I've filled out all red cells in the (original) bootstrap spreadsheet.
Nov 26 2019, 2:36 PM · DC-Ops, SRE, ops-esams
mark added a comment to T238835: apply asset tags to cable managers.
All 7 cable managers have been asset tagged and put into Netbox with the appropriate info and rack position.
Nov 26 2019, 2:26 PM · SRE, ops-esams
mark added a comment to T237009: Add missing labels for equipment and cables.
All SERVER power cords have been audited in this sheet: https://docs.google.com/spreadsheets/d/1RMb6lMCc94wUj6MgSm1yYdnAC3SUsZIRj8zHLtxRx4o/edit?usp=sharing
Nov 26 2019, 1:26 PM · DC-Ops, SRE, ops-esams
mark updated the task description for T237009: Add missing labels for equipment and cables.
Nov 26 2019, 1:25 PM · DC-Ops, SRE, ops-esams
mark closed T237014: Update spare QFX labels as Resolved.
Done.
Nov 26 2019, 10:15 AM · ops-esams, SRE
Nov 25 2019
mark updated the task description for T237030: Setup new MX204 in knams.
Nov 25 2019, 6:14 PM · netops, ops-esams, SRE
mark updated the task description for T237030: Setup new MX204 in knams.
Nov 25 2019, 5:43 PM · netops, ops-esams, SRE
Nov 4 2019
mark moved T234450: Special:Contributions requests with a high &limit= caused excessive database load from Done to Discussing on the Platform Team Workboards (Clinic Duty Team) board.
CPT: please take a new look, thanks :)
Nov 4 2019, 5:17 PM · MW-1.31-release-notes, MW-1.33-notes, MW-1.34-notes, Platform Engineering, Security, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), User-notice, Vuln-DoS, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error
Oct 24 2019
mark added a comment to T232887: The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent.
I'm a bit confused; as far as I know the old plan was always to have HA of Phabricator between eqiad and codfw, and the linked task T190572 also talks about that. So is that no longer the case, and if so, why is that? I believe there have been blockers & complications for that deployment, but are they documented anywhere? How does this task relate to those plans, why do we feel failover within eqiad is (also) needed?
Oct 24 2019, 3:57 PM · SRE, hardware-requests, Release-Engineering-Team (Development services), serviceops, Phabricator
Oct 22 2019
mark added projects to T234450: Special:Contributions requests with a high &limit= caused excessive database load: Platform Engineering, Platform Team Workboards (Clinic Duty Team).
Could CPT take a look at this please? Thanks!
Oct 22 2019, 9:54 AM · MW-1.31-release-notes, MW-1.33-notes, MW-1.34-notes, Platform Engineering, Security, MW-1.35-notes (1.35.0-wmf.5; 2019-11-05), User-notice, Vuln-DoS, Performance Issue, MediaWiki-Special-pages, Wikimedia-production-error
Sep 17 2019
mark added a comment to T231387: Updating DNS records (pr.wikimedia.org).
What's the status of this? Is this done and working?
Sep 17 2019, 12:25 PM · Mail, WMF-Communications, SRE
Sep 12 2019
mark added a project to T231387: Updating DNS records (pr.wikimedia.org): Mail.
Sep 12 2019, 12:55 PM · Mail, WMF-Communications, SRE
mark updated subscribers of T231387: Updating DNS records (pr.wikimedia.org).
In T231387#5471833, @Varnent wrote:
@mark - Thank you very much for that thoughtful and helpful reply!
Talking it over, we would like to try the first option if you believe that will work.
So how do we go about getting this setup?
Anusha Alikhan
aalikhan@pr.wikimedia.org
Samantha Lien
slien@pr.wikimedia.org
Sep 12 2019, 12:54 PM · Mail, WMF-Communications, SRE
Sep 5 2019
mark changed the status of T231387: Updating DNS records (pr.wikimedia.org) from Stalled to Open.
Sep 5 2019, 2:32 PM · Mail, WMF-Communications, SRE
mark added a comment to T231387: Updating DNS records (pr.wikimedia.org).
Hi Anusha, Greg,
Sep 5 2019, 2:32 PM · Mail, WMF-Communications, SRE
Aug 9 2019
mark added a comment to T229755: csw2-esams's VCP link flapped.
EX4200 can also have any port converted as VC - just won't be as fast, max 10Gbps.
Aug 9 2019, 10:02 AM · SRE, netops
Aug 6 2019
mark added a comment to T229860: SRE Onboarding for Sukhbir Singh.
Approved for access.
Aug 6 2019, 11:13 AM · SRE-Access-Requests, Traffic, SRE
Jul 23 2019
mark raised the priority of T228720: stub for enwiki broken, attempt to load content for bad rev during sha1 retrieval from High to Unbreak Now!.
Because this means that stub dump generation for (at least) enwiki, dewiki and several other wikis is broken right now, we have only a few days to fix this before the dumps need to be done at the end of the month. Setting UBN...
Jul 23 2019, 1:40 PM · Platform Team Initiatives (MCR), MW-1.34-notes (1.34.0-wmf.14; 2019-07-16), Dumps-Generation
Apr 16 2019
mark renamed T218570: DB planning: include a writeable (?) misc DB cluster in codfw for WMCS from DB planning: include a misc cluster in codfw to DB planning: include a writeable (?) misc DB cluster in codfw for WMCS.
Apr 16 2019, 10:43 AM · DBA, cloud-services-team (Kanban)
Apr 5 2019
mark updated the task description for T219805: Investigate Doctrine DBAL usage possibility.
Apr 5 2019, 11:13 AM · User-Addshore, Wikidata-Trailblazing-Exploration, Wikidata, TechCom, Patch-For-Review
mark added a comment to T219805: Investigate Doctrine DBAL usage possibility.
While I agree with Daniel and others that the use of the MediaWiki db connection/load balancing layer is an absolute minimum requirement, there are quite a few other potential problems that could affect the security/privacy, reliability or maintainability of our data and services if Doctrine is to be used to access MediaWiki's existing databases in any way (it's definitely easier if done in separate, unconnected database clusters). However, this ticket is so far very sparse on details, and we don't have the information we need to make an informed decision. I requested access to the linked document yesterday, but it hasn't been granted yet. Alternatively, could this perhaps be replicated here on Phabricator so everyone involved can build an informed opinion? Thanks. :)
Apr 5 2019, 11:13 AM · User-Addshore, Wikidata-Trailblazing-Exploration, Wikidata, TechCom, Patch-For-Review
Apr 1 2019
mark added a project to T190379: RFC: Re-establish the development policies: DBA.
There has been some concern from our DBAs that the archiving of the old policy will make it even harder for developers to find out what database-related requirements their code should fulfill, and what the process is for getting any schema or query changes deployed (such as a link to the Schema_changes page). The old information on database-related requirements, while admittedly a bit outdated, was discussed as an RFC at the time: https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2015-09-16
Apr 1 2019, 1:08 PM · DBA, Performance-Team, TechCom-RFC (TechCom-RFC-Closed), TechCom
Mar 22 2019
Effie Mouzeli <effie@wikimedia.org> committed rMSCA5e1eced094fe: Add unit testing of scap main.py (authored by mark).
Add unit testing of scap main.py
Mar 22 2019, 11:33 AM
Mar 21 2019
Mill <mill@mail.com> committed rMSCA135f64c71c56: 3%5eaaaaaaaaaaaa (authored by mark).
3%5eaaaaaaaaaaaa
Mar 21 2019, 12:11 AM
Mar 6 2019
Effie Mouzeli <effie@wikimedia.org> committed rMSCA8d204fe0b7a9: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Mar 6 2019, 7:37 PM
Effie Mouzeli <effie@wikimedia.org> committed rMSCA2ab9d6f3e4d9: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Mar 6 2019, 6:27 PM
Effie Mouzeli <effie@wikimedia.org> committed rMSCAa7a532cb535f: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Mar 6 2019, 5:51 PM
Effie Mouzeli <effie@wikimedia.org> committed rMSCA705d3be59ec8: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Mar 6 2019, 4:04 PM
Feb 22 2019
mark committed rMSCAd624470dbe89: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Feb 22 2019, 5:27 PM
mark committed rMSCA3b376098e5fa: WiP: Add unit testing of scap main.py (authored by mark).
WiP: Add unit testing of scap main.py
Feb 22 2019, 5:27 PM
Feb 5 2019
mark removed a watcher for ops-codfw: mark.
Feb 5 2019, 2:37 PM
Jan 23 2019
mark added a comment to T211254: Free up 185.15.59.0/24.
In T211254#4902340, @BBlack wrote:
In T211254#4902250, @mark wrote:
It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.
In a world where there's ample address space (such as 10/8 in our context), yes. In today's world where IPv4 address space is scarce and we can likely not get any more, not so much.
I would personally have preferred that with the renumbering of WMCS they simply acquired new public IPv4 space of their own
That's simply not realistic, they can't "acquire" IPv4 address space of their own. They're part of this organisation, this ASN, and need to use our PI/PA space where we have it available before we collectively can get more.
I understand the basic concerns here about exhaustion and how the process works. I think it would've been possible to find a way to ask for new or acquire new space though, even in the US. It's just a process and a cost at the end of the day.
Jan 23 2019, 3:16 PM · Patch-For-Review, Traffic, netops, SRE
mark added a comment to T211730: Replace accepted-prefix-limit with prefix-limit.
Yes, we should probably move over to prefix-limit to prevent (improving) filters from making accepted-prefix-limit ineffective.
Jan 23 2019, 2:23 PM · SRE, netops
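The concern in the comment above can be illustrated with a toy model. This is only a sketch, not Junos configuration: it assumes that `prefix-limit` counts the prefixes a peer sends, while `accepted-prefix-limit` counts only those that survive the import filters. The function name and the numbers are hypothetical.

```python
def limit_exceeded(received, accepted, limit, count_accepted_only):
    """Toy model of a BGP prefix limit check.

    received: prefixes the peer sent (hypothetical number)
    accepted: prefixes left after import filtering (hypothetical number)
    count_accepted_only: True models an accepted-prefix-limit style check,
                         False models a prefix-limit style check.
    """
    counted = accepted if count_accepted_only else received
    return counted > limit


# A peer leaks 1000 prefixes, but a strict import filter accepts only 10.
# With a limit of 100, the accepted-only check never trips, while the
# check on received prefixes does.
print(limit_exceeded(1000, 10, 100, count_accepted_only=True))
print(limit_exceeded(1000, 10, 100, count_accepted_only=False))
```

In this model, the better the filters become, the less likely the accepted-only check is to notice a misbehaving peer, which is the reason given for moving to prefix-limit.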
mark added a comment to T211254: Free up 185.15.59.0/24.
In T211254#4902223, @BBlack wrote:
It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.
Jan 23 2019, 1:55 PM · Patch-For-Review, Traffic, netops, SRE
mark added a comment to T211728: Outbound BGP graceful shutdown.
Have a look at https://github.com/mwiget/bgp_graceful_shutdown for a JunOS op script (SLAX) that does this fully automatically for all peers with a single command.
Jan 23 2019, 1:38 PM · Patch-For-Review, SRE, netops
mark closed T186021: reconfigure esams switch port for new bastion as Declined.
This was solved by fixing the original bastion, a while ago.
Jan 23 2019, 1:22 PM · ops-esams, netops, SRE
mark closed T186021: reconfigure esams switch port for new bastion, a subtask of T184936: install/designate other machine as esams bastion, as Declined.
Jan 23 2019, 1:22 PM · SRE, ops-esams
mark added a comment to T211254: Free up 185.15.59.0/24.
I really don't see the point of this. With the scarcity of IPv4 space we only need to get MORE flexible about how we use our IP space, and we will almost certainly not be able to maintain production vs others split between these address blocks in the future. Rather than spend time on renumbering I think it's much more valuable to spend that effort on better managing our ACLs and more automation.
Jan 23 2019, 1:12 PM · Patch-For-Review, Traffic, netops, SRE
Jan 11 2019
mark reopened T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring as "Open".
Jan 11 2019, 3:13 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, observability, SRE
mark raised the priority of T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring from Medium to High.
Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper per-rack power usage data is still entirely missing. This makes it currently very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.
Jan 11 2019, 3:13 PM · User-fgiunchedi, Patch-For-Review, Prometheus-metrics-monitoring, observability, SRE
Dec 19 2018
mark added a comment to T212129: Move MainStash out of Redis to a simpler multi-dc aware solution.
I am getting the impression here that some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact, after this new discovery. Is that because goals are due to be posted now?
Dec 19 2018, 3:05 PM · Performance-Team, Sustainability (MediaWiki-MultiDC), MediaWiki-General, serviceops-radar, User-mobrovac, User-jijiki, SRE
Oct 12 2018
mark moved T199677: cp3033 unreacheable since 2018-07-15 11:47:31 from Backlog to Hardware Failure / Repair on the ops-esams board.
Oct 12 2018, 2:51 PM · ops-esams, SRE, Traffic
Sep 18 2018
mark reassigned T201470: Add contint-roots to releases{1,2}001 from mark to RobH.
Sep 18 2018, 11:35 AM · Patch-For-Review, Release-Engineering-Team (Watching / External), SRE-Access-Requests, SRE
mark updated subscribers of T201470: Add contint-roots to releases{1,2}001.
Although we didn't manage to discuss this in our SRE meeting yesterday I discussed it with relevant people afterwards.
Sep 18 2018, 11:35 AM · Patch-For-Review, Release-Engineering-Team (Watching / External), SRE-Access-Requests, SRE
Sep 11 2018
mark added a comment to T204083: wikibase_shared/<current_train_version>-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic.
T97368 appears to be about the same issue.
Sep 11 2018, 9:17 PM · wdwb-tech, Performance-Team, SRE, wikiba.se website, Wikidata
mark added projects to T203039: Storage of data for recommendation API: DBA, SRE.
Sep 11 2018, 4:39 PM · Analytics, SRE, DBA, Services (designing), Research
mark added projects to T204026: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki: DBA, SRE.
Sep 11 2018, 1:31 PM · Patch-For-Review, Language-Team (Language-2021-July-September), MW-1.37-notes (1.37.0-wmf.23; 2021-09-13), MediaWiki-extensions-CentralNotice, Wikimedia-Fundraising, Performance-Team (Radar), Datacenter-Switchover, Wikimedia-production-error, SRE, MediaWiki-extensions-Translate
mark added a comment to T203674: Debian package or files managed my puppet for pt-kill-wmf.
Indeed, let's go with a "proper" Debian package, imho the cleanest way to go and conforming to how we do things.
Sep 11 2018, 10:18 AM · User-Banyek, Puppet, SRE
Sep 3 2018
mark added a comment to T203182: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese.
Yes, this can be merged once Nuria approves.
Sep 3 2018, 10:35 AM · Patch-For-Review, SRE, SRE-Access-Requests
Aug 14 2018
mark updated subscribers of T201856: Subscribe user mepps to security@wikimedia.org.
@Dzahn please get her added to this list. Thanks!
Aug 14 2018, 5:36 PM · SRE, SRE-Access-Requests
Aug 13 2018
mark updated the task description for T201694: Move servers off asw2-a-eqiad.
Aug 13 2018, 9:08 AM · Patch-For-Review, SRE, netops
Aug 10 2018
mark added a comment to T200297: Review Jade data storage and architecture proposal [RFC].
In T200297#4493122, @Halfak wrote:
I talked to @mark today. Here's what I understood from the conversation:
All of the following points assume that the TechCom discussion happens and there's a decision that the local-wiki JADE namespace is the only reasonable implementation strategy
Large wikis (enwiki, wikidatawiki, and commonswiki) are where the concerns exist. All other, smaller wikis are less of a concern.
The revision table is the only table that is a serious concern for large wikis. The page table is less of a concern.
Our estimated growth of 0.5M new revisions per large wiki per year is acceptable growth.
In order to account for fluctuations, a ceiling of 1M new revisions per large wiki per year is acceptable for JADE judgments.
Aug 10 2018, 12:37 PM · TechCom-RFC (TechCom-RFC-Closed), MW-1.33-notes (1.33.0-wmf.14; 2019-01-22), Patch-For-Review, Machine-Learning-Team (Active Tasks), DBA, SRE, Jade
Jul 30 2018
mark added a comment to T200297: Review Jade data storage and architecture proposal [RFC].
I am a bit confused by this RFC/proposal as it stands now, as I feel it doesn't really reflect the discussions we've been having.
Jul 30 2018, 2:27 PM · TechCom-RFC (TechCom-RFC-Closed), MW-1.33-notes (1.33.0-wmf.14; 2019-01-22), Patch-For-Review, Machine-Learning-Team (Active Tasks), DBA, SRE, Jade
Jul 25 2018
mark added a comment to T195923: rack/setup/install cp1075-cp1090.
In T195923#4450204, @Cmjohnson wrote:
@ayounsi I was not able to add the ports in row A to the public vlan. Can you check the following and add to public vlan. Also, adding the servers to the vlan in the other rows did not automatically enable the ports. Can you also check please.
Jul 25 2018, 12:43 PM · Patch-For-Review, ops-eqiad, Traffic, SRE
mark added a comment to T168539: Unhandled pybal error: OpenSSL.SSL.Error - ssl handshake failure.
@ema: Has this been seen again? Does this need any work in Pybal?
Jul 25 2018, 11:07 AM · Traffic, Pybal, SRE
mark moved T113597: pybal-related issue on host start can break service IPs... from Backlog to Blocked on the Pybal board.
Jul 25 2018, 10:58 AM · Traffic-Icebox, SRE, Pybal
mark moved T114104: pybal doesn't fully manage LVS table leaving stale services (on IP change) from Backlog to Blocked on the Pybal board.
Jul 25 2018, 10:58 AM · Traffic-Icebox, SRE, Pybal
mark moved T86650: Add support for setting weight=0 when depooling from Backlog to Blocked on the Pybal board.
Jul 25 2018, 10:56 AM · Traffic-Icebox, SRE, Pybal
mark moved T114979: Run IPVS in a separate network namespace from Backlog to Blocked on the Pybal board.
Jul 25 2018, 10:56 AM · Traffic-Icebox, SRE, Pybal
mark moved T172124: PyBal Feature: progressive depooling strategy for monitored failures from Backlog to Blocked on the Pybal board.
Jul 25 2018, 10:56 AM · Traffic-Icebox, Pybal, SRE
mark created T200319: Migrate Pybal to Python 3.
Jul 25 2018, 10:55 AM · User-Ladsgroup, Patch-For-Review, Python3-Porting, Pybal
mark added a comment to T200277: OSPF metrics.
The eqdfw-knams link needs to have a lower metric than the current primary (codfw-eqiad + eqiad-esams) links, so that traffic from codfw to esams prefers that link.
Jul 25 2018, 8:25 AM · Infrastructure-Foundations, netops, SRE
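The metric reasoning in the comment above can be sketched as follows. OSPF prefers the path with the lowest total cost, so the single eqdfw-knams link must cost less than the sum of the two links on the current primary path. The metric values are hypothetical and chosen only for illustration; they are not the values used in production.

```python
def preferred_path(paths):
    """Return the name of the path with the lowest summed link metric."""
    return min(paths, key=lambda name: sum(paths[name]))


# Hypothetical per-link metrics for the two candidate paths.
paths = {
    "codfw-eqiad + eqiad-esams": [100, 300],  # total cost 400
    "eqdfw-knams": [350],                     # total cost 350
}

print(preferred_path(paths))
```

With these example values the direct link wins. Raising its metric above 400 would move traffic back to the path via eqiad.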
Jul 24 2018
Dzahn awarded T169035: bast3002 sdb broken a Like token.
Jul 24 2018, 4:39 PM · SRE, ops-esams
Jul 16 2018
mark created T199676: Community Relations support for the 2018 data center switchover.
Jul 16 2018, 11:09 AM · CommRel-Specialists-Support (Jan-Mar-2019), Goal, User-Johan, SRE
Jul 11 2018
mark moved T177961: Upgrade LVS servers to stretch from Backlog to In Progress on the Pybal board.
Jul 11 2018, 2:46 PM · Patch-For-Review, Traffic, Pybal, SRE
mark moved T189290: Tune systemd journal rate limiting for PyBal from Backlog to In Progress on the Pybal board.
Jul 11 2018, 2:46 PM · Traffic-Icebox, Patch-For-Review, SRE, Pybal
mark closed T157786: Unhandled error stopping pybal: 'RunCommandMonitoringProtocol' object has no attribute 'checkCall' as Resolved.
This has been addressed in acdd0ebf74e5dd9e06c3216b9a93063ab8e91574
Jul 11 2018, 2:45 PM · Traffic, SRE, Pybal
mark moved T192437: Pybal support of configuration from the kubernetes API from Backlog to In Progress on the Pybal board.
Jul 11 2018, 2:42 PM · Traffic-Icebox, SRE, Prod-Kubernetes, Pybal
mark added a comment to T184293: rack/setup/install lvs101[3-6].
In T184293#4415745, @mark wrote:
On asw2-c-eqiad, interface-range LVS-balancer explicitly adds the private vlan, whereas on at least asw2-d-eqiad it does not. It probably doesn't matter since it also sets the "native vlan" id for the private vlan, but good to be aware of.
Jul 11 2018, 1:32 PM · SRE, Traffic
mark added a comment to T184293: rack/setup/install lvs101[3-6].
In T184293#4415691, @Vgutierrez wrote:
@ayounsi could you enable lvs1015 network ports? thanks!
Jul 11 2018, 1:19 PM · SRE, Traffic
mark added a comment to T184715: pybal's "can-depool" logic only takes downServers into account.
We had a long and interesting discussion about this on IRC.
Jul 11 2018, 12:51 PM · Traffic-Icebox, Pybal, SRE
Jul 10 2018
mark added a project to T196547: [Epic] Extension:JADE scalability concerns: DBA.
Jul 10 2018, 4:46 PM · Epic, DBA, Machine-Learning-Team (Active Tasks), User-Joe, SRE, Jade
Content licensed under Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA) unless otherwise noted; code licensed under GNU General Public License (GPL) or other open source licenses. By using this site, you agree to the Terms of Use, Privacy Policy, and Code of Conduct.