Stop using puppet + git pull for auto deployment of schema repos
Open, Low, Public

Assigned To

None

Authored By

	Ottomata
	Feb 16 2021, 4:03 PM

Description

We currently have 4 different hosts for schema.wikimedia.org, 2 in each main DC. We have 2 different git schema repositories, primary and secondary. These git repositories are deployed and updated by puppet using a simple git pull on each host.

Puppet runs at random times. If one schema host gets updated before another, a service that uses schemas might begin using a new schema from the updated host, while another service that requires the same schema is routed to a host on which that schema doesn't yet exist.

This just occurred and caused the produce_canary_events job to fail. A patch adding version 1.2.0 of a schema was merged and was apparently pulled on one of the schema hosts. The produce_canary_events job was routed to that host, saw the new 1.2.0 version of the schema, and produced a 1.2.0 canary event to eventgate-analytics-external. eventgate-analytics-external then tried to look up version 1.2.0, but was routed to a schema host that had not yet pulled it, so the event could not be produced. By the next run of produce_canary_events (15 minutes later), 1.2.0 existed everywhere and the event was produced successfully.

I'm not sure what deployment system is best to use for this these days. I really don't want to have to take manual steps to deploy schema changes from the secondary schema repository. We should sync with Release Engineering about this.

Related Objects

Mentioned In: T347421: [NEEDS GROOMING] schema services should be moved to k8s
T282832: Trusted-Contributors have +2 over schemas/event/secondary
T280017: Deploy schema repos to analytics cluster and use local uris for analytics jobs
Mentioned Here: T214158: Experiment with continuous deployment using Blubberoid

Event Timeline

Ottomata created this task. Feb 16 2021, 4:03 PM

Restricted Application added a subscriber: Aklapper. Feb 16 2021, 4:03 PM

I tagged RelEng here for advice.

I want a merge in gerrit to trigger a deployment of the repository, basically just a git pull on some hosts. I don't want users to have to log into deploy1001.eqiad.wmnet and run scap deploy. Do we have anything in place that can do this?

Thanks!

In T274901#6834014, @Ottomata wrote:

I tagged RelEng here for advice.

I want a merge in gerrit to trigger a deployment of the repository, basically just a git pull on some hosts. I don't want users to have to log into deploy1001.eqiad.wmnet and run scap deploy. Do we have anything in place that can do this?

Thanks!

Not currently. I think this is an ongoing discussion as part of the pipeline work. There was some movement in that direction with T214158: Experiment with continuous deployment using Blubberoid, but I believe there was a meeting about this last week, and the consensus is that pieces are still missing and it's not desirable for now. @jeena might be able to speak more about what's happening in this space.

thcipriani edited projects, added Release-Engineering-Team-TODO; removed Release-Engineering-Team. Feb 22 2021, 3:58 PM

thcipriani moved this task from Should be empty (use Release-Engineering-Team) to Watching/External on the Release-Engineering-Team-TODO board.

RhinosF1 subscribed. Mar 1 2021, 3:31 PM

razzi moved this task from Incoming to Event Platform on the Analytics board. Mar 4 2021, 5:36 PM

Ottomata mentioned this in T280017: Deploy schema repos to analytics cluster and use local uris for analytics jobs. Apr 13 2021, 12:44 PM

thcipriani edited projects, added Release-Engineering-Team (Radar); removed Release-Engineering-Team-TODO. Apr 20 2021, 3:33 AM

thcipriani moved this task from Limbo to Watching/External on the Release-Engineering-Team (Radar) board. Apr 20 2021, 3:34 AM

Ottomata mentioned this in T282832: Trusted-Contributors have +2 over schemas/event/secondary. May 14 2021, 9:08 PM

Ottomata triaged this task as Low priority. Jun 11 2021, 6:19 PM

Aklapper added a project: Data-Engineering. Feb 9 2023, 8:19 PM

EChetty edited projects, added Data-Engineering-Planning; removed Data-Engineering. Feb 10 2023, 12:38 PM

EChetty moved this task from Backlog to Event Platform on the Data-Engineering-Planning board. Feb 10 2023, 12:44 PM

I would suggest going in a slightly different direction than described in the task description. The race condition described is going to affect any deployment mechanism, including manual scap. A trivial example is that deployment might crash halfway through the process, leaving some servers with one version and others with another.

Instead of more tightly controlling the schema deployment process, you could make your produce_canary_events job more robust to heterogeneous cluster state. One alternative is to automatically deploy new schemas, but manually deploy changes to the producer job and include an explicit version number for the canary event schema. Another alternative is to monitor cluster state, either setting a dirty flag when different versions are deployed at once, or scanning all 4 x 2 endpoints to find the maximum supported version of the canary schema.
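The last alternative could be sketched roughly as follows. This is only an illustration under stated assumptions, not existing code: the names `parse_version` and `max_common_version`, the host names, and the input shape (a mapping from each schema endpoint to the versions it currently serves) are all hypothetical, and how that listing is fetched from each host is deliberately left out.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def max_common_version(
    versions_by_endpoint: Dict[str, Iterable[str]],
) -> Optional[str]:
    """Return the highest version that every endpoint serves, or None if there is none."""
    version_sets = [set(versions) for versions in versions_by_endpoint.values()]
    if not version_sets:
        return None
    # Only versions present on all endpoints are safe to produce against.
    common = set.intersection(*version_sets)
    if not common:
        return None
    return max(common, key=parse_version)


# Hypothetical cluster state in the middle of a rollout.
state = {
    "schema-host-a": ["1.0.0", "1.1.0", "1.2.0"],  # has already pulled 1.2.0
    "schema-host-b": ["1.0.0", "1.1.0"],           # has not pulled yet
}
print(max_common_version(state))  # 1.1.0
```

With this approach the canary job would keep producing 1.1.0 events until every endpoint serves 1.2.0, at the cost of one extra lookup per endpoint on each run.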

JArguello-WMF removed a project: Data-Engineering-Planning. Jun 29 2023, 9:51 PM

Restricted Application added a project: Data-Engineering. Jun 29 2023, 9:51 PM

JArguello-WMF moved this task from Incoming (new tickets) to Event Platform Backlog on the Data-Engineering board. Jun 29 2023, 10:33 PM

JArguello-WMF added a project: Data Engineering and Event Platform Team. Jun 30 2023, 4:31 PM

JArguello-WMF moved this task from Data Eng Backlog to Event Platform Backlog on the Data Engineering and Event Platform Team board. Jun 30 2023, 4:38 PM

gmodena mentioned this in T347421: [NEEDS GROOMING] schema services should be moved to k8s. Sep 27 2023, 11:55 AM

lbowmaker removed a project: Data Engineering and Event Platform Team. Nov 10 2023, 2:29 PM