
Stop using puppet + git pull for auto deployment of schema repos
Open, Low, Public

Description

We currently have 4 different hosts for schema.wikimedia.org, 2 in each main DC. We have 2 different git schema repositories, primary and secondary. These git repositories are deployed and updated by puppet using a simple git pull on each host.

Puppet runs at random times. If one schema host gets updated before another, a service might start using a new schema from the updated host, while another service that needs that schema ends up being routed to a host on which it doesn't yet exist.

This just occurred and caused the produce_canary_events job to fail. This patch was merged. It was apparently pulled on one of the schema hosts first. The produce_canary_events job was routed to that host, saw the new 1.2.0 version of the schema, and produced a new 1.2.0 canary event to eventgate-analytics-external. eventgate-analytics-external then tried to look up version 1.2.0, but was routed to a schema host that had not yet pulled this schema, so the event failed to be produced. By the time of the next produce_canary_events run (15 minutes later), 1.2.0 existed everywhere and the event was produced successfully.

I'm not sure what deployment system is best to use for this these days. I really don't want to have to take manual steps to deploy schema changes from the secondary schema repository. We should sync with Release Engineering about this.

Event Timeline

I tagged RelEng here for advice.

I want a merge in gerrit to trigger a deployment of the repository, basically just a git pull on some hosts. I don't want users to have to log into deploy1001.eqiad.wmnet and run scap deploy. Do we have anything in place that can do this?
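For illustration, a minimal sketch of what that could look like, assuming something like Gerrit's webhooks plugin (or any post-merge notification) can POST the event JSON to a small listener on each schema host. The repo path, port, and event field names here are assumptions, not an existing setup:

```
#!/usr/bin/env python3
"""Hypothetical sketch: run `git pull` on this schema host whenever a
Gerrit 'change-merged' notification arrives. Assumes a webhook-style
plugin POSTs event JSON to this endpoint; path and port are made up."""
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

REPO_PATH = "/srv/schema/repositories/primary"  # hypothetical checkout location

class MergeHook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        # Only react to merge events; ignore comments, patchset uploads, etc.
        if event.get("type") == "change-merged":
            subprocess.run(["git", "-C", REPO_PATH, "pull", "--ff-only"], check=False)
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8800), MergeHook).serve_forever()
```

Each host would still pull independently, so this is only the trigger, not a coordinated deploy across hosts.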

Thanks!

Not currently. I think this is an ongoing discussion as part of the pipeline work. There was some movement in that direction with T214158: Experiment with continuous deployment using Blubberoid, but I believe there was a meeting about this last week and the current consensus is that there are still missing bits / it's not desirable currently. @jeena might be able to speak more about what's happening in this space.

I would suggest going in a slightly different direction than described in the task description. The race condition described is going to affect any deployment mechanism, including manual scap. A trivial example is that deployment might crash halfway through the process, leaving some servers with one version and others with another.

Instead of more tightly controlling the schema deployment process, you could make your produce_canary_events job more robust to heterogeneous cluster state. One alternative is to automatically deploy new schemas, but manually deploy changes to the producer job and include an explicit version number for the canary event schema. Another alternative is to monitor cluster state, either setting a dirty flag when different versions are deployed at once, or scanning all 4 x 2 endpoints to find the maximum supported version of the canary schema.
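As a rough illustration of that last option, here is a sketch that asks every schema endpoint which versions of the canary schema it can serve and picks the newest version they all share. The host list, port, and path layout are assumptions about how a version listing might be exposed, not the actual schema.wikimedia.org API:

```
#!/usr/bin/env python3
"""Hypothetical sketch: find the newest canary schema version that every
schema host can serve, so produce_canary_events never references a version
some host hasn't pulled yet. Hosts and URL layout are assumptions."""
import json
import urllib.request

HOSTS = [  # hypothetical backends behind schema.wikimedia.org
    "http://schema1001.eqiad.wmnet:8190",
    "http://schema1002.eqiad.wmnet:8190",
    "http://schema2001.codfw.wmnet:8190",
    "http://schema2002.codfw.wmnet:8190",
]
SCHEMA_DIR = "/primary/jsonschema/test/event"  # hypothetical canary schema path

def versions_on(host):
    """Return the set of schema versions one host advertises.
    Assumes the host can list available versions as a JSON array."""
    with urllib.request.urlopen(host + SCHEMA_DIR + "/") as resp:
        return set(json.load(resp))

def semver_key(version):
    # Naive numeric sort key, e.g. "1.2.0" -> (1, 2, 0)
    return tuple(int(part) for part in version.split("."))

def max_common_version(hosts):
    """Highest version present on *all* hosts, or None if they share nothing."""
    common = set.intersection(*(versions_on(h) for h in hosts))
    return max(common, key=semver_key, default=None)

if __name__ == "__main__":
    print(max_common_version(HOSTS))
```

The same check could instead just set a dirty flag (and skip the canary run) whenever the hosts disagree, rather than computing the common maximum.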