We currently have 4 different hosts for schema.wikimedia.org, 2 in each main DC. We have 2 different git schema repositories, primary and secondary. These git repositories are deployed and updated by puppet using a simple git pull on each host.
Puppet runs at random times. If one schema host gets updated before another, a service that uses schemas might begin using a new schema from that host, while another service that requires the same schema is routed to a host on which that schema doesn't yet exist.
This just occurred and caused the produce_canary_events job to fail. This patch was merged and was apparently pulled on one of the schema hosts. The produce_canary_events job was routed to that host, saw the new 1.2.0 version of the schema, and produced a new 1.2.0 canary event to eventgate-analytics-external. eventgate-analytics-external then tried to look up version 1.2.0, but was routed to a schema host that had not yet pulled this schema, so the event failed to be produced. By the time of the next run of produce_canary_events (15 minutes later), 1.2.0 existed everywhere and the event was produced successfully.
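To make the race concrete, here is a minimal Python sketch of the check that would have caught it: ask every schema host individually whether it has a given schema version before relying on it. The host names, the schema URI, and the `has_schema` callback are all hypothetical placeholders. A real check would presumably query each host directly rather than going through the load-balanced name, and this is only an illustration, not a proposal for the actual deployment mechanism.

```python
from typing import Callable, Iterable, List


def hosts_missing_schema(
    hosts: Iterable[str],
    schema_uri: str,
    has_schema: Callable[[str, str], bool],
) -> List[str]:
    """Return the hosts on which schema_uri is not yet available."""
    return [host for host in hosts if not has_schema(host, schema_uri)]


# Hypothetical state partway through a rollout: one host has already
# pulled the new 1.2.0 schema, the other has not.
deployed = {
    "schema-a.example": {"/example/schema/1.1.0", "/example/schema/1.2.0"},
    "schema-b.example": {"/example/schema/1.1.0"},
}


def has_schema(host: str, schema_uri: str) -> bool:
    # Stand-in for a per-host lookup (an assumption, not a real API).
    return schema_uri in deployed[host]


missing = hosts_missing_schema(deployed, "/example/schema/1.2.0", has_schema)
if missing:
    print(f"1.2.0 is not yet available on: {missing}")
```

A producer that sees a non-empty result could keep using the previous version until all hosts agree. That only works around the problem; a deployment that updates all hosts together would remove the window entirely.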
I'm not sure what deployment system is best to use for this these days. I really don't want to have to take manual steps to deploy schema changes from the secondary schema repository. We should sync with Release Engineering about this.