Several pieces of ratelimiting logic in Traffic's VCL check whether a request originates from a large public cloud (for example Amazon EC2, Google Compute Engine, etc). Traffic/SRE maintains this in the private Puppet repo as a list of IP blocks named public_cloud_nets (built by combining the lists that cloud providers publish through various APIs).
It would be useful to have this Boolean available in the Analytics data (in the webrequest Hive table, in Turnilo, in the Kafka stream that ends up on centrallog1001, etc) for cursory glances at traffic sources, verifying that the ratelimiting logic is working, etc.
In this case it seems to make the most sense for Traffic to provide this tag directly: we don't want to have to keep Analytics's copy of a list of cloud nets or ASNs in sync with Traffic's, or to reason about the two lists diverging over time. What we really want to know is "did the cache categorize this request as public_cloud_nets at the time it processed it?" So we should just provide that answer directly.
So I suggest we add a new public_cloud key to the X-Analytics header. Following existing convention for Booleans, this can be set to a value of 1 iff true, and otherwise not present.
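To illustrate the intended semantics, here is a minimal Python model of the emission rule (the real change would live in VCL, not Python). It assumes X-Analytics is a semicolon-separated list of key=value pairs; the network ranges are documentation-only placeholders rather than the real public_cloud_nets contents, and the function names are hypothetical.

```python
import ipaddress

# Placeholder ranges taken from the reserved documentation blocks.
# The real public_cloud_nets list lives in the private Puppet repo.
PUBLIC_CLOUD_NETS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def ip_in_public_cloud(client_ip: str) -> bool:
    """Return True if client_ip falls inside any listed network."""
    addr = ipaddress.ip_address(client_ip)
    # A v4 address is never "in" a v6 network (and vice versa), so mixed
    # address families simply evaluate to False here.
    return any(addr in net for net in PUBLIC_CLOUD_NETS)


def add_public_cloud_key(x_analytics: str, client_ip: str) -> str:
    """Append public_cloud=1 iff the client matched; otherwise leave the header alone."""
    if not ip_in_public_cloud(client_ip):
        return x_analytics  # key is absent when false, per the Boolean convention
    if x_analytics:
        return x_analytics + ";public_cloud=1"
    return "public_cloud=1"
```

The point of the sketch is that the key is only ever present with the value 1, so consumers can treat a missing key as false.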
- Begin emitting the new key in analytics.inc.vcl.erb in sub analytics_deliver
- [x] Add support for the field in Druid's data load of webrequest, so it becomes visible in Turnilo (nothing to be done in the webrequest table, since x-analytics is a map)
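On the consuming side, deriving the Boolean from the header could look roughly like the Python sketch below. This is not the actual Druid ingestion spec or Hive query; it assumes the raw X-Analytics value is a semicolon-separated list of key=value pairs, and the helper names are made up for illustration.

```python
def parse_x_analytics(header: str) -> dict[str, str]:
    """Split a raw X-Analytics value into a key -> value map."""
    fields: dict[str, str] = {}
    for pair in header.split(";"):
        key, sep, value = pair.strip().partition("=")
        if key and sep:  # skip empty or malformed fragments
            fields[key] = value
    return fields


def request_is_public_cloud(header: str) -> bool:
    """True only when the key is present with value 1; absence means false."""
    return parse_x_analytics(header).get("public_cloud") == "1"
```

For example, `request_is_public_cloud("ns=0;https=1;public_cloud=1")` evaluates to true, while the same header without the key evaluates to false.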