
Cirrussearch: spike in pool counter rejections related to full_text and entity_full_text queries
Closed, Resolved · Public · 5 Estimated Story Points

Description

We're seeing ~10% of current requests being rejected by poolcounter: https://grafana-rw.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?editPanel=2&viewPanel=2&orgId=1&from=1623006099108&to=1623085934837

This is almost certainly due to eqiad.full_text query volume running at ~2x the expected level and eqiad.entity_full_text at ~30x the expected level: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=1623010929480&to=1623086544830

We saw an initial spike of ~2x eqiad.full_text query volume without a corresponding increase in eqiad.entity_full_text. During that spike there were [almost] no pool counter rejections, so the ~30x increase in eqiad.entity_full_text during the subsequent spike is likely what tipped us over the edge of our capacity. See https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=1623011239023&to=1623033287614 / https://grafana-rw.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=2&orgId=1&from=1623011239023&to=1623033287614 for just that initial spike.

Event Timeline

Reviewing the mediawiki_cirrussearch_requests event log for 2021-06-07 10:00-11:00, the first full hour in which requests were failing, I found that almost all the requests matching entity_full_text, our 30x-increase bucket, come from GAE IP address ranges (2600:1900:2000:*), and almost all include 'HeadlessChrome' in their user agent string.
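
For illustration, a minimal sketch (in PHP, using MediaWiki's Wikimedia\IPUtils helper) of the match criteria described above; the function name and the exact range are assumptions for this sketch, not the code that was actually used for the analysis:

```php
<?php
// Minimal sketch, not the actual analysis code: flag a request as suspect
// when it comes from the Google App Engine IPv6 range and carries a
// HeadlessChrome user agent.

use Wikimedia\IPUtils;

function looksLikeHeadlessGaeTraffic( string $clientIp, string $userAgent ): bool {
	// 2600:1900::/28 is the GAE range referenced later in this task.
	return IPUtils::isInRange( $clientIp, '2600:1900::/28' )
		&& strpos( $userAgent, 'HeadlessChrome' ) !== false;
}
```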

Change 698607 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] temp limit GAE/GCE traffic towards search API

https://gerrit.wikimedia.org/r/698607

Change 698607 merged by CDanis:

[operations/puppet@production] temp limit GAE/GCE traffic towards search API

https://gerrit.wikimedia.org/r/698607

Mentioned in SAL (#wikimedia-operations) [2021-06-07T19:19:53Z] <cdanis> T284479 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin -b16 'A:cp-text' "run-puppet-agent"

Mentioned in SAL (#wikimedia-operations) [2021-06-07T19:21:18Z] <ryankemper> T284479 [Cirrussearch] We're working on rolling out https://gerrit.wikimedia.org/r/698607, which will ban search API requests that match the Google App Engine IP range 2600:1900::0/28 AND whose user agent includes HeadlessChrome

Mentioned in SAL (#wikimedia-operations) [2021-06-07T19:25:40Z] <ryankemper> T284479 [Cirrussearch] Seeing the expected drop in entity_full_text requests here: https://grafana-rw.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=now-12h&to=now As a result we're no longer rejecting any requests

Mentioned in SAL (#wikimedia-operations) [2021-06-07T19:30:56Z] <ryankemper> T284479 [Cirrussearch] We'll keep monitoring; for now this incident is resolved. Our current volume matches what we'd expect. If we're accidentally banning any innocent requests, they must be an incredibly small percentage of the total; otherwise we'd see significantly lower volume than expected.

It seems we could use some way to prevent automated requests from causing interactive requests to fail. One option would be to subdivide the CirrusSearch-Search pool counter into two: one for typical requests and one for requests that look like automation. That would let automation cap out without affecting interactive users. The heuristic could check for common user agent substrings and/or check the IP address against ranges known for bulk requests. It shouldn't be too hard to implement entirely within CirrusSearch, but something like maintaining a list of CIDRs for public clouds is unlikely to happen, which means any list we keep will probably be updated reactively. Traffic has some more info at T270391#7131736, but they aren't maintaining a full list yet either. Overall I'm not a huge fan of keeping the 'request looks like automation' check in Cirrus, but if we could somehow leverage a more centralized definition this would be less reactive.
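
As a rough illustration of the "two pools" idea, here is a hedged sketch of what a split configuration could look like in wmf-config/PoolCounterSettings.php. The pool key names, limits, and client class are placeholders, not the values that were eventually deployed:

```php
<?php
// Hypothetical split of the CirrusSearch pool counter. All numbers are
// placeholders; 'PoolCounter_Client' stands in for whatever client class
// the rest of PoolCounterSettings.php uses.

// Pool for interactive (typical) search requests.
$wgPoolCounterConf['CirrusSearch-Search'] = [
	'class' => 'PoolCounter_Client',
	'timeout' => 15,   // seconds a request may wait for a worker slot
	'workers' => 432,  // concurrent searches allowed
	'maxqueue' => 600, // queued requests before rejecting outright
];

// Smaller, separate pool for requests that look like automation, so bulk
// traffic caps out here without starving the interactive pool above.
$wgPoolCounterConf['CirrusSearch-Automated'] = [
	'class' => 'PoolCounter_Client',
	'timeout' => 5,
	'workers' => 50,
	'maxqueue' => 100,
];
```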

Gehel triaged this task as Unbreak Now! priority. Jun 8 2021, 8:24 AM

Change 698849 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "temp limit GAE/GCE traffic towards search API"

https://gerrit.wikimedia.org/r/698849

Change 698849 merged by Ryan Kemper:

[operations/puppet@production] Revert "temp limit GAE/GCE traffic towards search API"

https://gerrit.wikimedia.org/r/698849

Mentioned in SAL (#wikimedia-operations) [2021-06-08T21:59:43Z] <ryankemper> T284479 Prior context: We put a block on a range of Google App Engine IPs yesterday to protect Cirrussearch from a bad actor; now we're going to try lifting the block and seeing if we're still getting slammed with traffic

Mentioned in SAL (#wikimedia-operations) [2021-06-08T22:03:49Z] <ryankemper> T284479 Successful puppet run on cp3052, proceeding to rest of A:cp-text: sudo cumin -b 15 'A:cp-text' 'run-puppet-agent -q'

Mentioned in SAL (#wikimedia-operations) [2021-06-08T22:09:27Z] <ryankemper> T284479 Puppet run complete across all of cp-text. Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=now-1h&to=now over the next few minutes to see if we see a large spike in full_text and entity_full_text queries

Mentioned in SAL (#wikimedia-operations) [2021-06-08T22:10:30Z] <ryankemper> T284479 Already starting to see a large upward spike in requests. Doing a quick sanity check to make sure this is out of the ordinary but I'll likely be putting the block back in place shortly

Mentioned in SAL (#wikimedia-operations) [2021-06-08T22:10:58Z] <ryankemper> T284479 Yup more than enough evidence of a strong upward spike now. Proceeding to revert

Change 698850 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "Revert "temp limit GAE/GCE traffic towards search API""

https://gerrit.wikimedia.org/r/698850

Change 698850 merged by Ryan Kemper:

[operations/puppet@production] Revert "Revert "temp limit GAE/GCE traffic towards search API""

https://gerrit.wikimedia.org/r/698850

Mentioned in SAL (#wikimedia-operations) [2021-06-08T22:15:47Z] <ryankemper> T284479 Successful puppet run on cp3052, proceeding to rest of A:cp-text: sudo cumin -b 19 'A:cp-text' 'run-puppet-agent -q'

Mentioned in SAL (#wikimedia-operations) [2021-06-08T22:21:56Z] <ryankemper> T284479 Block put back in place. We're back to expected traffic levels. We'll need a more granular mitigation in place before we can lift this block going forward.

MPhamWMF set the point value for this task to 5. Jun 9 2021, 3:40 PM

Change 699257 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Add pool counter for automated search requests

https://gerrit.wikimedia.org/r/699257

Change 699311 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add pool counter for automated requests

https://gerrit.wikimedia.org/r/699311

Change 699257 merged by jenkins-bot:

[operations/mediawiki-config@master] Add pool counter for automated search requests

https://gerrit.wikimedia.org/r/699257

Mentioned in SAL (#wikimedia-operations) [2021-06-21T18:30:57Z] <urbanecm@deploy1002> Synchronized wmf-config/PoolCounterSettings.php: af61f1a9ffc4703ad9514eac96655c3dd9d491d8: Add pool counter for automated search requests (T284479) (duration: 00m 59s)

Change 699311 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add pool counter for automated requests

https://gerrit.wikimedia.org/r/699311

Aklapper added a subscriber: dcausse.

@RKemper, @dcausse: This "Unbreak Now" priority ticket has not seen updates for five weeks. All linked patches have been merged.
Is there more to do here? If yes, should the priority be lowered? If not, can the task status be set to resolved?
Thanks in advance.

dcausse lowered the priority of this task from Unbreak Now! to High. Aug 2 2021, 7:32 AM

Lowering to High to reflect reality. The new config option added to MediaWiki should help throttle automated requests like the ones we saw here if this happens again in the future.
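
For reference, a hedged sketch of how such an option might be wired up on a wiki. The option names below are assumptions based on the patch descriptions earlier in this task, not confirmed names; check the merged CirrusSearch change for the actual settings and semantics:

```php
<?php
// Hypothetical wiki configuration snippet. The option names are assumptions
// modeled on the patch descriptions above; verify against the merged
// CirrusSearch change before copying.

// Requests whose user agent matches this regex would be treated as
// automation and routed to the dedicated pool counter.
$wgCirrusSearchAutomationUserAgentRegex = '/HeadlessChrome/';

// Requests from these CIDR ranges would likewise be treated as automation.
$wgCirrusSearchAutomationCIDRs = [
	'2600:1900::/28', // Google App Engine range seen in this incident
];
```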