limit the impact of heavy/large graphite queries
Closed, Resolved (Public)

Description

As discovered in a recent Graphite outage, heavy/large queries can occupy all uwsgi workers, resulting in 502s. We should find a way to limit the impact of such queries, ideally with timeouts at the graphite-web level.

Event Timeline

fgiunchedi raised the priority of this task to Medium.
fgiunchedi updated the task description.
fgiunchedi added projects: SRE, observability.
fgiunchedi subscribed.

See also T155872 ("graphite1003 short of available RAM") for a case where heavy queries did not impact uwsgi but instead caused carbon-cache to use a lot of memory.

Change 494620 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] add uwsgi worker timeouts + max RSS for graphite

https://gerrit.wikimedia.org/r/494620

Change 494620 merged by CDanis:
[operations/puppet@production] graphite: uwsgi workers: set timeouts + max RSS

https://gerrit.wikimedia.org/r/494620
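
For readers unfamiliar with the knobs involved, here is a rough sketch of the kind of uwsgi settings a change like this might produce. The option names `harakiri` (per-request time limit) and `reload-on-rss` (recycle a worker above a resident-memory threshold, in megabytes) are standard uwsgi options, but which options the patch actually sets and the values used below are assumptions, not taken from the merged change. `render_uwsgi_ini` and `GRAPHITE_WORKER_LIMITS` are hypothetical names for illustration.

```python
import configparser
import io


def render_uwsgi_ini(settings):
    """Render a dict of uwsgi options as an ini-style [uwsgi] section."""
    parser = configparser.ConfigParser()
    parser["uwsgi"] = {name: str(value) for name, value in settings.items()}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Assumed values for illustration only; not copied from the actual patch.
GRAPHITE_WORKER_LIMITS = {
    "harakiri": 60,         # kill a worker stuck on one request for over 60 s
    "reload-on-rss": 1024,  # recycle a worker once its RSS exceeds 1024 MB
}

print(render_uwsgi_ini(GRAPHITE_WORKER_LIMITS))
```

With limits like these, a single expensive render request can tie up a worker only for the duration of the timeout, after which the worker becomes available again.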

CDanis claimed this task.
CDanis subscribed.

Just saw the new timeout work -- a query returned a 500 status after ~60 seconds. Boldly going to call this resolved; of course, reopen if there's still more to be done.
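
If anyone wants to repeat that check, a minimal sketch along these lines could time a request and flag a result that looks like the worker timeout firing. The 60-second limit matches the observation above; the 5-second slack and the helper names `timed_status` and `looks_like_worker_timeout` are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def timed_status(url, client_timeout=120.0):
    """Return (HTTP status, elapsed seconds) for a GET request to url.

    Connection-level failures (urllib.error.URLError) are not caught here.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=client_timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        status = error.code
    return status, time.monotonic() - start


def looks_like_worker_timeout(status, elapsed, limit=60.0, slack=5.0):
    """True if a 5xx response arrived roughly when the worker limit expired."""
    return status >= 500 and abs(elapsed - limit) <= slack
```

Calling `timed_status` against a deliberately expensive render URL and passing the result to `looks_like_worker_timeout` should return True when the timeout is in effect; a fast 200, or a 5xx that arrives almost immediately, would not match.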