
Provide Speed & Function rough numbers for our current Gerrit web traffic
Closed, ResolvedPublic

Description

From meeting notes earlier today:

Q: Need anything that can reproduce traffic in POC environment - access logs? Is there anything about the traffic we can provide to S&F
A: Run clone in a loop to get git pull traffic
A from S&F: will assume 25-30 requests per second for POC environment
A: Web requests - standard browsing stuff... We'll look at this - post vs get requests - What's API, what's POSTs, what's GETs

Separating API from frontend traffic may be difficult, since the frontend talks to the same API, but a rough guess on POSTs vs. GETs should be available from Apache logs.

Event Timeline

So, doing some rough estimation here:

#!/bin/bash

# Summarize GET/POST traffic to Gerrit for a 2 week window.
#
# Assumptions:
#
#   - x-git-upload-pack-advertisement and x-git-upload-pack-result indicate
#     a fetch/pull
#
#   - probably hard to separate API traffic here since front-end UI uses the
#     same API

LOGROOT=/var/log/apache2

sudo touch ~/access.log
sudo bash -c "zcat $LOGROOT/gerrit.wikimedia.org.https.access.log.{15..2}.gz > /home/brennen/access.log"

printf 'Start\t\t'
sudo head -1 ~/access.log | cut -f1

printf 'GET\t\t'
sudo grep -c GET ~/access.log

printf 'GET no fetch\t'
sudo grep -v x-git-upload-pack ~/access.log | grep -c GET

printf 'POST\t\t'
sudo grep -c POST ~/access.log

printf 'POST no fetch\t'
sudo grep -v x-git-upload-pack ~/access.log | grep -c POST

printf "End\t\t"
sudo tail -1 ~/access.log | cut -f1

sudo rm ~/access.log

Running this on the primary machine, gerrit1001, gives:

brennen@gerrit1001:~$ bash ~/counts.sh
Start           2021-02-09T00:00:06
GET             7882907
GET no fetch    4986965
POST            4822877
POST no fetch   13682
End             2021-02-23T00:00:21

Running this on the replica, gerrit2001, gives:

brennen@gerrit2001:~$ bash ~/counts.sh 
Start           2021-02-09T00:00:03
GET             4440132
GET no fetch    15143
POST            4193437
POST no fetch   124
End             2021-02-23T00:00:01

Note that, once the upload-pack traffic is excluded, very few requests remain. That makes sense as I understand it, since the replica mostly serves automated traffic rather than human users.

So on the primary box, I figure roughly:

  • GETs: 6.52/s
  • GETs, not including upload-pack: 4.12/s
  • POSTs: 3.99/s
  • POSTs, not including upload-pack: 0.01/s
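
For reference, those rates are just the raw counts divided by the seconds in the 14-day window (14 × 86400 = 1,209,600). A quick awk check:

```shell
# Rates = counts / seconds in the 2021-02-09..2021-02-23 window.
awk 'BEGIN {
  window = 14 * 86400                         # 1209600 seconds
  printf "GETs:                  %.2f/s\n", 7882907 / window
  printf "GETs, no upload-pack:  %.2f/s\n", 4986965 / window
  printf "POSTs:                 %.2f/s\n", 4822877 / window
  printf "POSTs, no upload-pack: %.2f/s\n", 13682   / window
}'
```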

We've got a bunch of public Grafana dashboards for Gerrit as well. Probably of interest:

cc: @OlyKalinichenkoSpeedAndFunction, @Eugene.chernov for Speed & Function - didn't find a Phabricator account for Sergey.

Also cc: @thcipriani in case skimming this reveals anything I'm wrong about.

greg triaged this task as Medium priority.Feb 24 2021, 5:36 PM
greg moved this task from Backlog to In Progress on the GitLab (Initialization) board.

Hi, I am the one more or less in charge of our Gerrit instance, and I have some experience from past scaling troubles we have had.

+1 what @brennen wrote regarding the metrics (excellent digging).

The traffic comes from:

upload-pack over https or ssh, both captured on https://grafana.wikimedia.org/d/EV4ZCjEWz/git-fetch-clone-upstream?orgId=1&refresh=1m

We have several bots crawling all the repositories on a schedule, and we have moved them to a Gerrit replica. But I think we can ignore that; a better system would be to replicate the repositories in real time to the apps that require them instead of hammering the Gerrit replica. CI is the biggest fetcher (over https), mostly via shallow fetches (git fetch --depth 2); that is taken into account in @brennen's stats above.
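
For anyone reproducing this in a POC, a shallow fetch like the ones CI issues can be demonstrated locally. The throwaway repo below is illustrative; against Gerrit the URL would be the usual https remote for the repository.

```shell
# Build a throwaway repo with three commits, then clone it shallowly,
# the way CI limits history depth on each fetch.
tmp=$(mktemp -d)
git init -q "$tmp/src"
for n in one two three; do
  git -C "$tmp/src" -c user.name=ci -c user.email=ci@example.org \
    commit -q --allow-empty -m "$n"
done
# --depth 2 transfers only the two most recent commits:
git clone -q --depth 2 "file://$tmp/src" "$tmp/dst" 2>/dev/null
git -C "$tmp/dst" rev-list --count HEAD   # prints 2
rm -rf "$tmp"
```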

Gerrit holds repository pack files in memory (the JGit block cache), which dramatically speeds up operations. Roughly 4 GB is in use right now (https://grafana.wikimedia.org/d/8YPId9hGz/jgit-block-cache), and Gerrit has various other caches to avoid hitting the disk (delta diffs, tags, commit metadata, etc.).
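
That cache is tunable in gerrit.config. The option names below are from the Gerrit documentation; the values are illustrative, not our actual production settings:

```
[core]
  # Memory budget for the JGit block cache (pack file blocks)
  packedGitLimit = 4g
  # Read-window size for pack file access
  packedGitWindowSize = 64k
  # Maximum number of simultaneously open pack files
  packedGitOpenFiles = 4096
```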

Twist: we would most probably need support for Git protocol v2, which saves git from sending every single reference in a repository to clients on each fetch (J199). But I digress.
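
On the client side, protocol v2 can be requested per command; with v2 the server advertises only the refs matching the prefixes the client asks for, instead of the full ref list. A local demonstration (throwaway repo; against Gerrit the URL would be the usual https remote):

```shell
# With protocol.version=2, ls-remote's ref patterns are applied
# server-side rather than after a full ref advertisement.
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" -c user.name=t -c user.email=t@example.org \
  commit -q --allow-empty -m init
git -C "$tmp/repo" tag v1
# Only refs under refs/heads/ come back; refs/tags/v1 is filtered out:
git -c protocol.version=2 ls-remote "file://$tmp/repo" 'refs/heads/*'
rm -rf "$tmp"
```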

For the web frontend traffic it is hard to know; I would guess 25-30 requests per second is large enough for a POC. There is probably a lot of material that could eventually be cached at the edge (Varnish/ATS), though that is probably out of scope for the POC.

@Sergey.Trofimovsky.SF - this is followup from questions you raised during our first meeting the other day. Please let us know if anything else would be useful here.

Thank you, guys! That should be enough to start working on load tests, I'm sure we'll come up with more follow up questions on the way.


Awesome! Let us know!

Reopening this for a request from @Sergey.Trofimovsky.SF:

GitLab's resource usage for checkouts/clones seems to depend on repo size, which is why we want to include more realistically sized repos in the testing mix. Just verifying: is it possible to provide a general idea of the distribution of repo sizes, ideally accompanied by request counts? Thanks!

I think I can probably get a rough number for that by:

  1. Taking a look at sizes of all repos
  2. Cross-referencing repo names with http traffic? And maybe upload-pack operations generally?
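
A minimal sketch for step 1, assuming the bare repos live under a single directory. GIT_ROOT below is an assumption, not a confirmed path; adjust to the actual Gerrit site dir.

```shell
# List bare repositories by on-disk size, largest first.  Repos can be
# nested (e.g. mediawiki/core.git), so walk the tree rather than globbing.
GIT_ROOT=${GIT_ROOT:-/srv/gerrit/git}   # assumed path
find "$GIT_ROOT" -type d -name '*.git' -prune -exec du -sk {} + 2>/dev/null \
  | sort -nr | head -25
```
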
brennen moved this task from In Progress to Done on the GitLab (Initialization) board.

Got this resolved in chat.

Log slicing, for future reference:

#!/bin/bash

# Rank repositories by git-upload-pack requests over HTTP.  The access
# log lines are "@cee: "-prefixed JSON: strip the 6-character prefix,
# pull out the URL path, and reduce it to the repository name.
grep '^@cee' access.log | \
  sed 's/^......//' | \
  jq -r '."url.path"' | \
  grep 'upload-pack' | \
  sed 's/^\/r\/\(.*\)\/git-upload-pack$/\1/' | \
  sort | uniq -c | sort -nr | \
  tee by-repo-http.log

# Same ranking for SSH: field 5 of these sshd_log lines holds the
# git-upload-pack argument (the repository).
cd /var/log/gerrit
zcat sshd_log.2*.gz | \
  grep git-upload-pack | \
  awk '{ print $5 }' | \
  sort | uniq -c | sort -nr | \
  tee ~/by-repo-ssh.log
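
The two per-repo rankings can then be merged into one list, e.g. (file names from the scripts above; the repo-name formats in the two logs may need normalizing first):

```shell
# Sum HTTP and SSH upload-pack counts per repo and re-rank.
cat by-repo-http.log by-repo-ssh.log \
  | awk '{ counts[$2] += $1 } END { for (r in counts) print counts[r], r }' \
  | sort -nr | tee by-repo-combined.log | head -25
```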