
Request increased quota for integration Cloud VPS project
Closed, Resolved (Public)

Description

Project Name: integration
Type of quota increase requested: disk IO rate limit
Reason:

The integration project hosts the Docker daemons used for the CI workflow. Some builds create a lot of files on disk, which take a fairly long time (minutes) to reap when the container is deleted. This was first noticed immediately after the migration to Ceph in October 2020.

The default WMCS limits are:

quota:disk_read_iops_sec='5000'
quota:disk_total_bytes_sec='200000000'
quota:disk_write_iops_sec='500'
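These per-flavor limits are applied as flavor extra specs. A hedged sketch of how an admin would set them with the OpenStack CLI (the flavor name below is illustrative, not the real one):

```shell
# Sketch: apply the default WMCS disk IO limits as flavor extra specs.
# The flavor name is illustrative; substitute the actual flavor.
openstack flavor set \
  --property quota:disk_read_iops_sec='5000' \
  --property quota:disk_total_bytes_sec='200000000' \
  --property quota:disk_write_iops_sec='500' \
  g2.cores8.ram24.disk80
```

Nova passes these extra specs down to libvirt/QEMU as per-disk throttling settings on the instance.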

@Andrew created a new flavor for us with 4 times those limits (I don't have access to the exact values).

We still see disk slowness, noticeable whenever doing heavy write operations. From a conversation with @aborrero
and @dcaro this morning, it seems the limits are easy to raise and there is room to raise them, hence this task.

I don't know which limits would be appropriate. Last time, I created a Grafana board showing IO latency and operations per second, which can help track progress: https://grafana-labs.wikimedia.org/d/Yj81kH2Gk/cloud-project-io-metrics

We can probably migrate half of the instances and compare how things improve.

From T266777#6598396, there is another QEMU parameter that allows burst limits:

iops_max=bm,iops_rd_max=rm,iops_wr_max=wm
Specify bursts in requests per second, either for all request types or for reads or writes only. Bursts allow the guest I/O to spike above the limit temporarily.

That one is available in OpenStack Rocky / Nova 18.0.0+ and is exposed as disk_write_iops_sec_max.

It may be worth investigating on top of the existing limits.
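A hedged sketch of adding such burst settings on top of an existing flavor (values and flavor name are illustrative):

```shell
# Sketch: allow write IOPS to burst above the steady-state limit.
# disk_write_iops_sec_max is the burst ceiling; _max_length is the
# burst duration in seconds. Values and flavor name are illustrative.
openstack flavor set \
  --property quota:disk_write_iops_sec_max='6000' \
  --property quota:disk_write_iops_sec_max_length='10' \
  g2.cores8.ram24.disk80.4xiops
```

Note that flavor extra specs only take effect on instances launched with (or resized to) the flavor, which is why a stop/start of existing VMs may be needed.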

Event Timeline

Repeating here:
I've added a burst configuration of 3x the iops for 10s to that flavor; it might require stopping and starting (not restarting) the VMs:

root@cloudvirt1024:~# openstack --os-project-id integration flavor show g2.cores8.ram24.disk80.4xiops
+----------------------------+--------------------------------------------------------------------+
| Field                      | Value                                                              |
+----------------------------+--------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | True                                                               |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                  |
| access_project_ids         | integration                                                        |
| disk                       | 80                                                                 |
| id                         | c86647d3-d157-43fa-9849-2bd16ef27a5f                               |
| name                       | g2.cores8.ram24.disk80.4xiops                                      |
| os-flavor-access:is_public | False                                                              |
| properties                 | aggregate_instance_extra_specs:ceph='true',                        |
|                            | quota:disk_read_iops_sec='20000',                                  |
|                            | quota:disk_total_bytes_sec='800000000',                            |
|                            | quota:disk_write_iops_sec='2000',                                  |
|                            | quota:disk_write_iops_sec_max='60000',                             |
|                            | quota:disk_write_iops_sec_max_length='10'                          |
| ram                        | 24576                                                              |
| rxtx_factor                | 1.0                                                                |
| swap                       |                                                                    |
| vcpus                      | 8                                                                  |
+----------------------------+--------------------------------------------------------------------+

Can you try that and report back if there was any change?
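One way to check from inside a VM whether the burst takes effect is a short fio write test. A sketch, assuming fio is installed (job parameters are just an example):

```shell
# Sketch: measure sustained random-write IOPS for ~30s on a scratch file.
# With a 10s burst window, reported IOPS should start high and then
# settle at the steady-state limit; compare against the flavor quota.
fio --name=burst-test --filename=/tmp/fio-test --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=32 --runtime=30 --time_based --group_reporting
rm -f /tmp/fio-test
```

Running it once right after boot (burst credit available) and once during sustained writes should make the burst window visible.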

Mentioned in SAL (#wikimedia-releng) [2022-01-14T13:49:34Z] <hashar> Restarting all CI Docker agents via Horizon to apply new flavor settings T265615 T299211

aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Approved on the Cloud-VPS (Quota-requests) board.

Looks like the burst quota addition ( quota:disk_write_iops_sec_max='6000', quota:disk_write_iops_sec_max_length='10' ) has helped. I will report back / reopen if we need to tune it further. Thank you for acting quickly.

From our discussion earlier today, I will investigate whether we can use tmpfs to avoid useless disk IO (since CI discards everything once the build has completed).
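A hedged sketch of the tmpfs idea for Docker-based builds, so throwaway build artifacts stay in RAM instead of hitting the Ceph-backed disk (the mount point, size, and image name are assumptions for illustration):

```shell
# Sketch: back the container's build workspace with tmpfs so that
# short-lived build files never touch disk. /workspace, the 2g size,
# and the image name are illustrative.
docker run --rm \
  --tmpfs /workspace:rw,size=2g \
  some-ci-image:latest \
  sh -c 'df -h /workspace'
```

The trade-off is RAM pressure: tmpfs consumes memory (and possibly swap) up to its size limit, so it only helps for builds whose scratch output fits comfortably in the instance's 24 GB of RAM.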