
Investigate/prototype ceph backup options
Closed, ResolvedPublic

Description

In the near-term we're only going to put truly disposable 'cattle' instances on ceph. In the meantime, though, we should come up with some sort of backup/restore process.

It's true that we currently have no backups for VMs, but our current failure case is losing one hypervisor's worth of VMs, whereas with ceph we now run the risk of losing the whole cloud if ceph freaks out.

Quick summary of most recent conversation:

  • We probably want to use Backy2 for this. We might also use Benji; it has fancier compression but is a younger project.
  • For proof-of-concept (and possibly near-term production) we'll use cloudstore1008/9.
    • For full-scale backups we probably need new hardware, but will learn more about storage needs as we go.
  • Some users (e.g. https://www.reddit.com/r/ceph/comments/61nmfv/how_is_anyone_doing_backups_on_cephrbd/) have had trouble with Ceph freezing when capturing snapshots for backup.
    • For starters we're going to hope that that isn't a problem for us; if it is then we'll have to consider creating a mirrored cluster just for backup purposes.
      • Possibly that mirror can have only one replica rather than three, which might push it into affordability

For the first round of tests/experiments, I'd like to answer these questions:

  • Does the upstream backy .deb install on Buster?
  • Can we do this using local storage on cloudstores, or do we need it on NFS?
  • What are some rough numbers for how big a backup image is, relative to initial VM size?
    • Same question for incremental backups
  • Does Ceph misbehave for our users during the backup process?

Details

Repo               Branch      Lines +/-
operations/puppet  production  +4 -7
operations/puppet  production  +1 -1
operations/puppet  production  +34 -15
operations/puppet  production  +18 -1
operations/puppet  production  +2 -0
operations/puppet  production  +26 -0
operations/puppet  production  +23 -9
operations/puppet  production  +8 -3
operations/puppet  production  +2 -2
operations/puppet  production  +233 -8
operations/puppet  production  +13 -5
operations/dns     master      +4 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +70 -29
operations/puppet  production  +8 -0
operations/puppet  production  +10 -5
operations/puppet  production  +8 -0
operations/puppet  production  +14 -5
operations/puppet  production  +5 -3
operations/puppet  production  +397 -0

Event Timeline

This is the slide deck from OVH at FOSDEM about how they ended up with ceph backing up to ceph: https://archive.fosdem.org/2018/schedule/event/backup_ceph_at_scale/attachments/slides/2671/export/events/attachments/backup_ceph_at_scale/slides/2671/slides.pdf

It's good for reference because it describes their successes and failures in multiple backup system attempts.

There are some (largely unhelpful) recent discussions about that talk, as well as a link to the video, here: https://www.reddit.com/r/ceph/comments/cznqoz/ceph_whole_cluster_backuprestore/

The long and short of it is that some people are saving a bit by backing up via radosgw, which doesn't give you as fast a hot-swap cluster, but there it is. This also helps emphasize the not-that-great state of backups in ceph.

Also, if any solution requires a backup daemon set up by us, and we don't want to just write a python service, we could do something like what bacula does for application-specific backups (http://wiki.bacula.org/doku.php?id=application_specific_backups) and use bacula, since that's already a thing at the foundation, right?

Andrew triaged this task as High priority.Jul 30 2020, 3:23 PM

Change 617841 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add Backy2 module and profile

https://gerrit.wikimedia.org/r/617841

Sorry for just dropping the message, but I thought it might be interesting.
In that thread they also point out some other talks at Cephalocon that might be interesting too. This one compares several ways of doing backups: https://static.sched.com/hosted_files/cephalocon2019/58/ceph2ceph-presentation169.pdf
From: https://ceph.io/cephalocon/barcelona-2019/

Change 617841 merged by Andrew Bogott:
[operations/puppet@production] Add Backy2 module and profile

https://gerrit.wikimedia.org/r/617841

Change 618842 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] backy2: fix up some dependency issues in install

https://gerrit.wikimedia.org/r/618842

Change 618842 merged by Andrew Bogott:
[operations/puppet@production] backy2: fix up some dependency issues in install

https://gerrit.wikimedia.org/r/618842

Change 618849 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618849

Change 618853 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618853

Change 618854 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Retool cloudvirt1004 and cloudvirt1006 as ceph/backy2 test hosts

https://gerrit.wikimedia.org/r/618854

Change 618849 abandoned by Andrew Bogott:
[operations/puppet@production] Introduce role::wmcs::ceph::backup

Reason:

https://gerrit.wikimedia.org/r/618849

Change 618853 merged by Andrew Bogott:
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618853

Change 618854 merged by Andrew Bogott:
[operations/puppet@production] Retool cloudvirt1004 and cloudvirt1006 as ceph/backy2 test hosts

https://gerrit.wikimedia.org/r/618854

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008062058_andrew_22313.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008062120_andrew_10703.log.

Completed auto-reimage of hosts:

['cloudvirt1006.eqiad.wmnet', 'cloudvirt1004.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1006.eqiad.wmnet', 'cloudvirt1004.eqiad.wmnet']

Change 618875 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: add a bunch of keys needed for the ceph client config

https://gerrit.wikimedia.org/r/618875

Change 618875 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: add a bunch of keys needed for the ceph client config

https://gerrit.wikimedia.org/r/618875

Change 618876 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph: split out rbd client profiles

https://gerrit.wikimedia.org/r/618876

Change 618876 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph: split out rbd client profiles

https://gerrit.wikimedia.org/r/618876

Change 618878 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: remove reference to 'nova'

https://gerrit.wikimedia.org/r/618878

Change 618878 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: remove reference to 'nova'

https://gerrit.wikimedia.org/r/618878

Change 618879 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backup: remove another nova-specific ref

https://gerrit.wikimedia.org/r/618879

Change 618879 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backup: remove another nova-specific ref

https://gerrit.wikimedia.org/r/618879

Change 618995 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Added ipv6 addresses for cloudvirt1004 and cloudvir1006

https://gerrit.wikimedia.org/r/618995

Change 618995 merged by Andrew Bogott:
[operations/dns@master] Added ipv6 addresses for cloudvirt1004 and cloudvir1006

https://gerrit.wikimedia.org/r/618995

In order to stand up the initial mysql db, we need to apply this by hand before running initdb:

https://github.com/wamdam/backy2/pull/32/commits/589baa5d24abe0f88a8c430d66513386d83f4b13

Good grief. At least python doesn't need a recompile?
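
For future reference, a rough sketch of that by-hand step (GitHub serves any PR as a plain patch at .../pull/<N>.patch; the install path below is an assumption about where the upstream .deb puts the code):

```
# Hedged sketch of the manual step: fetch the PR as a patch, apply it to the
# installed backy2 sources, then run initdb. The dist-packages path is an
# assumption, not something taken from the actual package layout.
import subprocess

PATCH_URL = "https://github.com/wamdam/backy2/pull/32.patch"  # GitHub's .patch view of the PR
INSTALL_DIR = "/usr/lib/python3/dist-packages"                # assumed install location

patch_text = subprocess.run(
    ["curl", "-sL", PATCH_URL], check=True, capture_output=True, text=True
).stdout
subprocess.run(
    ["patch", "-p1", "-d", INSTALL_DIR], input=patch_text, text=True, check=True
)
subprocess.run(["backy2", "initdb"], check=True)
```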

Change 619011 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/backy2/ceph: add admin keyring so backy can access things

https://gerrit.wikimedia.org/r/619011

Change 619011 merged by Andrew Bogott:
[operations/puppet@production] wmcs/backy2/ceph: add admin keyring so backy can access things

https://gerrit.wikimedia.org/r/619011

Change 619350 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances

https://gerrit.wikimedia.org/r/619350

Change 619350 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances

https://gerrit.wikimedia.org/r/619350
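
To give a sense of what the backup pass looks like, here's a minimal Python sketch of the kind of loop a script like wmcs-backup-instances runs: snapshot each RBD image in the compute pool, hand the snapshot to backy2, then drop the snapshot. The pool name, snapshot naming, and image-listing step are assumptions for illustration, not the actual script.

```
#!/usr/bin/env python3
"""Hedged sketch of a per-image ceph -> backy2 backup pass.

Assumptions (not taken from the real wmcs-backup-instances script): the VM
disks live in an RBD pool called 'eqiad1-compute', and a fresh snapshot per
run is good enough (no rbd-diff based differential handling).
"""
import datetime
import subprocess

POOL = "eqiad1-compute"  # assumed pool name


def rbd_images(pool):
    """Return the list of RBD image names in a pool."""
    out = subprocess.run(
        ["rbd", "ls", pool], check=True, capture_output=True, text=True
    )
    return out.stdout.split()


def backup_image(pool, image):
    """Snapshot an image and hand the snapshot to backy2."""
    snap = datetime.datetime.utcnow().strftime("backup-%Y%m%d%H%M%S")
    subprocess.run(["rbd", "snap", "create", f"{pool}/{image}@{snap}"], check=True)
    try:
        # backy2's documented form: backy2 backup <source> <name>
        subprocess.run(
            ["backy2", "backup", f"rbd://{pool}/{image}@{snap}", image], check=True
        )
    finally:
        subprocess.run(["rbd", "snap", "rm", f"{pool}/{image}@{snap}"], check=True)


if __name__ == "__main__":
    for image in rbd_images(POOL):
        backup_image(POOL, image)
```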

Change 619486 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy fix name of backup script

https://gerrit.wikimedia.org/r/619486

Change 619486 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy fix name of backup script

https://gerrit.wikimedia.org/r/619486

With a very limited sample set, it takes about 20 minutes per VM to run a backup with the current setup. There's no noticeable performance improvement for incremental backups.

I don't know what the bottleneck is, but I'm guessing network throttling.

Change 620102 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy2 specify expiration for backups

https://gerrit.wikimedia.org/r/620102

Change 620103 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day

https://gerrit.wikimedia.org/r/620103

Change 620106 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Remove backup role from cloudvirt1004

https://gerrit.wikimedia.org/r/620106

Change 620106 merged by Andrew Bogott:
[operations/puppet@production] Remove backup role from cloudvirt1004

https://gerrit.wikimedia.org/r/620106

Change 620102 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy2 specify expiration for backups

https://gerrit.wikimedia.org/r/620102

Change 620103 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day

https://gerrit.wikimedia.org/r/620103

Change 620107 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy2: use 'root' user to run backups

https://gerrit.wikimedia.org/r/620107

Change 620107 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy2: use 'root' user to run backups

https://gerrit.wikimedia.org/r/620107

Change 620110 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy2: move our cleanup logic into a script

https://gerrit.wikimedia.org/r/620110

Change 620110 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy2: move our cleanup logic into a script

https://gerrit.wikimedia.org/r/620110

I have daily backups running now on cloudvirt1006. Right now the backups will expire after three days; next week I'll try restoring from an automatically-created backup.
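
For reference, that restore test would look roughly like the sketch below; backy2's documented form is 'backy2 restore <version_uid> <target>', and the uid and target image here are placeholders, not real values from the cluster.

```
# Hedged example of a restore from an automatically-created backup.
import subprocess

version_uid = "c0ffee..."  # placeholder; the real uid comes from 'backy2 ls'
target = "rbd://eqiad1-compute/restored-test-image"  # assumed pool/image name

subprocess.run(["backy2", "restore", version_uid, target], check=True)
```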

Change 620758 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-backup-instances: add a dict of regexps to exclude servers from backup

https://gerrit.wikimedia.org/r/620758

Change 620758 merged by Andrew Bogott:
[operations/puppet@production] wmcs-backup-instances: add a dict of regexps to exclude servers from backup

https://gerrit.wikimedia.org/r/620758
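
To illustrate how that exclusion dict can work, here's a minimal sketch; the project names and regexps below are made-up examples, not the real exclusion list.

```
import re

# Hypothetical exclusion map: project name -> regexps of instance names that
# should be skipped because the VMs are easy to rebuild from scratch.
EXCLUDED_SERVERS = {
    "toolsbeta": [r".*"],                # example: skip a whole project
    "tools": [r"^tools-k8s-worker-.*"],  # example: skip disposable k8s workers
}


def should_backup(project, server_name):
    """Return False if any exclusion regexp for the project matches the name."""
    for pattern in EXCLUDED_SERVERS.get(project, []):
        if re.match(pattern, server_name):
            return False
    return True


assert should_backup("tools", "tools-sgebastion-07")
assert not should_backup("tools", "tools-k8s-worker-30")
```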

Change 620939 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] backy2: permit cleanup of images after 3 days

https://gerrit.wikimedia.org/r/620939

Change 620939 merged by Andrew Bogott:
[operations/puppet@production] backy2: permit cleanup of images after 3 days

https://gerrit.wikimedia.org/r/620939

Does the upstream backy .deb install on Buster?

Yes. It needs one by-hand patch to work with mysql, and another to fix the 'du' subcommand.

https://github.com/wamdam/backy2/pull/32
https://github.com/wamdam/backy2/pull/72

The first of those might not matter since performance is fine with a local sqlite backend.

Can we do this using local storage on cloudstores, or do we need it on NFS?

No need for NFS.

What are some rough numbers for how big a backup image is, relative to initial VM size?

It's hard to generalize, but an xlarge (mostly empty) k8s worker node (with 15 GiB on disk) takes up about 3 GiB as a backup image. Other, smaller (also mostly empty) VMs are producing correspondingly smaller backup images.

Same question for incremental backups

There's no real difference in storage size between incremental and full backups; most of the gains seem to be coming from deduping.

Does Ceph misbehave for our users during the backup process?

Nothing obvious so far but it will be hard to know until we're under real-world load.

If we take 3 GiB as an upper bound for per-VM average use, we're looking at 600 VMs x 7 backups x 3 GiB = roughly 13 TiB of storage for a week of daily backups. Definitely workable! I think we should go ahead with putting this into production; if those numbers turn out to be way off we can compensate by reducing the number of backups we retain; even if we have to drop down to 2 or 3 days' worth it'll still be a lot better than nothing.

We should also be aggressive at excluding easily-reproducible VMs from backups. https://gerrit.wikimedia.org/r/c/operations/puppet/+/620758 provides a framework for that.
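
For the record, the arithmetic behind that estimate, in binary units:

```
# Rough storage estimate for a week of daily backups, using the numbers above.
vms = 600
daily_backups_retained = 7
gib_per_backup = 3  # rough per-VM upper bound observed so far

total_gib = vms * daily_backups_retained * gib_per_backup
print(total_gib, "GiB ≈", round(total_gib / 1024, 1), "TiB")  # 12600 GiB ≈ 12.3 TiB
```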

Change 621107 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt1006.eqiad.wmnet: move to role::spare::system

https://gerrit.wikimedia.org/r/621107

Change 621107 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt1006.eqiad.wmnet: move to role::spare::system

https://gerrit.wikimedia.org/r/621107

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008190111_andrew_25707.log.

Completed auto-reimage of hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008190251_andrew_8799.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008190310_andrew_12857.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008190321_andrew_14338.log.

Completed auto-reimage of hosts:

['cloudvirt1004.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1006.eqiad.wmnet']