
Investigate/prototype ceph backup options
Closed, ResolvedPublic

Description

In the near-term we're only going to put truly disposable 'cattle' instances on ceph. In the meantime, though, we should come up with some sort of backup/restore process.

It's true that we currently have no backups for VMs, but our current failure case is losing one hypervisor's worth of VMs, whereas with ceph we now run the risk of losing the whole cloud if ceph freaks out.

Quick summary of most recent conversation:

  • We probably want to use Backy2 for this. We might also use Benji; it has fancier compression but is a younger project.
  • For proof-of-concept (and possibly near-term production) we'll use cloudstore1008/9.
    • For full-scale backups we probably need new hardware, but will learn more about storage needs as we go.
  • Some users (e.g. https://www.reddit.com/r/ceph/comments/61nmfv/how_is_anyone_doing_backups_on_cephrbd/) have had trouble with Ceph freezing when capturing snapshots for backup.
    • For starters we're going to hope that that isn't a problem for us; if it is then we'll have to consider creating a mirrored cluster just for backup purposes.
      • Possibly that mirror can have only one replica rather than three, which might push it into affordability

For the first round of tests/experiments, I'd like to answer these questions:

  • Does the upstream backy .deb install on Buster?
  • Can we do this using local storage on cloudstores, or do we need it on NFS?
  • What are some rough numbers for how big a backup image is, relative to initial VM size?
    • Same question for incremental backups
  • Does Ceph misbehave for our users during the backup process?

Details

Repo               Branch      Lines +/-
operations/puppet  production  +4 -7
operations/puppet  production  +1 -1
operations/puppet  production  +34 -15
operations/puppet  production  +18 -1
operations/puppet  production  +2 -0
operations/puppet  production  +26 -0
operations/puppet  production  +23 -9
operations/puppet  production  +8 -3
operations/puppet  production  +2 -2
operations/puppet  production  +233 -8
operations/puppet  production  +13 -5
operations/dns     master      +4 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +70 -29
operations/puppet  production  +8 -0
operations/puppet  production  +10 -5
operations/puppet  production  +8 -0
operations/puppet  production  +14 -5
operations/puppet  production  +5 -3
operations/puppet  production  +397 -0

Event Timeline

This is the slide deck from OVH at FOSDEM about how they ended up with ceph backing up to ceph: https://archive.fosdem.org/2018/schedule/event/backup_ceph_at_scale/attachments/slides/2671/export/events/attachments/backup_ceph_at_scale/slides/2671/slides.pdf

It's good for reference because it describes their successes and failures in multiple backup system attempts.

There are some (largely unhelpful) recent discussions about that talk, as well as a link to the video, here: https://www.reddit.com/r/ceph/comments/cznqoz/ceph_whole_cluster_backuprestore/

The long and short of it is that some people are saving a bit by backing up via radosgw, which doesn't give you as fast a hot-swap cluster, but there it is. This also helps emphasize the not-that-great state of backups in ceph.

Also, if any solution requires a backup daemon set up by us, and we don't want to just write a python service, we could do something like what bacula does for application-specific backups (http://wiki.bacula.org/doku.php?id=application_specific_backups) and use bacula, since that's already a thing at the foundation, right?

Andrew triaged this task as High priority.Jul 30 2020, 3:23 PM

Change 617841 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add Backy2 module and profile

https://gerrit.wikimedia.org/r/617841

Sorry for just dropping the message, but I thought it might be interesting.
In that thread they also point out some other talks at Cephalocon that might be interesting too. This one compares several ways of doing backups: https://static.sched.com/hosted_files/cephalocon2019/58/ceph2ceph-presentation169.pdf
From: https://ceph.io/cephalocon/barcelona-2019/

Change 617841 merged by Andrew Bogott:
[operations/puppet@production] Add Backy2 module and profile

https://gerrit.wikimedia.org/r/617841

Change 618842 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] backy2: fix up some dependency issues in install

https://gerrit.wikimedia.org/r/618842

Change 618842 merged by Andrew Bogott:
[operations/puppet@production] backy2: fix up some dependency issues in install

https://gerrit.wikimedia.org/r/618842

Change 618849 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618849

Change 618853 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618853

Change 618854 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Retool cloudvirt1004 and cloudvirt1006 as ceph/backy2 test hosts

https://gerrit.wikimedia.org/r/618854

Change 618849 abandoned by Andrew Bogott:
[operations/puppet@production] Introduce role::wmcs::ceph::backup

Reason:

https://gerrit.wikimedia.org/r/618849

Change 618853 merged by Andrew Bogott:
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618853

Change 618854 merged by Andrew Bogott:
[operations/puppet@production] Retool cloudvirt1004 and cloudvirt1006 as ceph/backy2 test hosts

https://gerrit.wikimedia.org/r/618854

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008062058_andrew_22313.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008062120_andrew_10703.log.

Completed auto-reimage of hosts:

['cloudvirt1006.eqiad.wmnet', 'cloudvirt1004.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1006.eqiad.wmnet', 'cloudvirt1004.eqiad.wmnet']

Change 618875 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: add a bunch of keys needed for the ceph client config

https://gerrit.wikimedia.org/r/618875

Change 618875 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: add a bunch of keys needed for the ceph client config

https://gerrit.wikimedia.org/r/618875

Change 618876 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph: split out rbd client profiles

https://gerrit.wikimedia.org/r/618876

Change 618876 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph: split out rbd client profiles

https://gerrit.wikimedia.org/r/618876

Change 618878 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: remove reference to 'nova'

https://gerrit.wikimedia.org/r/618878

Change 618878 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: remove reference to 'nova'

https://gerrit.wikimedia.org/r/618878

Change 618879 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backup: remove another nova-specific ref

https://gerrit.wikimedia.org/r/618879

Change 618879 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backup: remove another nova-specific ref

https://gerrit.wikimedia.org/r/618879

Change 618995 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Added ipv6 addresses for cloudvirt1004 and cloudvir1006

https://gerrit.wikimedia.org/r/618995

Change 618995 merged by Andrew Bogott:
[operations/dns@master] Added ipv6 addresses for cloudvirt1004 and cloudvir1006

https://gerrit.wikimedia.org/r/618995

In order to stand up the initial mysql db, we need to apply this by hand before running initdb:

https://github.com/wamdam/backy2/pull/32/commits/589baa5d24abe0f88a8c430d66513386d83f4b13

Good grief. At least python doesn't need a recompile?
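
For future reference, a rough sketch of that by-hand step (GitHub serves any PR as a plain patch at .../pull/<N>.patch; the install path below is an assumption about where the upstream .deb puts the code):

```
# Hedged sketch of the manual step: fetch the PR as a patch, apply it to the
# installed backy2 sources, then run initdb. The dist-packages path is an
# assumption, not something taken from the actual package layout.
import subprocess

PATCH_URL = "https://github.com/wamdam/backy2/pull/32.patch"  # GitHub's .patch view of the PR
INSTALL_DIR = "/usr/lib/python3/dist-packages"                # assumed install location

patch_text = subprocess.run(
    ["curl", "-sL", PATCH_URL], check=True, capture_output=True, text=True
).stdout
subprocess.run(
    ["patch", "-p1", "-d", INSTALL_DIR], input=patch_text, text=True, check=True
)
subprocess.run(["backy2", "initdb"], check=True)
```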

Change 619011 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/backy2/ceph: add admin keyring so backy can access things

https://gerrit.wikimedia.org/r/619011

Change 619011 merged by Andrew Bogott:
[operations/puppet@production] wmcs/backy2/ceph: add admin keyring so backy can access things

https://gerrit.wikimedia.org/r/619011

Change 619350 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances

https://gerrit.wikimedia.org/r/619350

Change 619350 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances

https://gerrit.wikimedia.org/r/619350
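
To give a sense of what the backup pass looks like, here's a minimal Python sketch of the kind of loop a script like wmcs-backup-instances runs: snapshot each RBD image in the compute pool, hand the snapshot to backy2, then drop the snapshot. The pool name, snapshot naming, and image-listing step are assumptions for illustration, not the actual script.

```
#!/usr/bin/env python3
"""Hedged sketch of a per-image ceph -> backy2 backup pass.

Assumptions (not taken from the real wmcs-backup-instances script): the VM
disks live in an RBD pool called 'eqiad1-compute', and a fresh snapshot per
run is good enough (no rbd-diff based differential handling).
"""
import datetime
import subprocess

POOL = "eqiad1-compute"  # assumed pool name


def rbd_images(pool):
    """Return the list of RBD image names in a pool."""
    out = subprocess.run(
        ["rbd", "ls", pool], check=True, capture_output=True, text=True
    )
    return out.stdout.split()


def backup_image(pool, image):
    """Snapshot an image and hand the snapshot to backy2."""
    snap = datetime.datetime.utcnow().strftime("backup-%Y%m%d%H%M%S")
    subprocess.run(["rbd", "snap", "create", f"{pool}/{image}@{snap}"], check=True)
    try:
        # backy2's documented form: backy2 backup <source> <name>
        subprocess.run(
            ["backy2", "backup", f"rbd://{pool}/{image}@{snap}", image], check=True
        )
    finally:
        subprocess.run(["rbd", "snap", "rm", f"{pool}/{image}@{snap}"], check=True)


if __name__ == "__main__":
    for image in rbd_images(POOL):
        backup_image(POOL, image)
```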

Change 619486 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy fix name of backup script

https://gerrit.wikimedia.org/r/619486

Change 619486 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy fix name of backup script

https://gerrit.wikimedia.org/r/619486

With a very limited sample set, it takes about 20 minutes per VM to run a backup with the current setup. There's no noticeable performance improvement for incremental backups.

I don't know what the bottleneck is, but I'm guessing network throttling.

Change 620102 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy2 specify expiration for backups

https://gerrit.wikimedia.org/r/620102

Change 620103 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day

https://gerrit.wikimedia.org/r/620103

Change 620106 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Remove backup role from cloudvirt1004

https://gerrit.wikimedia.org/r/620106

Change 620106 merged by Andrew Bogott:
[operations/puppet@production] Remove backup role from cloudvirt1004

https://gerrit.wikimedia.org/r/620106

Change 620102 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy2 specify expiration for backups

https://gerrit.wikimedia.org/r/620102

Change 620103 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day

https://gerrit.wikimedia.org/r/620103

Change 620107 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy2: use 'root' user to run backups

https://gerrit.wikimedia.org/r/620107

Change 620107 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy2: use 'root' user to run backups

https://gerrit.wikimedia.org/r/620107

Change 620110 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy2: move our cleanup logic into a script

https://gerrit.wikimedia.org/r/620110

Change 620110 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy2: move our cleanup logic into a script

https://gerrit.wikimedia.org/r/620110

I have daily backups running now on cloudvirt1006. Right now the backups will expire after three days; next week I'll try restoring from an automatically-created backup.
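
For reference, that restore test would look roughly like the sketch below; backy2's documented form is 'backy2 restore <version_uid> <target>', and the uid and target image here are placeholders, not real values from the cluster.

```
# Hedged example of a restore from an automatically-created backup.
import subprocess

version_uid = "c0ffee..."  # placeholder; the real uid comes from 'backy2 ls'
target = "rbd://eqiad1-compute/restored-test-image"  # assumed pool/image name

subprocess.run(["backy2", "restore", version_uid, target], check=True)
```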

Change 620758 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-backup-instances: add a dict of regexps to exclude servers from backup

https://gerrit.wikimedia.org/r/620758

Change 620758 merged by Andrew Bogott:
[operations/puppet@production] wmcs-backup-instances: add a dict of regexps to exclude servers from backup

https://gerrit.wikimedia.org/r/620758
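
To illustrate how that exclusion dict can work, here's a minimal sketch; the project names and regexps below are made-up examples, not the real exclusion list.

```
import re

# Hypothetical exclusion map: project name -> regexps of instance names that
# should be skipped because the VMs are easy to rebuild from scratch.
EXCLUDED_SERVERS = {
    "toolsbeta": [r".*"],                # example: skip a whole project
    "tools": [r"^tools-k8s-worker-.*"],  # example: skip disposable k8s workers
}


def should_backup(project, server_name):
    """Return False if any exclusion regexp for the project matches the name."""
    for pattern in EXCLUDED_SERVERS.get(project, []):
        if re.match(pattern, server_name):
            return False
    return True


assert should_backup("tools", "tools-sgebastion-07")
assert not should_backup("tools", "tools-k8s-worker-30")
```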

Change 620939 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] backy2: permit cleanup of images after 3 days

https://gerrit.wikimedia.org/r/620939

Change 620939 merged by Andrew Bogott:
[operations/puppet@production] backy2: permit cleanup of images after 3 days

https://gerrit.wikimedia.org/r/620939

Does the upstream backy .deb install on Buster?

Yes. It needs one by-hand patch to work with mysql, and another to fix the 'du' subcommand.

https://github.com/wamdam/backy2/pull/32
https://github.com/wamdam/backy2/pull/72

The first of those might not matter since performance is fine with a local sqlite backend.

Can we do this using local storage on cloudstores, or do we need it on NFS?

No need for NFS.

What are some rough numbers for how big a backup image is, relative to initial VM size?

It's hard to generalize, but an xlarge (mostly empty) k8s worker node (with 15 GiB on disk) takes up about 3 GiB as a backup image. Other, smaller (also mostly empty) VMs are producing correspondingly smaller backup images.

Same question for incremental backups

There's no real difference in storage size between incremental and full backups; most of the gains seem to be coming from deduping.

Does Ceph misbehave for our users during the backup process?

Nothing obvious so far but it will be hard to know until we're under real-world load.

If we take 3 GiB as an upper bound for per-VM average use, we're looking at 600 VMs x 7 backups x 3 GiB = roughly 13 TiB of storage for a week of daily backups. Definitely workable! I think we should go ahead with putting this into production; if those numbers turn out to be way off we can compensate by reducing the number of backups we retain; even if we have to drop down to 2 or 3 days' worth it'll still be a lot better than nothing.

We should also be aggressive at excluding easily-reproducible VMs from backups. https://gerrit.wikimedia.org/r/c/operations/puppet/+/620758 provides a framework for that.
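
For the record, the arithmetic behind that estimate, in binary units:

```
# Rough storage estimate for a week of daily backups, using the numbers above.
vms = 600
daily_backups_retained = 7
gib_per_backup = 3  # rough per-VM upper bound observed so far

total_gib = vms * daily_backups_retained * gib_per_backup
print(total_gib, "GiB ≈", round(total_gib / 1024, 1), "TiB")  # 12600 GiB ≈ 12.3 TiB
```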

Change 621107 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt1006.eqiad.wmnet: move to role::spare::system

https://gerrit.wikimedia.org/r/621107

Change 621107 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt1006.eqiad.wmnet: move to role::spare::system

https://gerrit.wikimedia.org/r/621107

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008190111_andrew_25707.log.

Completed auto-reimage of hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008190251_andrew_8799.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008190310_andrew_12857.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008190321_andrew_14338.log.

Completed auto-reimage of hosts:

['cloudvirt1004.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1006.eqiad.wmnet']