rsync for LVM-managed block devices

Posted: Fri, 28 October 2011 | permalink | 23 Comments

If you’ve ever had to migrate a service to a new machine, you’ve probably found rsync to be a godsend. Its ability to pre-sync most data while the service is still running, then perform the much quicker “sync the new changes” pass after the service has been taken down, is fantastic.

For a long time, I’ve wanted a similar tool for block devices. I’ve managed ridiculous numbers of VMs in my time, almost all stored in LVM logical volumes, and migrating them between machines is a downtime hassle. You need to shut down the VM, do a massive dd | netcat, and then bring the machine back up. For a large disk, even over a fast local network, this can be quite an extended period of downtime.
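(For reference, the “massive dd | netcat” step is something like the following sketch; the host name, device names, and port are made up, and netcat option syntax differs between variants.)

    # on the destination machine
    nc -l -p 4000 | dd of=/dev/vg0/vm-disk bs=1M

    # on the source machine, with the VM shut down
    dd if=/dev/vg0/vm-disk bs=1M | nc dest-host 4000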

The naive implementation of a tool that was capable of doing a block-device rsync would be to checksum the contents of the device, possibly in blocks, and transfer only the blocks that have changed. Unfortunately, as network speeds approach disk I/O speeds, this becomes a pointless operation. Scanning 200GB of data and checksumming it still takes a fair amount of time – in fact, it’s often nearly as quick to just send all the data as it is to checksum it and then send the differences.1
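To put rough numbers on that: checksumming 200GB at a disk read speed of around 100MB/s takes over half an hour before a single changed byte has been sent, while simply shipping the whole 200GB over gigabit Ethernet (roughly 110MB/s of usable throughput) takes about the same half hour. The checksum pass only starts to pay off when the network is much slower than the disks.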

No, a different approach is needed for block devices. We need something that keeps track of the blocks on disk that have changed since our initial sync, so that we can just transfer those changed blocks.

As it turns out, keeping track of changed blocks is exactly what LVM snapshots do. They actually keep a copy of what was in the blocks before they changed, but we’re not interested in that so much. No, what we want is the list of changed blocks, which is stored in a hash table on disk.
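As a quick illustration (with made-up VG and snapshot names, and output that varies a little between kernel and LVM versions), you can see this bookkeeping from userspace: dmsetup reports how much of the snapshot's copy-on-write store is in use, and the on-disk exception store itself is exposed as a separate “-cow” device:

    # 'vg0' and 'vmsnap' are hypothetical names
    dmsetup status vg0-vmsnap
    #  => 0 209715200 snapshot 81920/20971520 64    (used/total sectors, metadata sectors)
    ls /dev/mapper/ | grep vmsnap
    #  => vg0-vmsnap   vg0-vmsnap-cow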

All that was missing was a tool that read this hash table to get the list of blocks that had changed, then sent them over a network to another program that was listening for the changes and could write them into the right places on the destination.

That tool now exists, and is called lvmsync. It is a slightly crufty chunk of Ruby that, when given a local LV and a remote machine and block device, reads the snapshot metadata and transfers the changed blocks over an SSH connection it sets up.

Be warned: at present, it’s a pretty raw piece of code. It does nothing but the “send updated blocks over the network”, so you have to deal with the snapshot creation, initial sync, and so on. As time goes on, I’m hoping to polish it and turn it into something Very Awesome. “Patches Accepted”, as the saying goes.
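To give a feel for the intended workflow, here's a sketch (VG, LV, and host names are made up, and the exact lvmsync invocation may differ; check the README):

    # take a snapshot while the VM is still running
    lvcreate --snapshot --size 10G --name vm-snap /dev/vg0/vm

    # initial bulk copy of the frozen snapshot image, service still up
    dd if=/dev/vg0/vm-snap bs=1M | ssh dest-host 'dd of=/dev/vg0/vm bs=1M'

    # shut the VM down, then send only the blocks changed since the snapshot was taken
    lvmsync /dev/vg0/vm-snap dest-host:/dev/vg0/vm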

  1. rsync avoids a full-disk checksum because it cheats and uses file metadata (the last-modified time, or mtime of a file) to choose which files can be ignored. No such metadata is available for block devices (in the general case). 


23 Comments

From: rpetre
2011-10-28 11:55

Thank you so very much! This is something I’ve been sorely missing from my data-wrangling toolbox. It always baffled me why a similar “send snapshot to remote machine” utility hasn’t been already part of lvm2.

From: Korkman
2011-10-29 04:33

(commenting on the readme at github)

Some thoughts:

Keep it online

Add a second snapshot to your concept and you can transfer changed blocks online, with no downtime.

  1. Take initial snapshot.
  2. Transfer entire snapshot over the network.
  3. Take a second snapshot.
  4. Read data from second snapshot, according to blocks changed as listed by first snapshot.
  5. Delete first snapshot. Second snapshot becomes first snapshot.
  6. On next backup schedule, goto 3 and repeat.

Keeping a snapshot between backups is, of course, quite a performance hit so lvcreate chunksize should be rather large to minimize long-term impact.
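A rough sketch of that rotation with stock LVM commands (hypothetical names; note that lvmsync as it stands reads the changed blocks from the origin rather than from the second snapshot, so this is not quite the byte-exact scheme described in step 4):

    # steps 1-2: initial snapshot and full copy
    lvcreate --snapshot --chunksize 256k --size 20G --name snap1 /dev/vg0/data
    dd if=/dev/vg0/snap1 bs=1M | ssh backup-host 'dd of=/dev/vg0/data bs=1M'

    # steps 3-6: each cycle, take a new snapshot, ship the blocks recorded in
    # the old one, then drop the old snapshot and promote the new one
    lvcreate --snapshot --chunksize 256k --size 20G --name snap2 /dev/vg0/data
    lvmsync /dev/vg0/snap1 backup-host:/dev/vg0/data
    lvremove -f /dev/vg0/snap1
    lvrename vg0 snap2 snap1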

Make it work for fast networks

If you didn’t do so already, let ssh use cipher arcfour or blowfish (in that order of preference) and make compression optional.

Or let the user define the pipe.

    lvmsync LV_DATA LV_SNAP --send | ssh root@server "lvmsync LV_BACKUP --receive"

I used to use qpress compression and mbuffer over gigabit links, so that would transform to something like:

    lvmsync LV_DATA LV_SNAP --send | mbuffer -M 256M | qpress | ssh root@server "qpress | lvmsync LV_BACKUP --receive"

Get rid of lvm snapshots

When a snapshot runs out of space, does it stop tracking changed blocks? If so, a spinoff project could be a replacement for snapshots that doesn’t actually store anything (copy & paste from the snapshots implementation, minus the storage).

From: Matt Palmer
2011-10-29 08:51

rpetre: I’m not sure why it didn’t exist, either, given that it was surprisingly simple to implement. I guess it’s one of those “don’t realise you needed it until you have it” deals.

Korkman: Wow, thanks for all the notes. You’ve given me a fair pile of new things to implement, and that’s great. I also didn’t know about the --chunksize option to lvcreate, which means I’ve got a massive, dangerous assumption in my code I need to fix Right Freaking Now.

From: bill
2011-10-29 04:16

Isn’t it quite expensive to keep a snapshot around for long on a running system? Still, it looks as if it could be useful along with blocksync.py.

From: Matt Palmer
2011-10-29 14:50

Hi Bill,

In terms of IOPS, having a snapshot around increases I/O load, but I wouldn’t call it “quite expensive” – if your storage system can’t handle the extra IOPS from a snapshot, it’s unlikely to handle reading an entire block device as fast as it can. At any rate, you shouldn’t have the snapshot around for very long; just long enough to sync everything once, then run lvmsync to copy the changes, then you can delete it again.

From a brief read of the code, it looks like blocksync.py implements the other way I was looking at doing efficient block-level sync – compare all the blocks and only send the changed ones. As I mentioned, though, that only works well on slow links; when you can transfer data about as fast as you can read it, the time taken to scan a volume looking for changed blocks approaches the time it would take just to send the whole thing in the first place. If you need to sync block data over a constrained link (such as a typical Internet connection) then blocksync.py could work – as would lvmsync.

From: Joey Hess
2011-10-30 03:32

I think that bup has, or will soon have, a way to solve this same problem. If I were you I’d get in touch with Avery and see what he has to say.

From: Korkman
2011-10-30 22:06

Sorry I didn’t realize this earlier, but “tomato” from the archlinux forum has a very useful statistics gathering daemon in the works using blktrace. That would make any snapshot device mapper magic obsolete, I think. At least for any non-rootfs backups.

See current tail of https://bbs.archlinux.org/viewtopic.php?id=113529

From: Matt Palmer
2011-10-31 07:22

My main concern with something that’s entirely in userspace is the reliability of the metadata collector. Even for non-root-fs volumes, if the daemon (or whatever) that’s collecting the blktrace data crashes due to a bug (or whatever), you’ve got to rescan the entire volume looking for changes, a very costly operation, and one which basically defeats the whole point of the exercise.

Really, all you’re doing with these sorts of solutions is building a poor-man’s DRBD, and for no good reason; you’re far, far better off using DRBD instead (if for no other reason than it already exists, and therefore can save you a pile of effort). lvmsync isn’t intended for use as a continuous replication mechanism, and I’d strongly advise against trying to turn it into one.

On the other hand, I really do like the idea of taking block I/O stats and shuffling LVM extents around to optimise performance; I’ll be keeping a close eye on that work.

From: Craig Sanders
2011-11-10 22:28

FYI, this sounds very similar to what ‘zfs send’ does (e.g. zfs send snapshot | ssh remotesystem zfs recv snapshot) - it also relies on snapshots to know which blocks have changed and thus need to be sent.

you might pick up some useful ideas for your lvmsync tool from pages like:

http://www.markround.com/archives/38-ZFS-Replication.html

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storing_ZFS_Snapshot_Streams_.28zfs_send.2Freceive.29

http://www.128bitstudios.com/2010/07/23/fun-with-zfs-send-and-receive/

From: Korkman
2011-11-12 02:57

The thing with DRBD is that I can’t use it for offsite backups. Even with the (proprietary) DRBD Proxy, updates get stuffed into RAM, and I suppose there’s no file-shipping method available either (I really don’t like this commercial part of DRBD, sorry).

I’d like to dump changes from a snapshot into a file, transfer that, and apply the changes offsite to ensure integrity, or possibly keep the files around and apply only the oldest ones to have limited point-in-time recovery. Pretty much exactly what zfs send does.

Of course, keeping snapshots around all the time would hurt performance, so lvmsync right now is only suitable for the initial problem: a one-time transfer from A to B.

From: Matt Palmer
2011-11-12 08:04

Korkman: sure, snapshots “hurt” (or, to put it more neutrally, “impact”) performance, but in my experience the impact isn’t that huge. Typically, the person saying “oh, you can’t use LVM snapshots, they kill performance” is someone who has never actually used them, or is running their I/O system too close to capacity for comfort.

At any rate, it’s a tradeoff – do you want reliable capture of change data, or don’t you? Yes, you can write a dm backend to obtain that state, and that’d improve performance – and I’d encourage you to write such a feature into the kernel if that’s how you’d like to spend your time. I certainly wouldn’t want to rely on userspace for such a critical piece of my backup system.

It’d also be fairly trivial to modify lvmsync to dump its change data to stdout or a file, if you want to capture the changes for posterity. After that, it’s about a three-line cron job to make a new snapshot, dump out the changes recorded in the previous snapshot, and then remove that older one.

A “one-time transfer from A to B” is what dd is good for, not lvmsync. Yes, the documented use-case for lvmsync is a downtime-efficient migration of a block device from one system to another, because that’s what I wrote it for. How else you want to use it is up to your imagination and, if required, ability to modify Ruby code.

From: Marcus
2012-01-21 03:34

LVM snapshots do in fact exact a huge performance impact. In the past I’ve found it to be roughly 5x slower while the snapshot exists. Granted, benchmarks are ‘full blast’ applications, but a performance impact like that is bound to be at least noticeable even on moderately loaded machines.

Try a simple sequential write test on an LVM volume (dd if=/dev/zero of=whatever bs=1M count=4096 conv=fdatasync), then take a snapshot and do it again. I got 95.4MB/s without a snapshot, and 23.7MB/s with. Random read/write isn’t quite as bad, with random 64k IOPS dropping 45% and 4k IOPS dropping 39%.

I do get the point about the necessary evil, the tradeoff; however, the reason I’m posting is that there is a light at the end of the tunnel. In kernel 3.2 there is device mapper support (http://kernelnewbies.org/Linux_3.2#head-782dd8358611718f0d8468ee7034c760ba5a20d3) not only for thin provisioning, but for snapshots that don’t exact much of a performance impact at all. In fact, you can supposedly do an arbitrary number of snapshots (of snapshots) without issue. I tested this last week (http://learnitwithme.com/?p=334) with promising results. Good news there.

lvmsync is an interesting tool, and I hope it stays up to date as LVM hopefully implements the 3.2 snapshot features. It could then be used as a daily differential backup. I found the site due to roughly the same concerns, although for me it’s the daily pain of reading the entire logical volume for backing up. KVM/Xen and the like already support live storage migration that allows me to move logical volumes between systems without downtime, so the one-time migration issue is solved for me… really it’s just backups now.

From: Durand
2012-08-08 22:16

lvmsync was working well with this version of lvm2:

    LVM version:     2.02.87(2)-RHEL6 (2011-10-12)
    Library version: 1.02.66-RHEL6 (2011-10-12)
    Driver version:  4.22.6

But not with this one:

    LVM version:     2.02.95(2)-RHEL6 (2012-05-16)
    Library version: 1.02.66-RHEL6 (2012-05-16)
    Driver version:  4.22.6

I managed to modify the lvmsync code to correctly parse the output of dmsetup ls, since the major/minor values are now reported as (major:minor) instead of (major, minor); that got rid of “Could not find dm device mysnapshot (name mangled to mysnapshot)”. Last but not least, the snapshot is not properly copied onto the remote LV.

Sorry, I’m not that fluent in Ruby… Perhaps mpalmer could do something about it.

Regards

From: Matt Palmer
2012-08-08 22:35

Hi Durand,

Try with the latest commits I just pushed to Github. Whilst I don’t have a machine with the latest version of LVM to test, I’ve made changes that should fix up the problem, based on your problem report.

It’s best to report problems by e-mailing theshed+lvmsync@hezmatt.org (as specified on the project page) since that automatically feeds into my todo list.

From: Leif Hetlesaether
2012-08-16 18:18

The latest update (https://raw.github.com/mpalmer/lvmsync/252b009a4bd279f26f470e2d98101432078b692f/lvmsync) fixed the name-mangling error. Synced 2 block devices without problems. Running lvm2-2.02.95-10.el6.x86_64.

From: Eric Wheeler
2012-12-19 07:27

See a similar solution, blockfuse, which uses FUSE to represent block devices as files. Read-only support at the moment.

http://www.globallinuxsecurity.pro/blockfuse-to-the-rescue-rdiff-backup-of-lvm-snapshots-and-block-devices/

From: Matt Palmer
2012-12-22 13:43

Yeeeeeah… I’m not sure that blockfuse is anything like a “similar solution”. There are patches to rsync to let it read and duplicate the content of block devices, too – and they would, I expect, have many of the same problems as trying to do an rdiff-backup.

As a demonstration, try this: Create a 200GB LV. Take a backup of it. Change one byte of data in the LV. How long does it take to do an rdiff-backup of that LV using blockfuse? It won’t be less than the time it would take to read all the data off the device – because any program that works entirely off the content of the device has to read the entire device to find out what has changed.
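If you want to see this for yourself, the lower bound is easy to measure (a sketch; the volume group, LV name, and offsets are made up):

    # a 200GB test LV in a hypothetical volume group
    lvcreate --size 200G --name testlv vg0

    # change a single byte somewhere near the start of the device
    echo -n X | dd of=/dev/vg0/testlv bs=1 count=1 seek=1000000 conv=notrunc

    # any tool that works purely from the device contents has to read all 200GB
    # just to find that one byte; this read time is the floor for its run time
    time dd if=/dev/vg0/testlv of=/dev/null bs=1M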

That’s what is different about lvmsync. It uses metadata that DM keeps about a snapshot to only look at those parts of the block device that have changed, thus making the transfer much, much more efficient. When you’re trying to minimise downtime during a migration, that is important.

From: Eric Wheeler
2013-01-16 17:09

Great tool, and a fascinating use of the LV metadata. In particular, I’ve never seen a block tool which hot-moves a volume and then (reliably) picks up the pieces that changed. Xen did something like this for VM page migrations between physical servers.

I’m not familiar with the metadata format, I’m curious if you can fill in some edge-case questions about lvmsync:

This must require a snapshot of the basis volume at all times in order to work. Assuming this is true (correct me if it’s not), then:

  1. If the snapshot volume runs out of its allocated space (origin changes fill the -cow PEs), is there a graceful way to recover, or are you back to a full-volume sync of some form?

  2. Can I replicate the changed blocks without unmounting or otherwise pausing I/O on the basis volume, for example by taking a second snapshot after the initial transfer? For example:

        lvcreate -s -n s1 /dev/vg/origin
        dd if=/dev/vg/origin of=>(somewhere)
        lvcreate -s -n s2 /dev/vg/origin
        # now send the difference s1<=>s2 somewhere, using an lvmsync/lvdiff between s1 and s2

This would be useful when you wish to get a near-complete copy of the server, but favor uptime over completeness.

  3. Do you have plans to apply this concept to the dm-thinp implementation? It would be neat to issue something like lvdiff /dev/pool/origin /dev/pool/origin-snap and have it produce a list of differing blocks. There probably exist race conditions here, so you might need to take snapshots of each and diff the 2nd-order snapshots while the 1st-order snapshots are written to. Ultimately, I think there is a way to replicate a consistent volume without freezing I/O on the intended source volume.

-Eric

(Side note: well documented. Your prose is enjoyable to read).

From: Matt Palmer
2013-01-22 12:46

Hi Eric, let me answer your questions in order.

  1. If a snapshot runs out of space, you’re boned. The change tracking metadata only has enough space to track the subset of changes that will fit in the snapshot, so you’ll lose track of changes and lvmsync won’t know about all the changes to transfer. So keep your snapshots healthy.

  2. Yes, you can transfer just changed blocks without unmounting by taking a second snapshot, then using lvmsync to transfer the changes recorded by the first snapshot. I thought I had that documented in the README, but it looks like I didn’t. Patches welcomed.

  3. I don’t have any specific plans to extend lvmsync into the realms of new DM metadata formats, but you never know what I might suddenly get an itch for doing one day (or what problem at work I suddenly decide I need to solve). The most likely “next step” at this stage would be to write a new dm backend that kept a record of “blocks changed”, and allowed you to efficiently get a list of the changes since point X. Bolt lvmsync onto that and you’d have a very efficient “periodic replication” mechanism.

Thanks for the feedback on my documentation. I never know if what I’m writing makes sense to anyone but me.

From: iamauser
2013-04-11 03:02

Hi Matt,

Your ‘lvmsync’ seems to be a nice tool. I am trying to use it now. I have a couple of problems with this.

-) I created the snapshot, copied the block device to a remote server.
-) Booted the backed-up VM in single-user mode and everything worked as expected.
-) Then I did lvmsync on the snapshot while the source VM was running.
-) lvmsync is able to sync the ‘diff’, but the backup VM no longer boots up.

The question is: does lvmsync only work if the source VM is paused or shut off?

Another improvement I would suggest is a sudo option, rather than using root when sshing to the remote server.

From: Matt Palmer
2013-04-11 13:32

Hi iamauser,

lvmsync works as advertised on volumes that are being concurrently written to – but you won’t get a consistent view of the source volume, because it is still being written to while it is being transferred.

If you feel that a sudo option would be of use, I’d encourage you to submit a pull request. I only implement features in lvmsync that I use myself.

From: DA
2014-01-05 20:12

QUESTIONS on --snapback.
Given: LV, LVSnap, LVSnapNew

(1) In the “Theory of Operation” section of the readme, could you say a few words about how snapback works? I gather LVSnap provides the delta between the present LV & the LV’s state when the snapshot was taken. Similarly for LVSnapNew. So, how do you go about getting the delta between LVSnap and LVSnapNew?

I am particularly confused about why you only need to point lvmsync --snapback at the older of the two snapshots, rather than both of the previous snapshots.

(2) On each boot, I intend to take a snapshot of the root FS before it’s mounted, having saved one previous snapshot. I’ll use this
method: https://wiki.archlinux.org/index.php/Create_root_filesystem_snapshots_with_LVM

So at each boot I'll have the LV, + LVSnap from the previous boot.
Then I'll create LVSnapNew, save the snapback, and replace LVSnap
by LVSnapNew.  Do I understand correctly that your "fairly large
caveats" will not apply?

   "... the LV will still be collecting writes while you're
   transferring the snapshots, so you won't get a consistent
   snapshot (in the event you have to rollback, it's almost
   certain you'll need to fsck).  You'll almost certainly want to
   incorporate some sort of I/O freezing into the process, ..."

Regarding my previous post (suggestion on snapbacks), my personal use would be to save the LV as a compressed image and use --stdout to save compressed snapbacks. I would then apply snapbacks (and mount/explore the LV) without an intermediate decompression, using AVFS (avf.sourceforge.net). This would give a rather complete backup strategy.

Fantastic tool. Thanks for your work.

From: DA
2014-01-05 20:03

SUGGESTION. What would be really nice is if we could:

(a) Combine/consolidate snapbacks (e.g. to combine older dailies into weeklies, and/or simplify the application of --apply):

    lvmsync --combine <Ordered_list_of_snapbacks> FILENAME

(b) Apply a (or a combined) snapback, using the device mapper to generate a new device which is the state of the LV at a previous time:

    lvmsync --apply <SNAPBACK> <LV> DEVNAME

That is, produce a device, DEVNAME, which is the previous state of
the LV without formally patching the LV.  This new device could then
be mounted or otherwise used to explore the previous state of the
device.

I can see many applications of this, but mine is mentioned in my next post (above).
