rsync for LVM-managed block devices
Posted: Fri, 28 October 2011 | permalink | 23 Comments
If you’ve ever had to migrate a service to a new machine, you’ve probably
found rsync to be a godsend. Its ability to pre-sync most data while the
service is still running, then perform the much quicker “sync the new
changes” action after the service has been taken down, is fantastic.
For a long time, I’ve wanted a similar tool for block devices. I’ve managed
ridiculous numbers of VMs in my time, almost all stored in LVM logical
volumes, and migrating them between machines is a downtime hassle. You need
to shut down the VM, do a massive
dd | netcat, and then bring the machine
back up. For a large disk, even over a fast local network, this can be
quite an extended period of downtime.
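That classic migration looks something like this (hostnames, port, and device paths are illustrative, and netcat’s listen-flag syntax varies between implementations):

```shell
# On the destination: listen on a TCP port and write the stream to the new LV
nc -l -p 4321 | dd of=/dev/vg0/vmdisk bs=1M

# On the source: shut the VM down first, then stream the entire device across
dd if=/dev/vg0/vmdisk bs=1M | nc desthost 4321
```

The VM is down for the whole transfer, which is exactly the problem.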
The naive implementation of a tool that was capable of doing a block-device
rsync would be to checksum the contents of the device, possibly in blocks,
and transfer only the blocks that have changed. Unfortunately, as network
speeds approach disk I/O speeds, this becomes a pointless operation.
Scanning 200GB of data and checksumming it still takes a fair amount of time
– in fact, it’s often nearly as quick to just send all the data as it is to
checksum it and then send the differences.1
No, a different approach is needed for block devices. We need something that keeps track of the blocks on disk that have changed since our initial sync, so that we can just transfer those changed blocks.
As it turns out, keeping track of changed blocks is exactly what LVM snapshots do. They actually keep a copy of what was in the blocks before they changed, but we’re not interested in that so much. No, what we want is the list of changed blocks, which is stored in a hash table on disk.
All that was missing was a tool that read this hash table to get the list of blocks that had changed, then sent them over a network to another program that was listening for the changes and could write them into the right places on the destination.
That tool now exists, and is called
lvmsync. It is a slightly crufty
chunk of ruby that, when given a local LV and a remote machine and block
device, reads the snapshot metadata and transfers the changed blocks over an
SSH connection it sets up.
Be warned: at present, it’s a pretty raw piece of code. It does nothing but the “send updated blocks over the network”, so you have to deal with the snapshot creation, initial sync, and so on. As time goes on, I’m hoping to polish it and turn it into something Very Awesome. “Patches Accepted”, as the saying goes.
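For the record, a sketch of the full migration that lvmsync slots into, assuming a VG named vg0, an LV named vmdisk, and a destination reachable as desthost (all names and sizes illustrative; check the README for the exact invocation syntax):

```shell
# 1. Take a snapshot; all origin writes from here on are tracked in its metadata
lvcreate --snapshot --size 10G --name vmdisk-snap /dev/vg0/vmdisk

# 2. Bulk-copy the point-in-time image while the VM keeps running
dd if=/dev/vg0/vmdisk-snap bs=1M | ssh desthost 'dd of=/dev/vg0/vmdisk bs=1M'

# 3. Shut the VM down, then send only the blocks that changed since step 1
lvmsync /dev/vg0/vmdisk-snap desthost:/dev/vg0/vmdisk

# 4. Clean up, and boot the VM on the destination
lvremove -f /dev/vg0/vmdisk-snap
```

Only step 3 onwards needs the VM down, so the downtime is proportional to the changes made during the bulk copy, not to the size of the disk.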
rsync avoids a full-disk checksum because it cheats and uses file metadata (the last-modified time, or
mtime, of a file) to choose which files can be ignored. No such metadata is available for block devices (in the general case). ↩
Thank you so very much! This is something I’ve been sorely missing from my data-wrangling toolbox. It always baffled me why a similar “send snapshot to remote machine” utility hasn’t been already part of lvm2.
(commenting on the readme at github)
Keep it online
Add a second snapshot to your concept and you can transfer changed blocks online, with no downtime.
Keeping a snapshot between backups is, of course, quite a performance hit, so the lvcreate --chunksize should be rather large to minimize long-term impact.
Make it work for fast networks
If you didn’t do so already, let ssh use cipher arcfour or blowfish (in that order of preference) and make compression optional.
Or let the user define the pipe.
I used to use qpress compression and mbuffer over gigabit links, so that would transform to something like:
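Something like this, perhaps (qpress and mbuffer flags from memory, hostnames illustrative; treat it as a sketch rather than a tested pipeline):

```shell
# Destination: buffer the incoming stream, decompress, write to the LV
mbuffer -m 512M -I 4321 | qpress -dio | dd of=/dev/vg0/vmdisk bs=1M

# Source: read the device, compress, and push it over the gigabit link
dd if=/dev/vg0/vmdisk bs=1M | qpress -io | mbuffer -m 512M -O desthost:4321
```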
Get rid of lvm snapshots
When a snapshot runs out of space, does it stop tracking changed blocks? If so, a spinoff project could be a replacement for snapshots that doesn’t actually store anything (copy & paste from the snapshots implementation, minus storage).
From: Matt Palmer
rpetre: I’m not sure why it didn’t exist, either, given that it was surprisingly simple to implement. I guess it’s one of those “don’t realise you needed it until you have it” deals.
Korkman: Wow, thanks for all the notes. You’ve given me a fair pile of new things to implement, and that’s great. I also didn’t know about the
--chunksize option to lvcreate, which means I’ve got a massive, dangerous assumption in my code I need to fix Right Freaking Now.
Isn’t it quite expensive to keep a snapshot around for long on a running system? Still, it looks as if it could be useful alongside blocksync.py.
From: Matt Palmer
In terms of IOPS, having a snapshot around increases I/O load, but I wouldn’t call it “quite expensive” – if your storage system can’t handle the extra IOPS from a snapshot, it’s unlikely to handle reading an entire block device as fast as it can. At any rate, you shouldn’t have the snapshot for very long; just long enough to sync everything once, then run lvmsync to copy the changes, then you can delete it again.
From a brief read of the code, it looks like blocksync.py implements the other way I was looking at doing efficient block-level sync – compare all the blocks and only send the changed ones. As I mentioned, though, that only works well on slow links; when you can transfer data about as fast as you can read it, the time taken to scan a volume looking for changed blocks approaches the time it would take just to send the whole thing in the first place. If you need to sync block data over a constrained link (such as a typical Internet connection) then blocksync.py could work – as would lvmsync…
From: Joey Hess
I think that bup has, or will soon have, a way to solve this same problem. If I were you I’d get in touch with Avery and see what he has to say.
Sorry I didn’t realize this earlier, but “tomato” from the archlinux forum has a very useful statistics gathering daemon in the works using blktrace. That would make any snapshot device mapper magic obsolete, I think. At least for any non-rootfs backups.
See current tail of https://bbs.archlinux.org/viewtopic.php?id=113529
From: Matt Palmer
My main concern with something that’s entirely in userspace is the reliability of the metadata collector. Even for non-root-fs volumes, if the daemon (or whatever) that’s collecting the blktrace data crashes due to a bug (or whatever), you’ve got to rescan the entire volume looking for changes, a very costly operation, and one which basically destroys the whole point of the operation.
Really, all you’re doing with these sorts of solutions is building a poor-man’s DRBD, and for no good reason; you’re far, far better off using DRBD instead (if for no other reason than it already exists, and therefore can save you a pile of effort).
lvmsync isn’t intended for use as a continuous replication mechanism, and I’d strongly advise against trying to turn it into one.
On the other hand, I really do like the idea of taking block I/O stats and shuffling LVM extents around to optimise performance; I’ll be keeping a close eye on that work.
From: Craig Sanders
FYI, this sounds very similar to what ‘zfs send’ does (e.g.
zfs send snapshot | ssh remotesystem zfs recv snapshot) – it also relies on snapshots to know which blocks have changed and thus need to be sent.
you might pick up some useful ideas for your lvmsync tool from pages like:
The thing with DRBD is that I can’t use it for offsite backups. Even with the (proprietary) DRBD Proxy, updates get stuffed into the RAM and I suppose there’s no file shipping method available either (I really don’t like this commercial part of DRBD, sorry).
I’d like to dump changes from a snapshot into a file, transfer that, and apply the changes offsite to ensure integrity, or possibly keep the files around and apply only the oldest to have limited point-in-time recovery. Pretty much exactly what zfs send does.
Of course, keeping snapshots around all the time would hurt performance, so lvmsync right now is only suitable for the initial problem: a one-time transfer from A to B.
From: Matt Palmer
Korkman: sure, snapshots “hurt” (or, to put it more neutrally, “impact”) performance, but in my experience the impact isn’t that huge. Typically, the people saying “oh, you can’t use LVM snapshots, they kill performance” are people who have never actually used them, or who are running their I/O system too close to capacity for comfort.
At any rate, it’s a tradeoff – do you want reliable capture of change data, or don’t you? Yes, you can write a dm backend to obtain that state, and that’d improve performance – and I’d encourage you to write such a feature into the kernel if that’s how you’d like to spend your time. I certainly wouldn’t want to rely on userspace for such a critical piece of my backup system.
It’d also be fairly trivial to modify lvmsync to dump its change data to stdout or a file, if you want to capture the changes for posterity. After that, it’s about a three-line cron job to make a new snapshot, dump out the changes recorded in the previous snapshot, and then remove that older one.
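Something along those lines (entirely untested, and the --stdout dump option is hypothetical at this point):

```shell
# Hypothetical nightly rotation: snapshot, dump previous changes, rotate
lvcreate --snapshot --size 2G --name vmdisk-snap-new /dev/vg0/vmdisk
lvmsync --stdout /dev/vg0/vmdisk-snap > /backup/vmdisk-$(date +%F).lvmsync
lvremove -f /dev/vg0/vmdisk-snap
lvrename vg0 vmdisk-snap-new vmdisk-snap
```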
A “one-time transfer from A to B” is what dd is good for, not lvmsync. Yes, the documented use-case for lvmsync is a downtime-efficient migration of a block device from one system to another, because that’s what I wrote it for. How else you want to use it is up to your imagination and, if required, ability to modify Ruby code.
LVM snapshots do in fact exact a huge performance impact. In the past I’ve found writes to be roughly 5x slower while a snapshot exists. Granted, benchmarks are ‘full blast’ applications, but a performance impact like that is bound to be at least noticeable even on moderately loaded machines.
Try a simple sequential write test on an LVM volume (dd if=/dev/zero of=whatever bs=1M count=4096 conv=fdatasync), then take a snapshot and do it again. I got 95.4MB/s without a snapshot, and 23.7MB/s with. Random read/write isn’t quite as bad, with random 64k IOPS dropping 45% and 4k IOPS dropping 39%.
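Spelled out, the comparison is just this (volume and snapshot names illustrative):

```shell
# Baseline: sequential write, flushed to disk
dd if=/dev/zero of=/dev/vg0/testvol bs=1M count=4096 conv=fdatasync

# Take a snapshot and repeat the identical write
lvcreate --snapshot --size 5G --name testvol-snap /dev/vg0/testvol
dd if=/dev/zero of=/dev/vg0/testvol bs=1M count=4096 conv=fdatasync

lvremove -f /dev/vg0/testvol-snap
```

While the snapshot exists, every first write to a chunk of the origin forces a read of the old data and a write of it into the -cow volume before the new data can land, hence the drop in throughput.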
I do get the point about the necessary evil, the tradeoff; however, the reason I’m posting is that there is a light at the end of the tunnel. In kernel 3.2 there is [device mapper support](http://kernelnewbies.org/Linux_3.2#head-782dd8358611718f0d8468ee7034c760ba5a20d3) not only for thin provisioning, but for snapshots that don’t exact much performance impact at all. In fact, you can supposedly have an arbitrary number of snapshots (of snapshots) without issue. I tested this [last week](http://learnitwithme.com/?p=334) with promising results. Good news there.
lvmsync is an interesting tool, and I hope it stays up to date as LVM hopefully implements the 3.2 snapshot features. It could then be used as a daily differential backup. I found the site due to roughly the same concerns, although for me it’s the daily pain of reading the entire logical volume for backing up. KVM/Xen and the like already support live storage migration that allows me to move logical volumes between systems without downtime, so the one-time migration issue is solved for me… really it’s just backups now.
lvmsync was working well with this version of lvm2: LVM version 2.02.87(2)-RHEL6 (2011-10-12), library version 1.02.66-RHEL6 (2011-10-12), driver version 4.22.6.
But not with this one: LVM version 2.02.95(2)-RHEL6 (2012-05-16), library version 1.02.66-RHEL6 (2012-05-16), driver version 4.22.6.
I managed to modify the lvmsync code to correctly read the output of
dmsetup ls, because the major/minor values are now provided as (major:minor) instead of (major, minor), to get rid of “Could not find dm device mysnapshot (name mangled to mysnapshot)”. Last but not least, the snapshot is not properly copied onto the remote LV.
Sorry, I’m not that fluent in Ruby… perhaps mpalmer could do something about it.
From: Matt Palmer
Try with the latest commits I just pushed to Github. Whilst I don’t have a machine with the latest version of LVM to test, I’ve made changes that should fix up the problem, based on your problem report.
It’s best to report problems by e-mailing
email@example.com (as specified on the project page) since that automatically feeds into my todo list.
From: Leif Hetlesaether
The latest update (https://raw.github.com/mpalmer/lvmsync/252b009a4bd279f26f470e2d98101432078b692f/lvmsync) fixed the mangled error. Synced 2 blockdevices without problems. Running lvm2-2.02.95-10.el6.x86_64
From: Eric Wheeler
See a similar solution, blockfuse, which uses FUSE to represent block devices as files. Read-only support at the moment.
From: Matt Palmer
Yeeeeeah… I’m not sure that blockfuse is anything like a “similar solution”. There are patches to rsync to let it read and duplicate the content of block devices, too – and they would, I expect, have many of the same problems as trying to do an rdiff-backup.
As a demonstration, try this: Create a 200GB LV. Take a backup of it. Change one byte of data in the LV. How long does it take to do an rdiff-backup of that LV using blockfuse? It won’t be less than the time it would take to read all the data off the device – because any program that works entirely off the content of the device has to read the entire device to find out what has changed.
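Concretely, the experiment is something like this (paths and the byte offset are illustrative, and assume blockfuse exposes the LV under /mnt/blockfuse):

```shell
lvcreate --size 200G --name bigvol vg0
rdiff-backup /mnt/blockfuse /backups/bigvol          # initial full backup

# Flip a single byte somewhere in the middle of the volume
printf 'x' | dd of=/dev/vg0/bigvol bs=1 seek=1000000 conv=notrunc

# This run still has to read all 200GB to find that one byte
time rdiff-backup /mnt/blockfuse /backups/bigvol
```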
That’s what is different about lvmsync. It uses metadata that DM keeps about a snapshot to only look at those parts of the block device that have changed, thus making the transfer much, much more efficient. When you’re trying to minimise downtime during a migration, that is important.
From: Eric Wheeler
Great tool, and a fascinating use of the LV metadata. In particular, I’ve never seen a block tool which hot-moves a volume and then (reliably) picks up the pieces that changed. Xen did something like this for vm-page migrations between physical servers.
I’m not familiar with the metadata format, I’m curious if you can fill in some edge-case questions about lvmsync:
This must require a snapshot of the basis volume at all times in order to work. Assuming this is true (correct me if it’s not), then: 1. If the snapshot volume runs out of its allocated space (origin changes fill the -cow PE’s), is there a graceful way to recover, or are you back to a full-volume sync of some form?
This would be useful when you wish to get a near-complete copy of the server, but favor uptime over completeness.
(Side note: well documented. Your prose is enjoyable to read).
From: Matt Palmer
Hi Eric, let me answer your questions in order.
If a snapshot runs out of space, you’re boned. The change tracking metadata only has enough space to track the subset of changes that will fit in the snapshot, so you’ll lose track of changes and lvmsync won’t know about all the changes to transfer. So keep your snapshots healthy.
Yes, you can transfer just changed blocks without unmounting by taking a second snapshot, then using lvmsync to transfer the changes recorded by the first snapshot. I thought I had that documented in the README, but it looks like I didn’t. Patches welcomed.
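Sketched out, the two-snapshot dance looks like this (names illustrative, and assuming an existing snapshot vmdisk-snap0 taken before the initial copy):

```shell
# 1. Take a second snapshot, so changes keep being tracked for the next round
lvcreate --snapshot --size 2G --name vmdisk-snap1 /dev/vg0/vmdisk

# 2. Ship the blocks recorded as changed in snap0, with the origin still mounted
lvmsync /dev/vg0/vmdisk-snap0 desthost:/dev/vg0/vmdisk

# 3. Drop the old snapshot; snap1 becomes the baseline for the next sync
lvremove -f /dev/vg0/vmdisk-snap0
```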
I don’t have any specific plans to extend lvmsync into the realms of new DM metadata formats, but you never know what I might suddenly get an itch for doing one day (or what problem at work I suddenly decide I need to solve). The most likely “next step” at this stage would be to write a new dm backend that kept a record of “blocks changed”, and allowed you to efficiently get a list of the changes since point X. Bolt lvmsync onto that and you’d have a very efficient “periodic replication” mechanism.
Thanks for the feedback on my documentation. I never know if what I’m writing makes sense to anyone but me.
Your ‘lvmsync’ seems to be a nice tool. I am trying to use it now. I have a couple of problems with this.
- I created the snapshot, copied the block device to a remote server.
- Booted up the backed-up VM in single user mode and everything worked as expected.
- Then I did lvmsync on the snapshot while the source VM was running.
- lvmsync is able to sync the ‘diff’, but the backup VM no longer boots up.
Question is, does lvmsync only work if the source VM is paused or shut off?
Another improvement I would suggest is to use a sudo option rather than root while doing ssh to the remote server.
From: Matt Palmer
lvmsync works as advertised on volumes that are being concurrently written – but you won’t get a consistent view of the source volume, because you’re writing to it while it’s being copied.
If you feel that a sudo option would be of use, I’d encourage you to submit a pull request. I only implement features in lvmsync that I use myself.
QUESTIONS on --snapback.
Given: LV, LVSnap, LVSnapNew
(1) On the “Theory of Operation” section in the readme, could you say a few words about how snapback works? I gather LVSnap provides the delta between the present LV & the LV’s state when the snapshot was taken. Similarly for LVSnapNew. So, how do you go about getting the delta between LVSnap and LVSnapNew?
(2) On each boot, I intend to take a snapshot of the root FS before it’s mounted, having saved one previous snapshot. I’ll use this
Regarding my previous post (suggestion on snapbacks), my personal use would be to save the LV as a compressed image and use --stdout to save compressed snapbacks. I would then apply snapbacks (and mount/explore the LV) without an intermediate decompression, using AVFS (avf.sourceforge.net). This would give a rather complete backup strategy.
Fantastic tool. Thanks for your work.
SUGGESTION. What would be really nice is if we could:
(a) Combine/consolidate snapbacks (e.g. to combine older dailies into weeklies, and/or simplify the application of --apply):
(b) Apply a (or a combined) snapback, using the device mapper to generate a new device which is the state of the LV at a previous time:
I can see many applications of this, but mine will be mentioned in the next post (below).