Wombat

Wombat is an rsync backup script that I’ve been working on for a while. What sets wombat apart from other rsync backup tools is the following:

Threaded: Several images can be created simultaneously to maximize network throughput.
Multiple Writeto: Images can be written to multiple destinations for redundancy.
Config File: All features of wombat can be controlled through a config file.
Target Locking: Wombat is designed to be run hourly, but if an image takes longer than an hour to complete, later runs against the same target will not “stack up”.
Pruning: Old images are pruned using a simple “English” command syntax that defines which to keep.
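
To make the target locking idea concrete, here is a minimal sketch in Python (not wombat’s own Perl, and not necessarily how wombat does it). It assumes a Unix host and one lock file per target; the function names and the lock file path are hypothetical. The point is that a run which finds the target busy gives up immediately instead of queueing behind the run that is still in progress.

```python
import fcntl
import os


def try_lock(lock_path):
    """Return an open lock handle if the target is free, or None if it is busy."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def release(handle):
    """Unlock the target and close the lock file."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

An hourly job would call try_lock() for each target, skip any target that returns None, and call release() once the image is finished.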

What is Rsync Backup?

“Rsync Backup” is a general term that I am using for a very specific type of backup. Taken literally, “rsync backup” just means using rsync to make a backup. What I really mean by it is “backup images written to disk, created as snapshots using rsync and hard-links”, which, as you can see, would be much harder to say repeatedly. In any event, there are two tools that are important here: rsync and hard-links.

Rsync is a very mature tool whose primary purpose is to efficiently synchronize files between multiple storage devices, especially over long distances. The basic idea is that if only 1% of a large repository has changed, then you really only need to copy that 1% of the data and “patch” the files in the target destination. While having redundant images is helpful in the event of data loss, a good backup system usually needs to maintain several “images” representing a data source over time. If the backup system attempted to maintain several “full” images, this would require a tremendous amount of disk space and place an undue burden on the network during the creation of each full image. As such, rsync is only one part of the equation required for making an “rsync backup”. The other piece is hard-links.

The vast majority of Unix filesystems support a type of file called a “hard-link”. The idea is that you can create a copy of a file without copying it by simply creating an additional pointer, or directory entry, to the file contents. The advantage is that the directory entry, or “copy” of the file, requires essentially no additional disk space. Also, when directory entries are deleted, an individual file’s contents are preserved until there are no pointers left to the data.
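
The behavior described above is easy to see for yourself. The following is a small sketch in Python (chosen for illustration; wombat itself is Perl) that creates a file in a temporary directory, adds a hard-link to it, deletes the original name, and reports what survives. The file names are made up for the example.

```python
import os
import tempfile


def hard_link_demo():
    """Hard-link a file, delete the original name, and report what survives."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        linked = os.path.join(tmp, "linked.txt")
        with open(original, "w") as fh:
            fh.write("backup me")

        # Create a second directory entry pointing at the same file contents.
        os.link(original, linked)
        same_inode = os.stat(original).st_ino == os.stat(linked).st_ino
        links_before = os.stat(original).st_nlink

        # Remove the first name; the data stays because one pointer is left.
        os.remove(original)
        links_after = os.stat(linked).st_nlink
        with open(linked) as fh:
            contents = fh.read()
    return same_inode, links_before, links_after, contents
```

Both names share one inode, the link count drops from 2 to 1 when the original name is removed, and the contents are still readable through the remaining name.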

One important thing to keep in mind is that hard-links cannot cross filesystem boundaries; there is no trivial way to create a pointer to the file contents in another filesystem since the pointer must point to filesystem-level data (you *can* create a soft-link, which is essentially a pointer to another pointer, but this does not have the same data preservation as a hard-link).

These tools are combined as follows: rsync copies over only the changes, and hard-links make a nearly disk-free duplicate of the files that have not changed. In this manner, each backup image is a “full” image, yet consumes only the additional space of the changes made. This has two very important advantages over more traditional dump levels:

Each image is a full image: This is tremendously useful. If you need to recover an entire filesystem state, you only need to restore from one image. In a more traditional dump-level setup, you would need to restore from the most recent baseline and then restore from each of the incrementals, in order, which can be very time-consuming depending on how the dump levels are configured.
Each image is an “incremental”: The idea of doing a “baseline” is gone. Every image is a full image and only the differences need to be copied each time the backup is executed. The only full backup that occurs is the very first rsync. This makes backing up large repositories on a frequent basis possible, even if the network is slow.
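
To show how the two pieces fit together, here is a rough local emulation in Python. It is not wombat’s code and it does not call rsync at all: it simply walks a source tree, hard-links every file that is identical to the copy in the previous image, and copies only the files that are new or changed. The function name make_snapshot and its arguments are hypothetical. With real rsync, the same effect is usually achieved with its --link-dest option or by hard-link copying the previous image before syncing.

```python
import filecmp
import os
import shutil


def make_snapshot(source, previous, new):
    """Write a full image of `source` into `new`.

    Files identical to their copy in `previous` are hard-linked instead of
    copied. Pass previous=None for the very first image.
    Returns (files_linked, files_copied).
    """
    linked, copied = 0, 0
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(new, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(new, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)        # unchanged: costs almost no disk space
                linked += 1
            else:
                shutil.copy2(src, dst)   # new or changed: store the data once
                copied += 1
    return linked, copied
```

Every image directory produced this way can be browsed or restored as a complete tree, yet only the changed files consume new space. Note that the previous and new images must live on the same filesystem, since hard-links cannot cross filesystem boundaries.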

If you would like to read more about the exact mechanics of how to manually perform “rsync backups”, or peruse lists of other available tools, I recommend Mike Rubel’s rsync backup page.

Backups and Marsupials?

You might be asking yourself why this backup software is called “wombat”. Well, I would love to tell you that it is some super nifty self-referencing acronym like “Wombat Only Makes Backups All Tasty” or “Whenceforth Old Methods of Backup Are Tedious”. Or perhaps I’m secretly a closet marsupial scientist angling for better coverage of the lesser pouched critters. But in truth, my life was once saved by a herd of wombats…

Well, not really. In actual truth, I was tired of calling it “that rsync backup tool” and I just happened to be working on it while my daughter was watching a Baby Einstein video. And the subject matter at hand just happened to be wombats. What a neat little animal, I remember thinking. And that’s pretty much that.

Why use wombat?

If you’ve got a pretty good understanding of how rsync backups work, you’re probably thinking this should be something like a 5-line script. And this would inevitably lead you to be wary of wombat, which is something like 1,000 lines of Perl to control what seems to be a single rsync call. And yes, if you are backing up one machine then wombat is probably overkill.

I wrote wombat because I didn’t want to have to maintain a monolithic script that managed backup images. I wanted something I could configure easily, and something that could clean up old images with great flexibility. And I wanted something that could back up a lot of hosts and filesystems quickly and efficiently. And I wanted all this to be controlled with a config file, not redundant cron entries. When I made a list of all the features I wanted, there wasn’t anything out there that did it all; now there is.

What’s Missing?

The 1.0 version of wombat is pretty much the first version to support the essential features that actually make it useful. I’ve recently made a 2.0 “beta” available for download – the strikeouts below represent things that are fixed in the latest version. There is a pretty good list of things I want to finish in the TODO file, and the ChangeLog has an excellent record of what is done…

linkto: Since images can be written to multiple destinations, it would be great to automatically create a “soft-link” tree so all of the images available can be viewed from one directory. Another idea along this line would be to create a link tree that replaces individual files with a folder whose contents are links to each of the versions of that file that are available in the backup images.
Error Handling: At present, wombat keeps a fairly detailed logfile, but it makes no distinction between regular output and errors. There needs to be a good way of culling errors from the log data and reporting them, perhaps through email.
Reports: There needs to be a reporting function that can show what images exist for a given host:/target, when they were created, the size, what images failed, etc. Of course one could simply peruse the directory full of backup images, but it would be much nicer and quicker to just run something like “wombat.pl --report tank:/home” and see good information.
Hang Detection: If an rsync hangs, it will keep the target locked forever. Combine this with the lack of error handling, and you have a dangerous situation. There needs to be a user-specifiable parameter that controls when an rsync image will be forcibly timed out.
Code Cleanup: I pretty much bulldozed the current code. I’ve got to go through and remove some cruft, clarify variable naming/scoping, etc.
Rsync Options: I would like to provide a mechanism for passing through arbitrary rsync options from the wombat.conf file.
Archiving/Remote: Create a subsystem for syncing newly created images to a second image repository, probably offsite, for further redundancy.
Passive Clients: The idea is to create a space that a client could write its own images into, rather than having the server pull them, but then still have the server handle the pruning of old images. This could be used to back up machines that are not always connected, like laptops.
ACLs: I think it is possible to preserve ACLs when backing up Windows machines, but it will be a bit hacky. This could be done rather universally, that is, for both Linux and Windows, using something simple like getfacl/setfacl and a Linux filesystem on the backup host that understands ACLs. However, this isn’t perfect since some of the Windows permissions still do not map cleanly to Linux ACLs. Still, it would be nice to have it at least preserve the important ACLs.
Docs: I’ll be cranking away at this probably before working on any additional features.
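
As an illustration of the hang detection item above, here is one way a forced timeout could look. This is only a sketch in Python, not wombat’s Perl and not a description of how any wombat version handles it; run_with_timeout and its parameters are hypothetical names. The child process is killed once the time limit passes, so a lock held for that target could then be released.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run `command`; return its exit code, or None if it ran too long.

    subprocess.run kills the child process when the timeout expires,
    so a hung transfer cannot hold a target forever.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

Here command would be the rsync invocation as a list of arguments, and timeout_seconds would come from the user-specifiable setting described above. A None result means the image should be treated as failed.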

Download

Wombat 2.0

This is a major update, with lots of features and changes. I’ve been running the new version to back up production systems for over a year, but have just been too busy to get all the changes done that I want for what I might call a “2.0” release.

There are a lot of changes – be sure to read the README and ChangeLog since I have not been updating the online documentation with the new features.

Wombat 1.0

This is the first publicly available version.