Monday, November 11, 2013

Backups to Amazon S3 -- simple and efficient

We use Amazon S3 as part of our backup strategy: every backup server in our datacenter replicates its local backup images to S3 daily. We already keep at least six physical copies of each backup (three copies, each on a different machine, with all backup disks in RAID1), but an offsite copy for disaster recovery (DR) remains very important, even if we go multi-datacenter and replicate data in semi-realtime.
Over the lifetime of this installation we tried several approaches. The first was s3cmd. It was very functional and reliable, but it had no real way to determine what had changed between the local and remote "filesystems", so it ended up copying hundreds of gigabytes per host per day, which was very slow. We thought something like rsync would work much better, so we moved to s3fs+rsync. Unfortunately, that setup was very unstable and either required a second copy of the files (to cache remote attributes) or tended to download parts of remote files to compare them with the originals and decide whether a file needed to be copied. We also evaluated duplicity (it consumed large amounts of temporary space) and a couple of commercial solutions, but none of them was good enough for us.
So I decided to write a simple utility that does this kind of sync: s3backup.
Features:
  • easy to use -- configure AWS credentials, specify source and destination, put it into your crontab and you're set
  • resource-efficient -- never downloads remote files; compares file sizes (good if your backups are named differently each day) and/or MD5 checksums (good for other cases, at the cost of a bit more CPU); does not attempt to read an entire file into RAM, etc.
  • works with large files -- currently the upper limit is about 500GB (10,000 chunks of 50MB each); this can easily be increased to 5TB if you don't need MD5 checksum comparison
  • supports recycling -- locally-removed files are removed from S3 only after they reach a specified age
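To make the comparison and recycling rules above more concrete, here is a minimal Python sketch of how such a sync could be planned. It is not s3backup's actual code. RemoteEntry, file_md5 and plan_sync are hypothetical names, the remote metadata is assumed to come from a bucket listing (with an MD5 recorded at upload time when checksum comparison is wanted), a removed file's age is assumed to be measured from the remote object's last-modified time, and 50MB is taken as decimal megabytes.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

CHUNK_SIZE = 50 * 1000 * 1000            # 50 MB per chunk (decimal MB assumed)
MAX_CHUNKS = 10_000                      # chunk limit quoted in the post
MAX_FILE_SIZE = CHUNK_SIZE * MAX_CHUNKS  # 500 GB


@dataclass
class RemoteEntry:
    """Metadata from a bucket listing (hypothetical shape, not a real API)."""
    size: int
    last_modified: float                 # Unix timestamp
    md5: Optional[str] = None            # hex digest, if recorded at upload


def file_md5(path: str, block_size: int = 1024 * 1024) -> str:
    """Hash a file block by block so it is never fully loaded into RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_sync(local_dir: str, remote: Dict[str, RemoteEntry], now: float,
              recycle_age: float, use_md5: bool = False
              ) -> Tuple[List[str], List[str]]:
    """Return (keys to upload, keys to delete) without downloading anything."""
    uploads: List[str] = []
    local_keys = set()
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, local_dir).replace(os.sep, "/")
            local_keys.add(key)
            size = os.path.getsize(path)
            if size > MAX_FILE_SIZE:
                raise ValueError(f"{key} exceeds the {MAX_FILE_SIZE}-byte limit")
            entry = remote.get(key)
            if entry is None or entry.size != size:
                uploads.append(key)      # new file, or size changed
            elif use_md5 and entry.md5 != file_md5(path):
                uploads.append(key)      # same size, different content
    # Recycling: drop remote-only objects once they are old enough.
    deletes = [key for key, entry in remote.items()
               if key not in local_keys
               and now - entry.last_modified >= recycle_age]
    return sorted(uploads), sorted(deletes)
```

With size-only comparison a file is re-uploaded only when it is new or its size differs, which suits backups that get a fresh name every day. Turning on use_md5 additionally catches same-size changes at the cost of reading each local file once.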
You're welcome to try it out and use it. Feedback is also very welcome, especially about the file sizes you manage and the RAM constraints on your backup machines.
