Wednesday, May 21, 2014

Backup Strategies

I spent a number of years living with my documents, music, photos and assorted files scattered across various computers and different external backup drives. So when I shifted everything to a central NAS it was something of a revelation. It's pretty darned convenient having access to all the same files, no matter which computer or mobile device you use. I also found that consolidating stuff allowed me to save on hard drive space.
But there was a downside. Instead of having several copies of the same files (often unintentionally), I now had a single copy. My haphazard approach to storing and managing things had the unintended consequence of working as a sort of backup strategy. And by consolidating everything, I no longer had this protection. What to do?

The 3-2-1 Rule

There is a commonly quoted "best practice" that says if your data is not stored in three different places, it isn't truly protected. In other words, you need more than just a backup... you need multiple backups. Some take this a step further -- not only do you need three copies of your data to be safe, but they should be stored on at least two different storage mediums. And at least one of the copies should be stored offsite. Hence the 3-2-1 rule.
In a commercial environment, where data is critical to business operations, following the 3-2-1 rule may be difficult but almost certainly worthwhile. In a home environment, it becomes a bit more challenging. My NAS (as at the time of this post) has around 5-6TB of data. Regular backups of this volume of data, including offsite backup, are just not viable -- either economically, technologically or logistically. I'm running SnapRAID with q-parity, which provides something roughly equivalent to RAID6. So that is some measure of protection, but it's not a backup.
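Just to make the rule itself concrete, here is a rough Python sketch of a 3-2-1 check. The `Copy` class, the `meets_3_2_1` function and the example copies are all made up for illustration; they aren't part of any backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str      # label for this copy, e.g. "NAS"
    medium: str    # storage medium, e.g. "hard drive" or "optical disc"
    offsite: bool  # True if the copy lives away from home


def meets_3_2_1(copies):
    """Return True if the copies satisfy the 3-2-1 rule."""
    enough_copies = len(copies) >= 3                   # three copies
    enough_media = len({c.medium for c in copies}) >= 2  # two mediums
    has_offsite = any(c.offsite for c in copies)       # one offsite
    return enough_copies and enough_media and has_offsite


# Hypothetical example: two hard drives at home, one optical disc elsewhere.
copies = [
    Copy("NAS", "hard drive", offsite=False),
    Copy("external drive", "hard drive", offsite=False),
    Copy("disc at a relative's house", "optical disc", offsite=True),
]
print(meets_3_2_1(copies))
```

Take away the offsite disc and the same check fails on all three counts, which is roughly where a lone NAS plus a USB drive leaves you.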

Why RAID is not a backup

Spend any time discussing RAID (and/or backup strategies) on an internet forum, and someone is bound to pipe up with the reminder that "RAID is not a backup!" With the exception of RAID0 (which some geeks don't consider to be real RAID), the various RAID levels are designed to provide some degree of data redundancy. The term redundancy tends to imply that a RAID array creates multiple copies of the data. In the case of RAID1, where the data is mirrored across drives, this is certainly true. In the case of RAID5, the data itself isn't mirrored across drives. Instead, parity information is calculated from the data and spread across the drives in the array. And that parity information is sufficient to rebuild your array in the event that one of the drives fails. So in both instances, RAID1 and RAID5, it certainly sounds a lot like a backup.
So what is the difference?
The purpose of a backup is to help you restore your data in the event of a catastrophic failure. RAID5, for example, allows you to restore your data in the event of a drive failure. But if two drives fail at the same time, then your data is lost. If you don't mind forking over the cash, you can expand your array to RAID6, which provides parity protection like RAID5, but with double the parity. With RAID6, you can survive two drive failures. But what if you have a third fail at the same time?
Does that seem statistically unlikely? Maybe so. But drive "failure" can encompass a range of things. Ever had a power surge that has damaged electrical equipment? Flooding? Fire? Theft? Any of these events could see all the drives in your array essentially "fail". And in these instances, no level of RAID is going to protect you. Hence the mantra that RAID is not a backup.
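For the curious, the single-parity idea behind RAID5 can be shown with plain XOR. This is a simplified Python sketch, not how any real RAID implementation is written: real arrays work on stripes and rotate parity across drives, and the three "drives" and the `xor_blocks` helper here are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data "drives", each holding one 4-byte block.
drives = [b"docs", b"pics", b"tune"]

# The parity block is simply the XOR of all the data blocks.
parity = xor_blocks(drives)

# Simulate losing the middle drive, then rebuild it from the
# surviving drives plus the parity block.
survivors = [drives[0], drives[2]]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt)  # b'pics'
```

With one block missing, XOR-ing everything that is left gives the missing block back. With two blocks missing, a single parity block no longer carries enough information, which is why single parity stops at one drive failure.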

Developing a Backup Strategy

So here's the basic approach I took. I roughly classified my content into three broad categories:

  1. High priority -- really important stuff that I really, really don't want to lose. Mostly this includes personal and business documents, family photos, and a few other miscellaneous things (such as online purchases that I can't download again). If these get lost or destroyed, then they are essentially gone forever. Therefore, they should be afforded the strongest level of protection.
  2. Medium priority -- this is stuff that I don't want to lose, but it wouldn't be that devastating if it were destroyed. This is difficult to categorise. I have hundreds of movies and TV shows that have been ripped from DVD or Blu-ray. If they were accidentally lost, it would be truly painful to go through that process again. But the reality is that it could be done. Plus, I could go out and re-buy them if I really needed to (say, in the event of a fire... and assuming they were adequately covered by insurance). This is the sort of stuff that I make my best effort to back up.
  3. Low priority -- stuff where it really doesn't matter if it is lost. Mostly this is temporary files, downloads (e.g., the latest copy of OpenOffice or GIMP), cloud purchases (which can be re-downloaded), etc. No backup plan for this stuff.
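If your NAS is organised by top-level folder, a classification like this can be written down as a simple lookup. Here is a minimal Python sketch; the folder names in `TIERS` and the `priority_for` helper are hypothetical, not my actual layout.

```python
from pathlib import PurePosixPath

# Hypothetical top-level folders on the NAS and the tier each belongs to.
TIERS = {
    "documents": "high",
    "photos": "high",
    "movies": "medium",
    "tv": "medium",
    "downloads": "low",
}


def priority_for(path, default="high"):
    """Return the backup tier for a path, based on its top-level folder.

    Anything not listed falls back to `default`; "high" is used here so
    that unclassified files are over-protected rather than forgotten.
    """
    parts = PurePosixPath(path).parts
    top = parts[0] if parts else ""
    return TIERS.get(top, default)


print(priority_for("photos/2013/beach.jpg"))   # high
print(priority_for("movies/some_film.mkv"))    # medium
print(priority_for("downloads/installer.dmg")) # low
```

Defaulting unknown folders to "high" is a judgment call. It costs some backup space, but it means a new folder isn't silently left out until someone gets around to classifying it.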
Now that everything has been classified, what next?

Implementing the Strategy

Personally, I've set up my own VPS. The main benefit of using a VPS for backups is that I can manage the backup process myself. For example, most backup services, like Skydrive, Dropbox, etc, require you to use a proprietary client. If you are lucky, the service also has an API, which means there may be third-party client tools available. But unless you are a developer, you are still going to be limited in the tools that are available to you. Whereas my VPS is just a server running Linux, so it is pretty generic.
Why is this important? My favoured approach is using rsync from my NAS to my VPS. But I could transfer files over FTP, NFS, or whatever. I could install ownCloud if I wanted, or other similar services. I also like the fact that I can transfer from one VPS to another, pretty easily. In fact, I can set this up to run in the background, offsite location to offsite location, without impacting on my own internet connection. Whereas moving from Dropbox to Skydrive means re-uploading all your data to your new service provider. Given my mediocre upload speeds, this is not something I'd relish.
So running my own VPS has its advantages. But if setting up a VPS seems like overkill, then pick your backup service provider and use whatever tools they have available. Consider price, speed, reliability, security... whatever is important to you.
All the computers in my house connect directly to the NAS, and draw documents from the NAS as required. So as long as I back up from the NAS on a regular basis, everything should be fine. There are a couple of exceptions, though. For example, I keep my music collection on my laptop, so it's available to me whenever I'm away from home. So I periodically sync my laptop with the NAS, usually using rsync. My laptop is a MacBook Pro, so rsync is available out-of-the-box. If you are using Windows.... erm... well, I'm sure there is something similar :)
OK. So, assuming the NAS is more or less the definitive source of all my data, and assuming I've classified all my data according to priority, the following then applies:

  1. High priority stuff gets backed up every night to the VPS. I've written some really simple shell scripts to handle this (which I'll cover in another post), and scheduled them to run at 1am every day. Periodically, high priority stuff is also backed up to one or more external hard drives connected to my NAS. This more or less makes me compliant with the 3-2-1 rule -- I have three copies of my data (NAS, cloud and backup drive) and at least one of those is offsite (cloud). Technically, it's not on two different mediums, unless you consider cloud storage to be a different medium to a hard drive. But it's pretty close. Maybe I should consider getting a Blu-ray burner?
  2. Medium priority stuff is backed up periodically to external hard drives, but not stored on the cloud. Since most of this material is ripped DVDs and Blu-rays, you could argue that technically I have three copies of the data -- NAS, backup drive and the original discs. And I comply with the requirement for two different mediums (optical disc and hard drive), but I don't comply with the offsite requirement. If I had the means, I suppose I could box up the original discs and put them in storage. And in the event that something really serious happened, I could always re-buy all the discs. So again, that's pretty close to the 3-2-1 rule.
  3. Low priority stuff doesn't get backed up at all. But as with all the data, it is protected by SnapRAID's parity protection. On a few occasions I've accidentally deleted files, and been able to restore them using SnapRAID's functionality. So that's pretty handy, and about as much as I can expect for temporary files.
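To give a feel for the nightly job in step 1, here is a rough Python sketch that builds an rsync-over-SSH command. To be clear, this is not the shell script I actually use (that's for another post). The paths, the `backup@vps.example.com` host and both function names are placeholders I've made up.

```python
import shlex
import subprocess


def build_rsync_command(source, destination, dry_run=False):
    """Build an rsync-over-SSH command as a list of arguments."""
    cmd = ["rsync", "--archive", "--compress", "-e", "ssh"]
    if dry_run:
        # Show what would be transferred without copying anything.
        cmd.append("--dry-run")
    cmd += [source, destination]
    return cmd


def run_backup(source, destination):
    """Run the transfer; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(source, destination), check=True)


# Hypothetical source folder on the NAS and target folder on the VPS.
cmd = build_rsync_command(
    "/mnt/nas/high/",
    "backup@vps.example.com:/srv/backup/high/",
    dry_run=True,
)
print(shlex.join(cmd))

# A crontab entry along these lines would run a script at 1am every day
# (the script path is a placeholder):
#   0 1 * * * /usr/bin/python3 /home/me/nightly_backup.py
```

I've left out rsync's `--delete` option in this sketch on purpose. Without it, a file accidentally deleted on the NAS stays on the VPS, at the cost of the offsite copy slowly accumulating old files.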


So, do I comply with best practices? Not exactly. But for a home user, looking to manage a large volume of data in a cost-effective manner, I think I come pretty close. At the end of the day, it's all about tradeoffs. How important is your data? How much are you willing to spend? How much effort are you prepared to go to? At the very least, these are questions you should stop and think about. There will be a time when you lose your data, either through a careless accident, bad luck or malfeasant technology. So be prepared.