As more and more of our lives become digital, we've come to rely on digital files to store everything we love: music (digitized from old CDs or bought online from iTunes), video (ripped from legally bought DVDs, or bought online on iTunes), pictures (taken with your great DSLR or mobile phone), and soon (e-)books (for the Kindle or iPad), etc.
Haven't you wondered whether the book you just bought will still be readable by your great-grandchildren (I have books from the mid-19th century!), or whether the picture you just took will hang, digitally framed, on their wall? I have pictures from the end of the 19th century in my living room; the one here shows my great-great-grandparents crossing to Chile on one of their trips, over 100 years ago.
I've been bothered by the problem of data durability for quite some time (if you browse this blog back a few years, it's a persistent theme). It's actually three separate problems:
- data integrity: this is what we usually call data backup. How do you maintain the integrity of your files on your current system?
- data persistence: but it's more than that. What happens when you change computers (or devices)? When technology changes? When generations change? See, I started collecting my data in the early 80s on 5¼" diskettes (remember those?), then I moved it to 3.5" diskettes (I still have a drive to read them. Do you?). Then I burned it to CDs (which have a limited life span!), then moved it to DVDs. Currently all my data sits on external hard disks. What's next? The point is that I need to keep moving my data onto new technologies before the previous one becomes obsolete, and I worry about keeping a. the data, b. the software, and c. the devices needed to access it all. One step I'm taking here is to start printing books with my pictures.
- data repository unicity: the two points above assume you keep all your data in one place, but that is not so. You also produce very interesting and/or important data elsewhere, on social sites and networks: your blog, Twitter, Facebook, Gmail, etc. I might come back to this point in another post, but online solutions such as backupify are spot-on and very useful.
So back to data integrity. The key criteria for a backup strategy should be:
- reliability vs. availability: do I want my data AVAILABLE (i.e. accessible) ALL the time (like my posts on my Facebook page, kept available without my even thinking about it), or my data SAFE (a reliable, hence usable, copy)?
The best solution for 100% (or close to it) up-time is mirroring 2 or more disks (RAID 1), so that you get one more exact copy on the fly. But it has drawbacks: a. it's expensive for large volumes and can take a performance toll on your system; b. if you save a corrupted file (happens all the time), you're saving several copies of a corrupted file. Not what you want.
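A side note: mirroring can't tell a corrupted file from a good one, but checksums can. Here is a minimal Python sketch of recording them so you can detect silent changes later; the paths are hypothetical, and this is just an illustration of the technique, not a full tool.

```python
import hashlib
import json
import os

def sha256(path, bufsize=1 << 20):
    """Hash a file in chunks so multi-GB videos don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root):
    """Map every file under root to its checksum."""
    sums = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            sums[os.path.relpath(full, root)] = sha256(full)
    return sums

# Record checksums when the data is stored (hypothetical volume);
# re-run snapshot() later and diff against this file to spot files
# whose bytes changed without you touching them.
with open("photos.sums.json", "w") as fh:
    json.dump(snapshot("/Volumes/Photos"), fh, indent=2)
```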
Reliability says you are guaranteed a recent enough copy of your data, or even several copies scattered over time, so that you can roll back to an uncorrupted version. Restoring an old copy might take time (getting an archive tape back from somewhere, or rebuilding a directory from incremental backups).
=> in my case, my personal data (videos, pictures, etc.) doesn't change much once it's stored: I usually only add new data, and rarely go back to work on old data (modifying old pix, for example). Hence I don't really need incremental backup for my data, only for my system and some office documents; a dead-simple, append-only copy covers it (see the sketch below).
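To make that concrete, here is a minimal Python sketch of such an append-only copy. The volume names are made up, and a real version would want error handling and an alarm/report at the end; this just shows the idea: anything new gets copied, anything already backed up is never touched.

```python
import shutil
from pathlib import Path

def copy_new_files(src: Path, dst: Path):
    """One-way, append-only copy: files missing on the backup get
    copied over; existing files are never modified or deleted."""
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            if not target.exists():
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, target)  # copy2 preserves timestamps

# Hypothetical volumes:
copy_new_files(Path("/Volumes/Photos"), Path("/Volumes/Backup/Photos"))
```

Because nothing is ever modified or deleted on the backup side, there is no versioning to manage, which fits data you only ever add to.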
- usability: or should I say, simplicity. Most backup strategies fail because they become too complicated for home use. You can impose procedures in corporations, but they are hard to follow otherwise. The most automated system, with automatic alarms, is probably the way to go.
I also don't like software that produces backups that aren't immediately readable by the operating system (i.e. that make you restore from a proprietary archive). Copies of my files that I can read just by plugging the external drive into another machine are great; it should be as simple as inserting a data DVD. This last point becomes crucial with large data volumes: I'm currently working with more than 2 TB, and 2 TB is the largest disk size on the market. Time Machine will only back up to ONE disk, and starts deleting old copies when it fills up. I don't want that. I could build a larger volume by combining several external disks (JBOD-style concatenation, or RAID 0 striping), but that is a problem in itself: if one disk in the pool fails, all the data is lost; and if I want to re-read the disks elsewhere, I need to set up the whole pool just to read one file. So that option is out.
=> therefore I need a solution that lets me back up (sync?) specific directories to specific disks. Ideally, though, a program would work like this: here are all the files I want to back up (from different disks, partitions, RAID arrays, directories, etc.), and here is a pool of target disks; slice it up automatically without asking me anything more. The catch is that I'd need to maintain a "master" disk holding the logs and the index of all files. Restoring a file (smaller than 2 TB of course, which might become a problem: my picture library for just last year is already almost 0.5 TB) would then require at worst 2 disks: the index disk, and the data disk.
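No tool I know of works exactly like that, but the slicing itself is not the hard part. Here is a minimal Python sketch of the idea, greedy first-fit with a master index; the file sizes and disk names are invented, and a real tool would query actual free space and do the copying too.

```python
import json

def slice_across_disks(files, pool):
    """Assign each (path, size) pair to the first disk with room,
    biggest files first, and record every placement in a master
    index, so restoring any one file needs at most 2 disks:
    the index disk and the data disk."""
    free = dict(pool)    # disk name -> free bytes
    index = {}           # file path -> disk name
    for path, size in sorted(files, key=lambda f: -f[1]):
        for disk, room in free.items():
            if size <= room:
                free[disk] -= size
                index[path] = disk
                break
        else:
            raise RuntimeError(f"pool is full, no room left for {path}")
    return index

# Two hypothetical 2 TB disks and a couple of sample files:
files = [("pix/2009/img_001.nef", 25_000_000),
         ("video/chile_trip.mov", 4_000_000_000)]
index = slice_across_disks(files, {"backup_a": 2 * 10**12,
                                   "backup_b": 2 * 10**12})
with open("master_index.json", "w") as fh:
    json.dump(index, fh, indent=2)
```

The index file is exactly the "master" disk's job: lose it and you are back to scanning every disk in the pool to find one file, which is the JBOD problem all over again.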
All backup strategies should include a combination of the following:
- on-site and off-site backups, separated by a reasonable distance (think natural disaster like the earthquake in Chile, fire, theft, etc.): we NEVER do this, but we should, because these days our data is worth far more than the computer itself. Backing up ONLINE is an alternative to a remote off-site copy, but for me it would require very large storage (TBs, not GBs) and a very, very fast line (fiber), unless I want a year-long backup. So I will put my most precious data (pix, family videos, document archives from my work) on an external disk and store it elsewhere, just in case, on a monthly basis at best. This is called a time-stamped archive (sketched below). There's no point archiving data you can get back some other way.
=> off-site archive is key
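For what a monthly, time-stamped off-site archive could look like, here is a small Python sketch. Plain uncompressed tar on purpose, so the archive stays readable with no special software; the paths are invented.

```python
import tarfile
from datetime import date
from pathlib import Path

def monthly_archive(src: Path, archive_dir: Path) -> Path:
    """Write a dated, self-contained archive, e.g. Photos-2010-03.tar."""
    name = f"{src.name}-{date.today():%Y-%m}.tar"
    archive_dir.mkdir(parents=True, exist_ok=True)
    out = archive_dir / name
    with tarfile.open(out, "w") as tar:  # "w" = plain tar, no compression
        tar.add(src, arcname=src.name)
    return out

# Hypothetical volumes; the Offsite disk then goes to another building.
monthly_archive(Path("/Volumes/Photos"), Path("/Volumes/Offsite"))
```

The date in the file name is what makes it an archive rather than a backup: you never overwrite it, you add the next month's alongside.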
- father/son/grandfather strategy, better known as daily/weekly/monthly backups (or hourly/daily/weekly, etc.). The idea is to rotate between different copies of a disk, because sometimes your data gets corrupted and you need to go back in time. I would keep one full backup on-site, next to my computer. However, when you make a new backup you overwrite the current one, so while the backup is being rebuilt there is no proper backup at all...
=> hence you need at least 2 rotating backup sets (see the sketch below).
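The rotation itself is trivial to automate; remembering to do it is the hard part. A minimal Python sketch of choosing which of the 2 sets to overwrite (mount points invented, the actual copy step left out):

```python
from pathlib import Path

def pick_backup_set(sets):
    """Always overwrite the OLDEST set, so the newest complete
    backup survives intact while this one is being rebuilt."""
    def stamp(s: Path) -> float:
        marker = s / "LAST_BACKUP"
        return marker.stat().st_mtime if marker.exists() else 0.0
    return min(sets, key=stamp)

def finish_backup(target: Path):
    """Stamp the set only once the copy has completed successfully."""
    (target / "LAST_BACKUP").touch()

sets = [Path("/Volumes/BackupA"), Path("/Volumes/BackupB")]
target = pick_backup_set(sets)
# ... run the actual copy onto `target` here ...
finish_backup(target)
```

Until finish_backup() runs, the set keeps its old stamp, so an interrupted backup simply gets retried on the same set next time.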
There is an incredible number of backup programs out there for the Macintosh. Probably the only way to choose the right one is to FIRST decide on selection criteria, then narrow down the list by:
- supported features
- usability
- price
I'll test out a few programs in a second post, and in a final post I'll describe the complete setup and the procedures to go with it.