I’ve been tasked with revamping the backup strategy at my new job, and I think I am finally happy with what I’ve come up with. As usual, I’m wondering what others have done when they were tackling such a task, and what you, dear reader, think about my setup. Please leave me a comment if you want to share your thoughts, or links to articles you’ve read that relate to anything discussed here (along with a short description of how it relates).
With websites, it appears there are three distinct data components that require backups: code, content files, and the database. With a well-thought-out version control setup, you only need to keep one backup of the current repository, since the repository itself contains all the revisions.
- Code only ever changes when the developer makes an update, and always changes on the development server (I swear!). The developer commits his changes to version control manually, and then uses a deployment automation tool to publish the changes to the production server.
- With content files and the database, changes happen mainly on the production server, and since we are not going to ask the user to commit to our version control system, backups should be automated: either at intervals, like daily, or, where feasible, just-in-time, triggered directly by the CMS whenever content is added.
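As a sketch of what that automated interval backup could look like for the content files, here is a minimal example using GNU tar's incremental mode. Every directory and file name here is invented, and the S3 upload (via the s3cmd tool) is only shown as a comment:

```shell
#!/bin/sh
# Hypothetical daily content-file backup.  All paths are made up.
mkdir -p content backups
echo "hello" > content/page1.html

# Level-0 (full) backup: tar records directory state in a snapshot
# file, so later runs archive only what changed since.
tar --listed-incremental=backups/snapshot.snar \
    -czf backups/content-full.tar.gz content

# Simulate the next day's content change...
echo "new post" > content/page2.html

# Incremental backup: only the new/changed files are included.
tar --listed-incremental=backups/snapshot.snar \
    -czf backups/content-incr.tar.gz content

# Shipping it to S3 would be one more line, e.g.:
# s3cmd put backups/content-incr.tar.gz s3://my-backup-bucket/

tar -tzf backups/content-incr.tar.gz
```

The just-in-time variant would simply have the CMS invoke the same script from its save hook instead of cron.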
At the medium-size call centre where I used to work, incremental backups to Amazon S3 cost us about $1/month, so that part should be fairly inexpensive. The overall cost of this setup climbs significantly, though, if you are paying for a hosted VCS.
Again, I would really appreciate your thoughts on this, and links to articles written based on what others have learned from experience in this area. Thanks for reading!
5 thoughts on “Website Data Backups with Version Control”
I don’t know what you specialize in but I pretty much use WordPress for everything I do. One way you can back up is by creating a Gmail address (you get a ton of FREE and FAST storage). I then followed this technique (http://weblogtoolscollection.com/archives/2005/03/08/wp-backup-to-email-script/) to back up my websites automatically. I also set up my development environment so that all my development happens on one machine, which makes backing it up really easy as well. Take a look at it here: (http://jaredheinrichs.com/the-ultimate-ubuntu-wordpress-development-machine.html). Hope that helps!
Thank you for the suggestion Jared,
I’ve been considering Gmail as well. It really seems reliable and is also free, but it has an attachment limit of 25 MB, so I’ve been discounting it so far. However, now that you mention it again, I don’t see any reason why the backup script couldn’t specify how large each piece of the archive should be (like you see with .rar, .r00, .r01 files sometimes), and send larger backups to Gmail in pieces. I read the script you linked, but I don’t think it does that.
I’ll probably look into that a little bit more – a quick Google lookup yielded http://jamesmcdonald.id.au/it-tips/gnu-linux/linux-tools/using-linuxcygwin-to-split-files-into-chunks-for-transfer-to-cd as a tutorial on splitting archives and putting them back together.
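The split-and-reassemble step itself is simple enough to sketch. The file names here are invented, and a small random file stands in for a real archive; for real sub-25 MB pieces you would use -b 20m rather than -b 20k:

```shell
#!/bin/sh
# Stand-in for a large backup archive (64 KB of random data).
dd if=/dev/urandom of=backup.tar.gz bs=1024 count=64 2>/dev/null

# Cut it into fixed-size pieces, each small enough to email.
split -b 20k backup.tar.gz backup.tar.gz.part-

# Each backup.tar.gz.part-* file could now be mailed separately.
# To restore, concatenate the pieces back together in order:
cat backup.tar.gz.part-* > restored.tar.gz

cmp -s backup.tar.gz restored.tar.gz && echo "archive reassembled intact"
```

split's default alphabetical suffixes (part-aa, part-ab, …) mean the shell glob already returns the pieces in the right order for cat.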
Having one development machine is really nice in my experience as well, it makes scripting common tasks easier.
Actually, I also like to use WordPress for making websites quite a bit. If you would like to connect with people in Winnipeg that work with WordPress, check out:
It’s a nice idea for sure. One element you haven’t mentioned is the database schema, which probably belongs in the code repo, rails-migrations-style. There may also be some preset data in there that only the developer changes, which seems more like code too in terms of backup and restore. Another thing is config – if your site needs Apache config directives or php.ini settings a certain way, you’d better back up and restore those too.
If you tag your production data backups with the VCS version tag, you can be sure you won’t restore data that’s incompatible with your code or schema.
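That tagging might look something like the following sketch. It assumes git (with Subversion, `svnversion` prints the working-copy revision instead), and the dump contents are faked since there is no real database here:

```shell
#!/bin/sh
# Sketch: stamp each data backup with the code revision it was taken
# against, so a restore can never silently mix data with an
# incompatible schema or codebase.  Repo and file names are invented.
set -e
git init -q siterepo
git -C siterepo -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "site code at deploy time"

REV=$(git -C siterepo rev-parse --short HEAD)

# A real dump would come from mysqldump; faked here for illustration.
echo "-- dump taken against revision $REV" > "db-$(date +%F)-${REV}.sql"
ls db-*-"${REV}".sql
```

At restore time, the revision in the file name tells you exactly which code checkout the data belongs with.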
Also, don’t forget your restore plan – that’s the most important part of your backup system. If you lose a whole server you might be restoring 20 or 30 sites so the restore instructions need to be reliable and followable by anyone. You also have a couple of restore scenarios – total loss vs. the customer deleted one item and needs it back.
I guess you also want your own copy of the data in case S3 is unreachable. You can get software that implements the S3 API locally, so you only have to code against one API for both backups.
Last thing – you can gpg encrypt the data when you back it up to S3. That’s worth careful thought – it gives your clients more security but it adds one more tricky way for your backups to be unrecoverable.
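Here is a sketch of that encryption step, using GnuPG in symmetric mode purely to keep the example self-contained; a real setup would more likely encrypt to a public key with `gpg -e -r <key>`, would certainly not hard-code the passphrase, and the `--pinentry-mode loopback` flag assumes GnuPG 2.1 or later:

```shell
#!/bin/sh
# Hypothetical encrypt-before-upload step.  File names are invented.
set -e
echo "sensitive customer data" > dump.sql

gpg --batch --yes --pinentry-mode loopback \
    --passphrase "not-a-real-passphrase" -c -o dump.sql.gpg dump.sql

# dump.sql.gpg is what would go to S3, not the plaintext.
# The caveat above in practice: a lost key means an unrecoverable
# backup, so make the round-trip check part of the backup job itself:
gpg --batch --yes --quiet --pinentry-mode loopback \
    --passphrase "not-a-real-passphrase" -d -o restored.sql dump.sql.gpg

cmp -s dump.sql restored.sql && echo "encrypted backup verified"
```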
Anyway, hope you have enjoyed these disconnected and incoherent thoughts.
Thank you very much for your thoughts Mark. I value them highly even if they’re a rough draft.
I believe that because I am looking at MySQL dumps, which contain the CREATE TABLE statements and other DDL, I do have the database schema available with the data. And since I would be keeping everything in a version control system and then backing that up to S3, a local copy is available too. Would that cover those?
For a restore plan, here’s what I’ve come up with:
– For single files, maybe I set up something like ViewVC? Otherwise, just knowing how to use the version control system should be enough to retrieve a small number of files.
– For entire websites, some sort of automated deployment of code, database and files could be combined to provide the restore script?
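As a rough illustration of how those pieces could compound into one restore script, here is a sketch in which every path and name is invented; the VCS and MySQL steps are left as comments, and only the content-file step actually runs:

```shell
#!/bin/sh
# Hypothetical single-site restore drill.
set -e   # abort immediately if any restore step fails
mkdir -p backups restore

# Pretend these artifacts were just pulled down from S3:
echo "CREATE TABLE posts (id INT);" > backups/db-latest.sql
mkdir -p site/content
echo "hello" > site/content/index.html
tar -czf backups/content.tar.gz -C site content

# 1. Code: would be a checkout/export of the tagged revision, e.g.
#    svn export file:///var/repos/example-site /var/www/example-site
# 2. Database: would be `mysql example_site < backups/db-latest.sql`
#    (the dump already carries the schema, per the discussion above)
# 3. Content files:
tar -xzf backups/content.tar.gz -C restore

cmp -s site/content/index.html restore/content/index.html \
    && echo "content restored"
```

Running the same script as a periodic drill against a scratch directory would also test the backups themselves, which is the part most backup systems skip.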
What do you think?
Re MySQL data dumps – that does link up your data and schema, but it’s probably wise to tag the dumps with your VCS revision number, since the MySQL backup doesn’t protect you from restoring data and schema that’s incompatible with your code. That’s not that big a deal for your customer’s main site code, since latest data-latest code will generally be true. I’m imagining a scenario where you have, say, some internal tweaks that you make to your CMS, and where you might not roll all the tweaks out to all of your customers’ sites. If you have to restore one and you just grab the latest patches, you could run into trouble. It’s up to you how farfetched that potential case is for you.
For the restore scenario, I think that makes sense, although restoring a single file will more often be something a customer has deleted or trashed than something a dev has done. I do like the automation idea – like “crash-only” software: if you have a really good backup and restore process, you can treat almost any little problem as a crash, simply redeploy, and be back up and running right away.