We were talking about backups recently, and I remembered one of the things we did at MindCandy to great success that I now consider essential to any online game which has a live database: use your live backup as the input for all your development machines.
NB: this is all about backing up LIVE SERVERS, not backing up your development machines – if you can’t make your development backups work perfectly then you have serious problems and need to get some better IT personnel. I’m not covering that issue at all: that’s just standard for any technology company!
The problem with backups
Is simply this: do you KNOW that when you really need them you will actually be able to restore them?
How do you know? Have you *checked*? What have you checked – have you personally tried *every single backup* the moment the backup was completed, and checked that *nothing* was missed out?
In almost all cases the answer is “of course not; that would take ridiculous amounts of time”.
What’s in a backup?
You’re looking for a few things:
- Can you restore at all? (it doesn’t matter what you have backed up if your attempt to restore produces nothing for some reason)
- Did anything back up at all? (did the backup process itself even run, or did it crash and produce no output at all?)
- Is the backup complete? (did EVERYTHING get backed up?)
- Is the backup consistent? (if you tried to restore the backup right now, would it work out of the box, or does it have some internal corruption, version mismatches, inconsistent data etc that needs to be fixed by hand?)
Some people approach this by writing Unit Tests *for their backups*. That’s not a bad idea, but unless you write a whole batch of unreusable tests to check every aspect of the data, AND keep them up to date, it really only tests whether the backup is “consistent”.
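For illustration, here’s a minimal sketch of what one such backup test might look like, assuming the dump has already been restored into a scratch SQLite database – the path, the table names and the queries are all hypothetical, and they’re exactly the kind of per-feature assertions that go stale unless somebody keeps maintaining them:

```python
"""Hedged sketch of a "unit test for backups": run a few basic checks against a
copy of the backup that has already been restored into a scratch database.
The path, table names and queries below are hypothetical."""

import sqlite3
import unittest

RESTORED_DB = "/tmp/restored_backup.sqlite"  # hypothetical scratch restore of last night's dump


class BackupConsistencyTest(unittest.TestCase):
    def setUp(self):
        self.db = sqlite3.connect(RESTORED_DB)

    def tearDown(self):
        self.db.close()

    def test_backup_restored_something(self):
        # "Did anything back up at all?" - at least the core tables must exist.
        tables = {row[0] for row in self.db.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")}
        self.assertIn("accounts", tables)      # hypothetical table name
        self.assertIn("characters", tables)    # hypothetical table name

    def test_backup_is_consistent(self):
        # "Is the backup consistent?" - e.g. no character rows pointing at a missing account.
        orphans = self.db.execute(
            "SELECT COUNT(*) FROM characters c "
            "LEFT JOIN accounts a ON c.account_id = a.id "
            "WHERE a.id IS NULL"
        ).fetchone()[0]
        self.assertEqual(orphans, 0)


if __name__ == "__main__":
    unittest.main()
```

Note that even a test like this says nothing about whether EVERYTHING was backed up – it only exercises the handful of tables somebody remembered to write assertions for.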
The problem with MMO backups…
…is that you only want to back up three things:
- the configuration of each server (config files, OS settings, etc)
- all the persistent data in the game (player accounts, what level each character is, what items each character owns, etc)
- all the gathered statistics (how many people are playing the game, where people spend their time, how many people killed monster X today, etc – all the gameplay metrics)
The first one changes rarely, and is easy to back up and to MANUALLY CHECK: this should be part of your automatic build process anyway, but some companies (unfortunate ones) still maintain their live servers by hand.
The last one generates in the region of tens of gigabytes of data a day for a moderately successful MMO, and … frankly, there’s no real way of checking it. If you lose it, the game doesn’t suffer, but your designers and marketing people will be pissed off. You can apply the same process to this data as I’m going to mention below, but it’s not business-critical (for most companies).
The middle one is the really important one: if you lose this data, you lose all your customers. This is the core of the game: the personal advancement of each and every player.
It’s also the really nasty one. By definition, NONE of that data exists on the development servers and workstations, none of it exists inside the office of the game developer – because unlike everything else inside your game (3d models, textures, quests, game logic, AI, etc) that is authored inside the dev studio, this data is created on the live servers, by the players.
Eating (your own) dogfood
It’s pretty standard to take a full backup of your live servers every 12 or 24 hours, and partial/incremental backups either every 15 to 30 minutes, or even every minute or every second (if you can).
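As a rough sketch of that cadence (the post doesn’t say what database or tooling was involved, so the two backup scripts below are placeholders), something as simple as this would do it – though in practice you’d normally let cron or a similar scheduler drive it:

```python
"""Hedged sketch of the backup cadence described above: one full backup every
24 hours, an incremental every 15 minutes. The two scripts it calls are
hypothetical placeholders for whatever your database's dump tooling is."""

import subprocess
import time

FULL_INTERVAL = 24 * 60 * 60       # one full backup per day
INCREMENTAL_INTERVAL = 15 * 60     # incrementals every 15 minutes


def run(cmd):
    # check=True makes a failed backup blow up loudly instead of failing silently.
    subprocess.run(cmd, check=True)


def main():
    last_full = 0.0
    while True:
        now = time.time()
        if now - last_full >= FULL_INTERVAL:
            run(["/opt/backup/full_dump.sh"])          # hypothetical full-dump script
            last_full = now
        else:
            run(["/opt/backup/incremental_dump.sh"])   # hypothetical incremental script
        time.sleep(INCREMENTAL_INTERVAL)


if __name__ == "__main__":
    main()
```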
As it turns out, in all MMO game development you have more than one copy of the gameservers. You have:
- DEVELOPMENT: inside the studio, these change (and break!) every few minutes, as people actively update and alter the game code or add new art assets etc
- STAGING: inside the studio, a “stable” build that changes perhaps once a day or perhaps once an hour. This is where people test things that they believe work
- QA: like staging, but only gets a new build when the programmers + artists are “convinced” that the code works, and they want the QA dept to test a specific version
- LIVE: if QA approves a given build, then it gets put live. Any mistakes at this point hurt the players
(a lot of studios have some other stages, e.g. some have PUBLIC TEST servers where stuff that is due to go live “very soon” can be given a last phase of testing by players and the dev team can get feedback from real players. This is a great place to try out “experiments” and see what the players think before you make it official)
Usually, you need several sets of faked data internally to simulate having players in the world. These are normally programmatically generated, or made by QA when they’re not busy doing other stuff. They are just “test data”.
So, what we did was to use the previous night’s live backup to overwrite the data on STAGING each morning. This was the ONLY way that data could be put on STAGING, other than changing it by running quests and things on the STAGING server itself.
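A minimal sketch of that overwrite, assuming a PostgreSQL-style dump – the database isn’t named here, so the tools, paths and database name are purely illustrative:

```python
#!/usr/bin/env python3
"""Minimal sketch of the nightly "live backup -> STAGING" overwrite, assuming a
PostgreSQL-style dump. The paths, the database name and the choice of
PostgreSQL are illustrative - the original setup isn't documented in this post."""

import subprocess
import sys

BACKUP_PATH = "/backups/live/latest.dump"  # hypothetical location of last night's full backup
STAGING_DB = "game_staging"                # hypothetical staging database name


def restore_live_backup_to_staging():
    # Drop and recreate the staging database, then restore last night's live dump into it.
    steps = [
        ["dropdb", "--if-exists", STAGING_DB],
        ["createdb", STAGING_DB],
        ["pg_restore", "--dbname", STAGING_DB, BACKUP_PATH],
    ]
    for cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Fail loudly: a visibly broken staging is the signal the whole team should see.
            print("RESTORE FAILED at step: " + " ".join(cmd), file=sys.stderr)
            sys.exit(1)


if __name__ == "__main__":
    restore_live_backup_to_staging()
```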
This bought us two things:
- If you added lots of stupid annoying crap (e.g. a million chairs dotted around town) one day, you didn’t have to worry about clearing it up: it would disappear the following morning
- If there was ANY failure with the backup OR the restore, the entire dev team would notice during their day to day work
Rather than have one person devoted to “testing” each backup + restore, everyone was exposed to the backup daily, and was effectively checking it. This was really helpful when small parts of the backup failed – for instance, sometimes we’d add a feature in an odd / hacked together way, and just one small table in the database wouldn’t correctly restore. We’d never have noticed this with unit tests, but the programmer working on adding even more to that feature immediately noticed that all his results seemed out of date. One quick check of the DB later, and we could see that some data was missing.
I’m a huge fan of this kind of “invisible self-testing”, whereby you don’t have to rely upon any individual to perform a repetitive task and/or do long, boring testing, and instead merge it into your daily development process in such a way that you get it “for free”. If you don’t already do this, I highly recommend you do it – or an equivalent – for every frequently-changing system you have. It caught plenty of bugs and problems due to external systems long before they could bite us in the ass.
(this post is another one about improving your dev process)
2 replies on “MMO Backup and Restore”
The only potential problem I see there is if you wanted to test something on staging and the backup process for the day had failed for whatever reason…
This is partly a scheduling issue, of course.. (staging’s down today? Okay, do…)
Yes, my unstated assumption here is that a failed restore is a severity-1 issue that you drop everything to fix.
It’s a good idea to run some unit tests on the about-to-restore data before overwriting your current staging DBs, just to sanity-check it. But IIRC we also kept a “known good” copy of recent live data, e.g. from a week ago, which was useful for cases like the one you describe: while someone looked at what had gone wrong with the live backup (the failure is very rarely at the staging server, so fixing it usually didn’t need them to do anything to staging), we could quickly get a mostly-working staging back up.
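To make that concrete, here’s a hedged sketch of the fallback just described – prefer last night’s backup if it passes a basic sanity check, otherwise restore the retained “known good” copy. The paths, the size threshold and the restore script are all hypothetical:

```python
#!/usr/bin/env python3
"""Sketch of the fallback described above: restore last night's live backup if
it passes a cheap sanity check, otherwise fall back to a retained "known good"
copy. Every path, threshold and script name here is hypothetical."""

import os
import subprocess
import sys

LATEST_BACKUP = "/backups/live/latest.dump"          # hypothetical: last night's backup
KNOWN_GOOD_BACKUP = "/backups/live/known_good.dump"  # hypothetical: e.g. a copy kept from last week


def looks_sane(dump_path):
    # Cheapest possible check: the dump exists and isn't suspiciously small.
    # A real version would restore into a scratch DB and run the unit tests
    # mentioned above before touching staging.
    try:
        return os.path.getsize(dump_path) > 100 * 1024 * 1024
    except OSError:
        return False


def restore_to_staging(dump_path):
    # Hypothetical wrapper around whatever actually rebuilds the staging DB.
    subprocess.run(["/opt/backup/restore_to_staging.sh", dump_path], check=True)


def main():
    if looks_sane(LATEST_BACKUP):
        restore_to_staging(LATEST_BACKUP)
    else:
        print("Last night's backup failed sanity checks; restoring known-good copy",
              file=sys.stderr)
        restore_to_staging(KNOWN_GOOD_BACKUP)
        sys.exit(1)  # still exit non-zero so the failed live backup gets investigated as sev-1


if __name__ == "__main__":
    main()
```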