Womble's Word of the Week: Retrofit!

Posted: Thu, 12 June 2008

In a perfect world, when we design some aspect of our infrastructure, we'd get it right the first time, and would never have to change our design once it had been deployed.

Back in this world, however, we regularly have to change things as we expand our infrastructure. Sometimes we made a mistake originally; other times new requirements pop up. In either case, from then on we'll be doing things differently from how they were done in the past.

The danger here is that if you only create your new machines or services differently, you've effectively got two separate and different systems to maintain. This may not sound like much, until it happens again... and again... and again... and then suddenly you've got 18 different ways of setting up backups, depending on how long ago each one was set up and (if you're really unlucky) who did it and what the phase of the moon was at the time.

If you've been here before, you probably remember the pain of having some disaster happen and needing to fix it NOW NOW NOW, only to wade into the setup and realise that it's been done in a way that (you thought) died out six months ago -- and so you don't really remember how it works, and yeah, the documentation's been updated so many times since then that it doesn't mention the old way of doing things, and here comes your boss to ask how long it's going to be until it's all working again...

If you haven't had this experience (yet?), you'll have to trust me when I say that it's no fun to hit an older, out-of-date system. It's especially painful because the most common reason for changing things is that the old way sucked. So you turn up at a critical moment to fix something in a poorly documented and sub-optimally configured system.

Fun, fun, funnity fun.

There are only two ways to avoid this problem: never change the way you do anything, or else plan your changes so that retrofitting is part of the plan every time you modify an existing service.

Assuming that never changing anything isn't an option, you need to plan to retrofit. When you're putting together the design for your system change (you do design your changes, right?) you need to audit what's out there, determine what's needed to bring it all up to date, and put together a plan to make that happen. It's like documentation and testing -- if you don't make an explicit effort to make it happen, it'll never get done.
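As a rough illustration of the audit step, here's a minimal Python sketch that groups machines by the exact contents of a config file, so you can see at a glance how many variants are out there. It assumes you've already collected each host's config into a dict keyed by hostname; the hostnames, the config contents and the `find_config_variants` function are all hypothetical.

```python
import hashlib
from collections import defaultdict


def find_config_variants(configs):
    """Group hostnames by the SHA-256 digest of their config text.

    `configs` maps hostname -> config file contents. Gathering those
    contents from the machines is assumed to have happened already.
    Returns a dict of digest -> sorted list of hostnames.
    """
    variants = defaultdict(list)
    for host, text in configs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        variants[digest].append(host)
    return {digest: sorted(hosts) for digest, hosts in variants.items()}


if __name__ == "__main__":
    # Hypothetical backup configs gathered from four machines.
    configs = {
        "web1": "backup_method = rsync\n",
        "web2": "backup_method = rsync\n",
        "db1": "backup_method = tar\n",
        "mail1": "backup_method = rsync\n",
    }
    variants = find_config_variants(configs)
    print(f"{len(variants)} different backup setups in use")
    for hosts in sorted(variants.values()):
        print(", ".join(hosts))
```

Every group other than the one matching your current standard is a candidate for retrofitting. Exact hashing is deliberately blunt: it flags trivial differences like whitespace too, which is arguably what you want when you're hunting for skew.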

That sounds like a lot of work, and it is. Luckily, if you're already working in a well-maintained infrastructure you should only really have one existing config to upgrade, which shouldn't be too painful. However, if you've got a lot of machines to change there's no way around the fact that they've all got to be changed, and that can take time.

If you accept that you don't want configuration skew, you can start to think about how to engineer your systems and processes to make it less of a hassle to keep everything up to date. Putting some extra thought into how you set things up in the first place so that it's easier to update is at the heart of this. Automating deployment is an easy way to take some of the pain away from rolling out new configs, but there are also plenty of smaller decisions that help with updating configs: putting everything in revision control, making extensive use of templating and pulling out common config fragments, and making sure that you've got good notes and records of what has been done where.
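To make the templating idea a bit more concrete, here's a small sketch using Python's standard `string.Template`. A shared fragment is defined once and pulled into each host's config, so changing the fragment changes every rendered config. The setting names, default values and `render_mail_config` function are made up for illustration; they aren't any real mail server's syntax.

```python
from string import Template

# Hypothetical fragment shared by every mail server. Change it here once
# and every rendered config picks up the change on the next run.
COMMON_FRAGMENT = Template(
    "max_message_size = $max_size\n"
    "admin_contact = $admin\n"
)

# Per-file template; the shared fragment is pulled in rather than copied.
# ${common} needs braces because the next literal starts with a letter.
MAIL_TEMPLATE = Template(
    "# Generated from a template; do not edit by hand.\n"
    "hostname = $hostname\n"
    "${common}"
    "relay_host = $relay\n"
)


def render_mail_config(hostname, relay, max_size=10_000_000,
                       admin="postmaster@example.com"):
    """Render one host's config from the shared fragment plus host values."""
    common = COMMON_FRAGMENT.substitute(max_size=max_size, admin=admin)
    return MAIL_TEMPLATE.substitute(hostname=hostname, common=common,
                                    relay=relay)


if __name__ == "__main__":
    for host in ["mail1", "mail2"]:
        print(render_mail_config(host, relay="relay.example.com"))
```

Keeping the templates (not the rendered output) in revision control means the history of *why* a setting changed lives in one place rather than being scattered across every machine.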

The more you automate your deployment and maintenance, the less pain you feel when you need to retrofit. With something like Puppet, you can eliminate a lot of the pain because you're describing what you want done for all the relevant machines, and then the machines themselves work out what needs to be done to bring themselves up to date.

With an advanced automation system in place, you can say "I want all my mail servers to have setting X turned on" and off they go and set the option. Or you might want some of your mail servers to have a different value, triggered by a config setting or the class of machine.
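Here's a toy Python model of that declarative idea: you describe the desired end state per class of machine, and each machine works out the difference between what it has and what it should have. This is only a sketch of the concept, not how Puppet is actually implemented or configured; the class names, setting names and `plan_changes` function are hypothetical.

```python
# Desired settings per machine class (hypothetical class and setting names).
DESIRED_BY_CLASS = {
    "mail_server": {"setting_x": "on", "spam_filter": "enabled"},
    "outbound_mail_server": {"setting_x": "on", "spam_filter": "disabled"},
}


def plan_changes(current, machine_class):
    """Return the settings that must change for `current` to match its class.

    We only describe the end state; the diff between what's there and
    what's wanted is worked out per machine.
    """
    desired = DESIRED_BY_CLASS[machine_class]
    return {key: value for key, value in desired.items()
            if current.get(key) != value}


if __name__ == "__main__":
    fleet = {
        "mail1": ("mail_server", {"setting_x": "off", "spam_filter": "enabled"}),
        "mail2": ("outbound_mail_server", {"spam_filter": "enabled"}),
        "mail3": ("mail_server", {"setting_x": "on", "spam_filter": "enabled"}),
    }
    for host, (cls, current) in sorted(fleet.items()):
        changes = plan_changes(current, cls)
        print(host, changes or "already up to date")
```

Note that every machine needs a class for this to work at all: a host that predates the class scheme raises a `KeyError` here rather than silently being skipped.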

Of course, you might not already have that config setting or class defined, so you'll need to add it. Don't forget to retrofit that new config setting or class to your existing machines. That might involve finding the config for all the existing machines and modifying them.
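If your per-machine configs live as files in a revision-controlled checkout, that retrofit can be a short script rather than a login marathon. The sketch below assumes a hypothetical layout of one plain-text `*.conf` file per machine containing `key = value` lines, and appends a setting (such as a new class assignment) to every file that doesn't have it yet. The layout, the `machine_class` key and the `retrofit_setting` function are illustrative assumptions, not part of any particular tool.

```python
from pathlib import Path


def retrofit_setting(node_dir, key, default):
    """Append `key = default` to every `*.conf` file in `node_dir` lacking `key`.

    Assumes one plain-text file per machine with `key = value` lines.
    Files that already define `key` are left untouched, so running this
    twice is harmless. Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(node_dir).glob("*.conf")):
        lines = path.read_text().splitlines()
        keys = {line.split("=", 1)[0].strip() for line in lines if "=" in line}
        if key not in keys:
            lines.append(f"{key} = {default}")
            path.write_text("\n".join(lines) + "\n")
            changed.append(path.name)
    return changed
```

Run it against a checkout of your config repository (for example `retrofit_setting("nodes", "machine_class", "mail_server")`), review the diff, then commit and let your deployment tooling push the result out.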

Yep, no matter how much automation you've got you can't get away from the need to retrofit, but given the choice between editing a bunch of config files on my local machine (with all the power of my text editor to back me up) or logging into dozens of machines all over the globe to edit their configs and restart services (with all the risks of making a mistake) I know which one I'd choose...
