Water supply, in the developed world, is one of those things you pretty much take for granted. You turn on the tap, and clean, cool, refreshing water comes out. Similarly, when I go to a fire in my shiny fire truck, no matter what time of the day or night, I expect to be able to hook up to a hydrant and have a strong, steady supply of water available.
This level of reliability is something that we in the IT industry can typically only dream of. Practically 100% reliability over a period of decades, without constant maintenance and tweaking? Not a hope. To even get close to that, we need clusters of fully-redundant servers, fancy database replication techniques, and probably something totally out of left-field like Erlang's ability to reload code on the fly.
But how do the water utilities do it? Clusters of fully-redundant high-capacity pumps, fancy pipe re-routing techniques? Or something totally out of left-field like a big water tank and gravity?
Where I live, we've got the latter. Thinking about how it probably operates and is managed, I'm blown away by the sheer simplicity and robustness of the whole design, and how it can handle all manner of failures.
First off, consider how few components have to be working in order for the water supply to continue for a while after some sort of catastrophic failure. We need:
- Gravity. The water tank is at a high point, and gravity (via the hydrostatic pressure of the water column) keeps the pipes pressurised. While gravity can, in theory, fail, if it actually does then we've got bigger problems.
- The water tank must be present and with at least some capacity to hold water. Yep, we can blow a hole in the top or side of the water tank and still have some (diminished) capacity to supply water. Try blowing a hole in the side of your data centre and see what happens.
- Piping from the water tank to the water consumption point. I'd say this is the weakest link, but it's a pretty robust link overall (modulo rampaging backhoes) and any other centralised water supply method (other than perhaps teleportation) is going to be at the mercy of a piping failure too.
Note that this list of elements doesn't include any moving parts, or even guaranteed continuous water supply from a dam or other huge supply store. You can lose your feed pump(s), supply lines, or anything else that's on the supply side of the water tank for some period of time and nobody on the consumption side will know or care. Need more resilience to supply-side failure? Just build a bigger tank.
This blows my mind, it really does. Need to do pump maintenance? No problem, just make sure that the tank is large enough to service demand over the period of the "outage", and go for it. Can't find a pump supplier to give you a restoration SLA of less than a week? Just make sure you've got a water tank that'll provide for a week's consumption.
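The sizing arithmetic behind "just build a bigger tank" is simple enough to sketch. Here's a back-of-the-envelope calculation for riding out a supply-side outage; the demand figure and SLA period are invented for illustration, not real utility numbers:

```python
# Back-of-the-envelope tank sizing for a supply-side outage.
# Assumes zero inflow during the outage and a constant consumption rate.

def tank_size_for_outage(demand_lps: float, outage_hours: float) -> float:
    """Litres of storage needed to keep supplying water while the
    feed pump is down, at a steady demand of `demand_lps` litres/second."""
    return demand_lps * outage_hours * 3600  # L/s * seconds in the outage

# Suppose the town draws 50 L/s on average and the best pump-restoration
# SLA you can get is one week:
litres = tank_size_for_outage(50, 7 * 24)
print(f"{litres / 1e6:.2f} megalitres of storage")  # 30.24 megalitres
```

Nothing on the consumption side needs to know any of this; the tank just has to be at least that big.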
Basically, any supply-side reliability problem can be solved with "build a bigger tank", and while there's a limit to how big we can make tanks, I'll bet we know a lot more about building huge water tanks to withstand some freaky failure conditions than we do about building pumps that won't fail for 100 years. If you do need more capacity than a single tank can provide, just parallelise -- horizontal scaling of the water supply. Woohoo!
Your big water tank also provides cost savings in your other equipment. If you had to pressurise a water supply using pumps alone, not only would you need your redundant array of very expensive pumps (hmm, RAVEP-5 sounds like a Doctor Who villain), but those pumps would need to be able to provide your peak consumption flow (toilet breaks during the Super Bowl, probably). I'd imagine that could get mighty expensive, while providing no extra benefit 99% of the time.
I see this capacity problem at work all the time. Customers who have occasional massive traffic spikes need to massively over-provision relative to their average utilisation to successfully service that 1% of the time that they're doing heavy traffic. Yes, I know, cloud computing, horizontal scaling, capacity on demand, yadda yadda yadda. It's not a panacea, and the number of apps that are designed to properly scale horizontally over a large range of traffic volumes is minuscule. My general point, though, is that when you've got a variable load, it costs a whole lot more to provision systems for peak demand than to provision systems that can deal with average demand.
A suitably large water tank means you can easily deploy a much smaller capacity feed pump. All you have to do is make sure that when demand exceeds supply, you've got enough water in your tank to cover the difference between the demand and supply over that period. When your peak capacity increases, and is starting to strain your infrastructure, you've got a choice, too: you can upgrade the pump or increase your storage capacity, whichever is cheaper / easier / quicker / provides better kickbacks / whatever.
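That "cover the difference between demand and supply over that period" rule can be sketched too. A minimal model, with an invented hourly demand profile and pump rate: track the worst cumulative shortfall of the pump against demand, and that's your minimum storage.

```python
# Peak-shaving sketch: a modest constant-rate pump plus storage,
# instead of a pump sized for peak flow. All figures are made up.

def storage_needed(demand_by_hour, pump_lph):
    """Minimum storage (litres) so the tank never runs dry, given a
    constant refill rate and an hourly demand profile. The answer is
    the worst cumulative excess of demand over supply."""
    level_drop = 0.0  # how far below "full" the tank currently sits
    worst = 0.0
    for demand in demand_by_hour:
        level_drop += demand - pump_lph    # positive when demand outruns the pump
        level_drop = max(level_drop, 0.0)  # surplus refills the tank, but no further than full
        worst = max(worst, level_drop)
    return worst

# A quiet day with one Super-Bowl-toilet-break spike (litres per hour),
# served by a pump that can only manage 300 L/h:
profile = [100, 100, 100, 800, 900, 200, 100]
print(storage_needed(profile, pump_lph=300))  # 1100.0
```

The same trade-off falls out of the model: when the spike grows, you can either raise `pump_lph` or raise the tank size, whichever is cheaper.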
This is all well and good, you say, but computer systems aren't water supplies. There are a lot more moving parts, inputs, and outputs, and those all have to be handled. This is quite true. Water-powered computers are not in high demand. However, I think we could produce some much more reliable systems if we looked for ways to simplify capacity and redundancy issues, water supply style, instead of layering more and more cruft onto our solutions.
Ironically, just after I started writing this post (and this gives you an idea of how long it's been sitting in my drafts folder), I saw an article in the RISKS digest about a water supply problem in Santa Cruz, caused by a power outage killing the refill pump for an extended period and resulting in the storage tank running dry. Hey, I never said that it was a foolproof system -- especially when a failure of imagination leaves the system that tells people there's no power dependent on the very power source it's monitoring in order to deliver the message...