Sat, 13 Jun 2009

The Rise and Fall of WordPerfect

It's been sitting in my browser for so long that I've forgotten where I got it from, but I've just finished reading The Rise and Fall of WordPerfect Corporation, written by W. E. "Pete" Peterson, one of the very early employees of the company, and one of the top people there for many years.

It's a really interesting book, giving some insights into what it was like in the early days of "the microcomputer revolution", and with some solid observations of how to (and how not to) run a software company, and manage a very rapidly growing organisation. I'd recommend it to computer historians, software development managers, and business owners.



posted at: 22:57 | category: /general | permalink


"Tx unit hang" in e1000 driver

In the "blogging it so I don't forget about it" category, and also to try and give Google some hints, here's one from the vault I recently came across again...

Older e1000 chips (specifically the 82573(V/L/E)) have firmware which enables the (buggy) power management functionality in the chips. Although this shouldn't occur with hardware shipped in the last several years (the updated firmware was released ages ago), I just came across it recently in a supposedly-new server, so who knows what's going on there...

The problem manifests itself as flaky or unstable network performance, combined with the kernel / log messages:

	Detected Tx Unit Hang
	NETDEV WATCHDOG: ethX: transmit timed out

The fix is as described in the "82573(V/L/E) TX Unit Hang Messages" section of the Intel Linux driver documentation for the e1000 driver. The reason I'm blogging about it is that the top Google results point to either the now-dead e1000.sf.net wiki (which used to have all the necessary info, but now doesn't), or else forum posts which point to the now-dead e1000.sf.net wiki. Either way, it's not trivial to find a working location for the fix, so... linky love time!
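If you want to check a pile of machines for the symptom before going hunting for the fix, something like the following rough sketch will do. It only looks for the two messages quoted above; the sample lines are made up for illustration, and real kernel log lines will carry extra prefixes (timestamps, driver and interface names) that I'm not trying to reproduce here.

```python
import re

# The two symptom messages quoted above; the interface name varies per machine.
HANG_PATTERNS = [
    re.compile(r"Detected Tx Unit Hang"),
    re.compile(r"NETDEV WATCHDOG: \S+: transmit timed out"),
]


def find_tx_hangs(log_lines):
    """Return the log lines that match either symptom message."""
    return [
        line
        for line in log_lines
        if any(pattern.search(line) for pattern in HANG_PATTERNS)
    ]


# Illustrative input only; these are not copied from a real log.
sample = [
    "kernel: Detected Tx Unit Hang",
    "kernel: NETDEV WATCHDOG: eth0: transmit timed out",
    "kernel: eth0: link up",
]
hits = find_tx_hangs(sample)
```

Feed it the lines from dmesg or your kernel log file; any hits on an 82573-based card are a good hint that it's worth reading the Intel documentation section named above.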



posted at: 04:01 | category: /general | permalink


Mon, 08 Jun 2009

Quote of the Day

From a post on StorageMojo:

Lustre is not a product I would recommend since it was designed for HPC, a market where PhDs work as sysadmins.

While I've not used Lustre itself, the "market where PhDs work as sysadmins" (and the often less-than-entirely-spectacular results) are something I've dealt with more than once.



posted at: 02:36 | category: /general | permalink


Tue, 02 Jun 2009

Local HTML docs for Prototype

Having (temporarily) flung myself back into a bit of professional Rails app development, I'm doing a JS-heavy site (just to maximise the WTF factor). Since I'm Old Skool, I'm going with Prototype, as it's the javascript library I'm least unfamiliar with.

The problem, though, is that I do about a fifth of my work on the train, where I don't have access to the Internet, and the Prototype people haven't seen fit to provide a downloadable version of the docs in HTML. There's a PDF[1] (which lacks any sort of hyperlinking (how very 1990s) and whose page numbering in the TOC doesn't match up to the PDF page numbers), and a CHM which has hyperlinks that don't work half the time. Apart from that, all the docs are online-only, and not very scraping-friendly (wget --mirror doesn't rewrite links correctly when you're also mangling the filenames for local-friendly access -- patch in the making if I can be arsed).

So, long story short, I need a non-traditional way to get myself some local docs. Turns out it isn't as hard as it sounds, as evidenced by the below command list:

	git clone git://github.com/sstephenson/prototype.git
	cd prototype
	git submodule init
	git submodule update
	sudo gem install treetop
	rake doc:build

This will peg your CPU for an inordinate amount of time (a smidge over 10 minutes for me; -j 3 would have made things a lot quicker), but you'll end up with a doc/index.html which has some useful stuff in it. It'll look nothing like what's on the website, but it's something, and should be enough to keep me going on those long, lonely train journeys.


1. which looks like someone should be expecting a C&D from O'Reilly for look and feel, unless it's a semi-sooper-sekrit O'Reilly production that they don't want to put their name to directly.



posted at: 22:43 | category: /general | permalink


Fri, 15 May 2009

Water Tanks, Reliability, and Redundancy

Water supply, in the developed world, is one of those things you just pretty much take for granted. You turn on the tap, and clean, cool, refreshing water comes out. Similarly, when I go to a fire in my shiny fire truck, no matter what time of the day or night, I expect to be able to hook up to a hydrant and have a strong, steady supply of water available.

This level of reliability is something that we in the IT industry can typically only dream of. Practically 100% reliability over a period of decades, without constant maintenance and tweaking? Not a hope. To even get close to that, we need clusters of fully-redundant servers, fancy database replication techniques, and probably something totally out of left-field like Erlang's ability to reload code on the fly.

But how do the water utilities do it? Clusters of fully-redundant high-capacity pumps, fancy pipe re-routing techniques? Or something totally out of left-field like a big water tank and gravity?

Where I live, we've got the latter. Thinking about how it probably operates and is managed, I'm blown away by the sheer simplicity and robustness of the whole design, and how it can handle all manner of failures.

First off, consider how few components have to be working in order for the water supply to continue for a while after some sort of catastrophic failure. We need:

Note that this list of elements doesn't include any moving parts, or even guaranteed continuous water supply from a dam or other huge supply store. You can lose your feed pump(s), supply lines, or anything else that's on the supply side of the water tank for some period of time and nobody on the consumption side will know or care. Need more resilience to supply-side failure? Just build a bigger tank.

This blows my mind, it really does. Need to do pump maintenance? No problem, just make sure that the tank is large enough to service demand over the period of the "outage", and go for it. Can't find a pump supplier to give you a restoration SLA of less than a week? Just make sure you've got a water tank that'll provide for a week's consumption.

Basically, any supply-side reliability problem can be solved with "build a bigger tank", and while there's a limit to how big we can make tanks, I'll bet we know a lot more about building huge water tanks to withstand some freaky failure conditions than we do about building pumps that won't fail for 100 years. If you do need more capacity than a single tank can provide, just parallelise -- horizontal scaling of the water supply. Woohoo!

Your big water tank also provides cost savings in your other equipment. If you had to pressurise a water supply using pumps, not only would you need your redundant array of very expensive pumps (Hmm, RAVEP-5 sounds like a Doctor Who villain), but those pumps would need to be able to provide your peak consumption flow (toilet breaks during the Super Bowl, probably). I'd imagine that could get mighty expensive, and without providing much benefit for 99% of the time.

I see this capacity problem at work all the time. Customers who have the occasional massive traffic spikes need to massively over-provision their average utilisation to successfully service that 1% of the time that they're doing heavy traffic. Yes, I know, cloud computing, horizontal scaling, capacity on demand, yadda yadda yadda. It's not a panacea, and the number of apps that are designed to properly scale horizontally over a large range of traffic volumes is minuscule. My general point, though, is that when you've got a variable load, it costs a whole lot more to provision systems to supply peak demand than to provision systems that can deal with average demand.

A suitably large water tank means you can easily deploy a much smaller capacity feed pump. All you have to do is make sure that when demand exceeds supply, you've got enough water in your tank to cover the difference between the demand and supply over that period. When your peak capacity increases, and is starting to strain your infrastructure, you've got a choice, too: you can upgrade the pump or increase your storage capacity, whichever is cheaper / easier / quicker / provides better kickbacks / whatever.
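To put some numbers on that trade-off, here's a minimal sketch of the sizing sum. Everything in it is an assumption for illustration: the demand figures are invented, the units are whatever you like (say, kilolitres per hour), and it assumes the tank starts full and the pump runs flat out whenever the tank isn't.

```python
def min_tank_size(demand, pump_rate):
    """Smallest tank (starting full) that never runs dry.

    demand: water drawn in each time step.
    pump_rate: the most the feed pump can add in one time step.
    """
    deficit = 0.0  # how far below full the tank currently sits
    worst = 0.0    # the deepest it ever gets
    for used in demand:
        # The pump tops the tank up, but can't fill it past full.
        deficit = max(0.0, deficit + used - pump_rate)
        worst = max(worst, deficit)
    return worst


# Invented hourly demand with a two-hour spike in the middle.
hourly_demand = [2, 2, 10, 10, 2, 2]
tank_needed = min_tank_size(hourly_demand, pump_rate=5)
```

With those made-up figures, a pump half the size of the peak gets by with a tank holding 10 units, and a pump sized for the peak (10 per hour) needs no tank at all. Where the cheaper point lies between those two is the "upgrade the pump or build a bigger tank" choice.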

This is all well and good, you say, but computer systems aren't water supplies. There are a lot more moving parts, inputs, and outputs, and those all have to be handled. This is quite true. Water-powered computers are not in high demand. However, I think we could produce some much more reliable systems if we looked for ways we could simplify capacity and redundancy issues, water supply style, instead of layering more and more cruft into our solutions.


Ironically, just after I started writing this post (and this gives you an idea of how long it's been sitting in my drafts folder), I saw an article in the RISKS digest about a water supply problem in Santa Cruz, caused by a power outage killing the refill pump for an extended period and resulting in the storage tank running dry. Hey, I never said that it was a foolproof system -- especially when you've got failures of imagination that result in the system that tells people that there's no power relying on the power source it's monitoring being up in order to be able to tell people that there's no power...



posted at: 08:34 | category: /general | permalink


Fri, 01 May 2009

The Accounting Equation

Not a usual topic of blogging, but most of us need to keep track of our money somehow (which I do with the help of the most excellent Ledger command-line book-keeping tool). I've just been pointed to a really great description of The Accounting Equation, which is the basis of double-entry book-keeping. Helped to straighten a few things out in my head.



posted at: 12:11 | category: /general | permalink


Sun, 26 Apr 2009

~/porn

It seems there are plenty of people who do take careful note of my scribblings, even down to noticing the apparent directory I'm in when I SSH to not-yet-existent SheevaPlug-based serial console concentrators. For the record:

Thanks, however, to those people who noticed and let me know; if I ever do make an embarrassing cut-n-paste gaffe, at least I'll know about it quickly.



posted at: 22:01 | category: /general | permalink


Thu, 23 Apr 2009

Insane/Brilliant Idea of the Day

I've been talking serial consoles with a couple of the other guys at work: how nice they are to have for machines in the datacenter, how annoying it is that vPro serial-over-LAN doesn't seem to be robust (yet?), and how serial access concentrators are lung-and-kidney expensive (especially when you've got 50-some racks to outfit).

This discussion, combined with my ongoing embedded-hardware-fascination lust for a SheevaPlug, appears to have spurred my brain into coming up with a Brilliant Idea: tie a SheevaPlug to a pile of USB to serial adapters and use that as your per-rack serial concentrator. Imagine: faffenheimer, a dedicated server you manage for a customer, and located in rack 27 of your DC, has just crashed, and you'd like to know WTF has happened rather than just blindly reboot, but you're in the office 15 minutes away from the DC floor, and the customer's going to want that machine back up and running pretty quickly.

workstation:~/porn$ ssh rack27.serial
rack27:~$ sconsole faffenheimer
[screen session attached]
[minicom running, shows the horror of a kernel crash dump]
[oh look at that, kernel bug]
^A ESC
[pgup pgup]
[enter]
[pgdn pgdn]
[enter]
^A >/tmp/faffenheimer-crash-dump
^A d
rack27:~$ exit
workstation:~/porn$ scp rack27.serial:/tmp/faffenheimer-crash-dump ~
workstation:~/porn$ powercycle rack27 faffenheimer

Shiny! We got a crash dump in a minute or so (rather than having to take phonecam photos of KVM screens in the DC), never had to leave our comfy seat, and the machine's on its way back up. We're now free to pursue diagnostic activities on that crash dump at our leisure.

10 minutes later, the downtime for faffenheimer that was automatically set when we ran powercycle runs out and Nagios sends us threatening messages. Hmm, something's gone wrong here. Back into the console...

workstation:~/porn$ ssh rack27.serial
rack27:~$ sconsole faffenheimer
[screen session attached]
[Boot is hung waiting for root password after initrd has bombed]
[Type root password]
[Oh look, the root MD appears to have come asunder]
[clickety-click... fixee fixee]
[reboot]

The more I think about this, the more I reckon I'm onto a bona fide winner. The SheevaPlug is a powerful ARM-based system with USB/ethernet/SD ports that is packaged literally in its own power supply wall wart -- it's a plastic box with power plug prongs poking out the side. That's all there is to it. The USB to serial adapter things are likely to be a bit more of a pain, but I've played with enough of them by now to not be too scared. So, you plug the Sheeva into a power socket, plug an Ethernet cable and USB hub into the Sheeva, configure things a bit so that the system knows which serial adapter maps to which machine, and you're away. Oh, and the best bit: the Sheeva apparently draws as little as 2W when idle. A whole datacentre's worth of serial goodness for about a server's worth of power. The cost per rack should be somewhere below AU$250, especially in bulk.
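The "which serial adapter maps to which machine" bit could be as dumb as a lookup table. Here's a hedged sketch of what the guts of an sconsole-style wrapper might look like; none of it exists yet, the device paths, baud rates, and the second hostname are placeholders, and it lets screen open the serial port directly rather than running minicom inside it.

```python
# Hypothetical per-rack mapping: hostname -> (serial device, baud rate).
# ttyUSB numbering can change between boots, so a real setup would want
# stabler device names than these placeholders.
CONSOLES = {
    "faffenheimer": ("/dev/ttyUSB0", 115200),
    "examplehost": ("/dev/ttyUSB1", 9600),
}


def console_command(host, consoles=CONSOLES):
    """Build the screen command line that opens the serial port for host."""
    if host not in consoles:
        raise LookupError(f"no serial console configured for {host!r}")
    device, baud = consoles[host]
    # screen opens a tty itself when handed a device path and a baud rate;
    # -S names the session so it's easy to find and reattach later.
    return ["screen", "-S", f"console-{host}", device, str(baud)]


command = console_command("faffenheimer")
```

Hand the resulting list to subprocess.run() (or os.execvp) and you've got most of the wrapper. Reattaching to an already-running session rather than starting a second one is left out of the sketch.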

Let's see if I can convince work to spring for a Sheeva, a USB hub, and a half dozen or so USB to serial adapters to test this whole thing out. Given that the whole thing looks like it'd cost less than AU$250 (plus my R&D time), I can't imagine it'll be too hard a sell to at least give it a go. Watch this space...



posted at: 23:37 | category: /general | permalink


Wed, 22 Apr 2009

from idiot import *

I would like to propose that anyone who uses the following code to import a pile of cruft into a Python module be beaten, shot, drowned, hung, drawn, quartered, tarred, feathered, and made to watch a Vin Diesel marathon:

	from module import *

That is all.
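(For anyone wondering what the fuss is about, here's a small self-contained demonstration. The "cruft" module is made up and built in memory so the example runs on its own; the point is that a star import drags in every public name, including ones that quietly replace builtins.)

```python
import sys
import types

# Build a throwaway module in memory; "cruft" is a made-up name for the demo.
cruft = types.ModuleType("cruft")
cruft.helper = lambda: "useful"
cruft.open = lambda *args, **kwargs: "not the builtin"
sys.modules["cruft"] = cruft

# Run the star import in a fresh namespace so we can see what it drags in.
namespace = {}
exec("from cruft import *", namespace)

# exec adds __builtins__ to the namespace, so filter out dunder names.
imported = sorted(name for name in namespace if not name.startswith("__"))

# Inside that namespace, "open" no longer means the builtin.
shadows_builtin = namespace["open"] is not open
```

Nothing at the import site tells you that open() has changed meaning, which is exactly the sort of thing that eats an afternoon.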



posted at: 22:24 | category: /general | permalink


Really, Really Distributed Revision Control

While I'm not a fan of cut-n-paste coding, on the odd occasion it's handy to grab a snippet of code out of someone else's blog and plop it in.

Jeff Atwood of Coding Horror has a solution to some of the downsides of ripping a bit of code from somewhere on the Internet: publishers of code snippets tag each snippet with a newly generated GUID in a comment, and you paste that comment in with the rest of the code when you take it. Then, if you (or someone else) needs to look at the context of the chunk of code, find its author, look for improvements or commentary, or see who else has used that snippet in their own projects, you can search for that GUID and you should come up with only copies of that snippet as a result.

What I propose is this:

// codesnippet:1c125546-b87c-49ff-8130-a24a3deda659
- (void)fadeOutWindow:(NSWindow*)window{
        // code
}

Attach a one line comment convention with a new GUID to any code snippet you publish on the web. This ties the snippet of code to its author and any subsequent clones. A trivial search for the code snippet GUID would identify every other copy of the snippet on the web:

http://www.google.com/search?q=1c125546-b87c-49ff-8130-a24a3deda659

How very, very cunning. He also proposes that if you modify a snippet and republish it, you should keep the old GUID and also add one of your own, so you can track the origin and help other people find your alternate version.
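The mechanics are simple enough to sketch. What follows is my own illustration rather than anything from the original proposal: the function names are made up, and it assumes the "codesnippet:" comment format shown in the quoted example.

```python
import re
import uuid

# Matches the "codesnippet:<GUID>" tag format from the quoted example.
TAG = re.compile(
    r"codesnippet:([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def tag_snippet(code, comment_prefix="//"):
    """Prepend a fresh GUID tag, leaving any existing tags in place."""
    return f"{comment_prefix} codesnippet:{uuid.uuid4()}\n{code}"


def snippet_guids(code):
    """Return every codesnippet GUID found in the text, in order."""
    return TAG.findall(code)


original = (
    "// codesnippet:1c125546-b87c-49ff-8130-a24a3deda659\n"
    "- (void)fadeOutWindow:(NSWindow*)window{\n"
    "        // code\n"
    "}"
)

# Republishing a modified copy: keep the old tag, add one of our own on top.
republished = tag_snippet(original)
guids = snippet_guids(republished)
```

After that, guids holds the freshly generated GUID followed by the original one, so a search on either will turn up the republished version.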

Attaching unique identifiers to chunks of code and sending them all around the Internet. Does this sound like distributed revision control to anyone else? A lot easier to master than Git's UI, too...



posted at: 09:45 | category: /general | permalink