waffle

Dreamhost Nap

That was interesting.

Waffle and waffle software happened to be the only two sites in my hosting account hosted on a specific Apache box deep in Dreamhost’s cluster. This server was to be assigned a new IP address. A script was run to update the DNS A records to match, but the script crapped out before it got to those two domains, and so there I was without the updated record, DNS still pointing to the old IP address. This is why the site was down for almost two days.

There are a number of extenuating circumstances:

  • The server IP remapping happens regularly as new servers arrive. This kind of thing had been done tens of times for all of the boxes serving any domain or subdomain on my account, or on the average account.

  • The script responsible for the remap was one of the most time-tested scripts in the infrastructure, to the point where such a failure was nearly unheard of.

  • I noticed this yesterday already but decided to wait it out, thinking the problem was on my end. When I finally did report it a few hours ago, the support I received within minutes was courteous and clear: they described the problem, described the solution being applied and how long it’d take to kick in, and assured me that the cause of the glitch was being looked into in order to be fixed. They solved the problem and they apologized, both thoroughly.

Now. The reason I take this lightly (it is, after all, now two days since my last successful waffle visit) is that it clearly was a black swan. There’s a lot of yapping about how Dreamhost is a cheap host, but not even an expensive host can protect itself from everything; for a cheap host with lots of custom scripts and infrastructure, Dreamhost sure stays stable. Even if the script itself is thoroughly vetted (by which I mean, for example, line-by-line audits for security, stability, side effects and concurrency), upgrades to dependencies, hardware failures or limits can still alter its behavior.

That doesn’t absolve them of responsibility, but since they acted so swiftly and the story about the mechanical error checks out, it appears that I’m a victim of landing in the statistically insignificant fraction of things that will eventually go wrong, no matter what, given sheer numbers and time.

I have two serious concerns remaining:

  • Why didn’t the script failing make a bigger noise?
  • Why, in that case, didn’t a consistency-checking script ensure that all domains were properly mapped?

I hope that Dreamhost improves their infrastructure to more properly deal with the underlying trigger and maybe finish off with some script that can ensure that everything’s really, actually online. That would have spared me the inconvenience, and might well someday save someone with something intelligent to say or some real money to lose.
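
To make the second concern concrete, here is a rough sketch of the kind of consistency check I have in mind. It is not Dreamhost’s tooling, and I have no idea what theirs looks like; the function names, the hostnames and the 192.0.2.x addresses are all made up for illustration. The assumption is that the host keeps a table of which IP each domain is supposed to point at, and that the check compares this against what DNS actually answers.

```python
import socket
import sys
from typing import Callable, Dict, List, Set


def resolve_a_records(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the local resolver reports for hostname."""
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        # A lookup failure counts as "no records", which is also a mismatch.
        return set()
    return set(addresses)


def find_mismatches(
    expected: Dict[str, str],
    resolve: Callable[[str], Set[str]] = resolve_a_records,
) -> List[str]:
    """List every hostname whose DNS answer lacks its expected IP.

    `expected` is a hypothetical hostname -> IP table; the resolver is
    passed in so the check can be exercised without touching real DNS.
    """
    stale = []
    for hostname, ip in sorted(expected.items()):
        if ip not in resolve(hostname):
            stale.append(hostname)
    return stale


def report(
    expected: Dict[str, str],
    resolve: Callable[[str], Set[str]] = resolve_a_records,
) -> int:
    """Print each mismatch to stderr and return a non-zero code if any exist."""
    stale = find_mismatches(expected, resolve)
    for hostname in stale:
        print(f"MISMATCH: {hostname} should point to {expected[hostname]}",
              file=sys.stderr)
    return 1 if stale else 0
```

Run after every remap with something like `sys.exit(report(expected_table))`, the non-zero exit code is what makes the noise: whatever scheduler or wrapper runs the job can page someone instead of letting two domains quietly point at a dead address. A real version would presumably query the authoritative nameservers directly rather than a caching resolver, which this sketch does not attempt.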
