Saturday, November 04, 2006

Crashing MailWorks

Another Bechtel story.

Bechtel had standardized on DEC MailWorks for corporate email. The previous standard was PROFS on the mainframe. We had enough MailWorks users that we needed a VMS cluster to deal with the volume, and to have some redundancy in case of an outage, maintenance, etc... all the stuff you want a cluster for. I'm actually DEC certified on some of this stuff. I get lots of use out of that now, let me tell you.

One day, mail goes down. The senior VMS admins determine that the MailWorks server process had gone down. On all the machines in the cluster. At the same time.

OK. So they try to run it again, and it comes up. As they are trying to bring it up on another machine in the cluster, they both go down again. It had only been up for a few minutes. So they try one machine by itself. It runs for a minute or two, and goes down again.

They do some dump analysis, and can see that the process is crashing. Not that this helps with how to fix it. After a bit of in-house fiddling, DEC is called. Some phone support doesn't help, must be a hardware problem somewhere. On every box in the cluster? OK, a hardware problem in the cluster interconnect (CI), then. Waste time, cannibalize hardware, break cluster, determine that problem happens on one server, no cluster, and machine works fine for all other software. Dispatch DEC technician to site.

Reload OS, MailWorks software, runs clean. Problem solved? No, when you give it the mail spool, it crashes again. And yes, we DO need our old mail, thanks anyway.

But now we know it's something in our mail files that is causing it. Maybe we can figure that out and surgically remove it? OK, so they binary split the files, and determine that a single email is causing the problem.
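
For anyone who hasn't done a binary split by hand, here is roughly what it looks like as code. This is only a sketch in Python, not what the admins actually ran: `find_poison_item` and the `crashes` callback are made-up names, and it assumes exactly one bad message and a crash that reproduces reliably whenever that message is in the batch.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_poison_item(items: Sequence[T],
                     crashes: Callable[[Sequence[T]], bool]) -> T:
    """Return the single item that makes `crashes` return True.

    Assumes exactly one bad item, and that crashes(batch) is True
    if and only if the batch contains it. Both names are hypothetical.
    """
    if not crashes(items):
        raise ValueError("the full set does not reproduce the crash")

    lo, hi = 0, len(items)          # the bad item is somewhere in items[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(items[lo:mid]):  # bad item is in the first half
            hi = mid
        else:                       # otherwise it must be in the second half
            lo = mid
    return items[lo]
```

Each round halves the pile, so even a spool of a million messages takes about twenty test runs. When every test run means restarting a mail server and waiting for it to fall over, that still adds up.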

This is several days into an outage, mind you.

Email is examined, and it turns out to have a really long subject line, like thousands of characters, almost all spaces. Some experimentation shows that once you hit a subject line of 1K or so in length, MailWorks takes a dive. (Ah yes, I saw that light bulb go off over your head.) And if you have a cluster, when one server crashes, the next one dutifully takes over mail processing, until it hits that same message.
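
The usual defense is to bound the input before it ever reaches a fixed-size buffer. The sketch below shows the idea in Python. It says nothing about how MailWorks was written or what DEC's patch did: `sanitize_subject` is a hypothetical helper, and the 1024 limit is just my "1K or so" from above, not a documented number.

```python
MAX_SUBJECT_LEN = 1024  # assumed limit, taken from the "1K or so" in the story


def sanitize_subject(subject: str, max_len: int = MAX_SUBJECT_LEN) -> str:
    """Collapse runs of whitespace, then cap the length.

    Hypothetical helper: illustrates bounding untrusted input,
    not any real MailWorks behavior.
    """
    collapsed = " ".join(subject.split())  # thousands of spaces become none
    return collapsed[:max_len]             # never hand back more than max_len
```

With that in front of the mail store, a subject of "Budget" followed by five thousand spaces comes out as plain "Budget", and anything still too long gets cut off instead of taking the server down.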

Message is purged, and people can actually get back to work.

They track down the user who sent the killer email, to find out what the heck she was thinking. Turns out she was eating breakfast, and reading her email. A piece of Grape Nuts cereal lodged in her keyboard, and managed to hold the spacebar down. She still sent the email after that, but remembered having to dislodge the offending Grape Nut.

So an entire VMS MailWorks cluster got taken out for days by a piece of Grape Nuts. But that's not the punchline.

After DEC support had been largely useless for days and our guys had to more or less fix it themselves, we submitted a fix request. We didn't want this happening again. We were able to send a specific problem description, number of characters, sample email, the whole bit.

DEC's response was: Oh yeah, we know about that! Here, we've had a patch available for a while. Why weren't we (one of the largest MailWorks installations in the world) told about that? Oh, uh... you have to call with a problem description that indicates that patch is needed. OK, and when we DID call with a problem like that? And when you sent out a technician, why didn't he know? Why don't you publish the patch list? Uh... well...

And I believe that was my first practical introduction to buffer overflows and vendor patching.
