Thursday, March 20, 2008

The most expensive bug


1996 june 4th, the first european rocket Ariane V exploded 40 seconds after launch. The payload alone cost about US$370 million. The cause ? A bad cast in the software initiated a chain of dramatic errors and led to Ariane destruction.

The full report worths the reading, here is a summary.

The attitude of the rocket is given by an Inertial Reference System (IRS), which is a combination of gyro lasers and accelerometers. This critical piece of hardware sends a stream of data about position, height, speed and acceleration to the main computer, which controls the exhaust pipes and drives the rocket along its expected trajectory.

Ten years ago, on Ariane III, a software function performed pre-flight checks to test alignment of the IRS. This function was no longer used on Ariane IV, but still ran during take off. You know it's easier to leave harmless code than to remove it. This function used 8 variables, 3 of them were not correctly protected although it was not an issue, because the rocket trajectory remains in range of these 3 variables.

No surprise, this function was still working on Ariane V. Unfortunately, Ariane V trajectory was a bit different and now one of the variables, the horizontal velocity, casted from 64bits float to 16bits integer, went out of range and raised an uncaught exception.

So far, no big deal. A check function raised an exception. Let's forget the check function and resume the mission.

However, the assumption on Ariane design was that software is always right and hardware may fail. The software reported an error, interpreted as the SRI was out of order. Then the SRI was shut down.

That's probably the biggest mistake. A failing unit test is embarrassing enough, but doesn't always mean the software is out of business. In this case the SRI still delivered reliable information. Unplugged, it couldn't any more.

The backup SRI started providing replacement data, and was shut down 0.05 second later, because of the same bug. Once again the assumption "hardware may fail, software not" made the backup SRI totally useless in this case.

Without sensible guidance, the rocket was doomed. But to accelerate the disaster, the SRI modules started to send stack traces instead of normal data to the main computer. The computer interpreted the data just as if the rocket was upside down and went into an emergency half turn. It started to tear apart under the physical constraints and initiated self-destruction process.

The story is sad enough like that, no need to add that a suitable test or full simulation before the flight would have found the bug.

By chance, I knew one of the member of the investigation team. He told me something not in the final report: what greatly contributed to kill Ariane V is the absence of experimented computer scientists at top management. The sofware components were simply divided and individually conducted. A competent software supervisor with suitable power could have found one of the errors, and prevented the cast exception to eventually stop the delivery of correct IRS data.

But Ariane was a physicist toy, they didn't share with a software department...

PS: The lesson was positively received. Today Ariane V is very successful and crashed only once more in 37 flights.

No comments: