AS. Could you describe a training scenario on the
SMS that caused a problem for you?
Clemons. Yes--it was a "bad-news-good-news" situation.
In 1981, just before STS-2 was scheduled to take
off, some fuel was spilled on the vehicle and a number
of tiles fell off. The mission was therefore delayed for a
month or so. There wasn't much to do at the Cape, so
the crew came back to Houston to put in more time on
the SMS.
One of the abort simulations they chose to test is
called a "TransAtlantic abort," which supposes that the
crew can neither return to the launch site nor go into
orbit. The objective is to land in Spain after dumping
some fuel. The crew was about to go into this dump
sequence when all four of our flight computer machines
locked up and went "catatonic." Had this been
the real thing, the Shuttle would probably have had
difficulty landing. This kind of scenario could only occur
under a very specific and unlikely combination of
physical and aerodynamic conditions; but there it was:
Our machines all stopped. Our greatest fear had materialized--
a generic software problem.
We went off to look at the problem. The crew was
rather upset, and they went off to lunch.
AS. And contemplated their future on the next mission?
Clemons. We contemplated our future too. We analyzed
the dump and determined what had happened.
Some software in all four machines had simultaneously
branched off into a place where there wasn't any code
to branch off into. This resulted in a short loop in the
operating system that was trying to field and to service
repeated interrupts. No applications were being run.
All the displays got a big X across them indicating that
they were not being serviced.
AS. What does that indicate?
Macina. The display units are designed to display a
large X whenever the I/O traffic between the PASS
computers and the display is interrupted.
Clemons. We pulled four or five of our best people
together, and they spent two days trying to understand
what had happened. It was a very subtle problem.
We started outside the module with the bad branch
and worked our way backward until we found the code
that was responsible. The module at fault was a multipurpose
piece of code that could be used to dump fuel
at several points of the trajectory. In this particular
case, it had been invoked the first time during ascent,
had gone through part of its process, and was then
stopped by the crew. It had stopped properly. Later on,
it was invoked again from a different point in the software,
when it was supposed to open the tanks and
dump some additional fuel. There were some counters
in the code, however, that had not been reinitialized.
The module restarted, thinking it was on its first pass.
One variable that was not reinitialized was a counter
that was being used as the basis for a GOTO. The
code was expecting this counter to have a value between
1 and X, say, but because the counter was not
reinitialized, it started out with a high value. Eventually
the code encountered a value beyond the expected
range, say X + 1, which caused it to branch out
of its logic. It was an "uncomputed" GOTO. Until we
realized that the code had been called a second time,
we couldn't figure out how the counter could return a
value so high.
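[For illustration, here is a rough sketch of that failure mode in C. The flight software itself was written in HAL/S, and every name and value below is hypothetical; a switch statement stands in for the GOTO computed from the stale counter.]

    #include <stdio.h>

    /* Counter kept in the module between invocations; it was never
       reset when the crew stopped the first dump, so the second
       invocation resumed from a stale, too-high value. */
    static int pass_counter = 0;

    void fuel_dump_pass(void)
    {
        pass_counter++;

        /* Stands in for the GOTO computed from the counter. */
        switch (pass_counter) {
        case 1: printf("arm dump valves\n");   break;
        case 2: printf("open dump valves\n");  break;
        case 3: printf("close dump valves\n"); break;
        default:
            /* Counter beyond the expected 1..3 range: the analogue of
               branching into code that isn't there. */
            printf("branch out of range: %d\n", pass_counter);
            break;
        }
    }

    int main(void)
    {
        fuel_dump_pass();    /* first invocation, started during ascent...  */
        fuel_dump_pass();    /* ...and stopped by the crew after two passes */

        fuel_dump_pass();    /* second invocation: counter starts at 3, not 1 */
        fuel_dump_pass();    /* this pass falls outside the expected range    */
        return 0;
    }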
We have always been careful to analyze our processes
whenever we've done something that's let a discrepancy
get out. We are, after all, supposed to deliver
error-free code. We noticed that this discrepancy resembled
three or four previous ones we had seen in
more benign conditions in other code modules. In these
earlier cases, the code had always involved a module
that took more than one pass to finish processing. These
modules had all been interrupted and didn't work correctly
when they were restarted. An example is the
opening of the Shuttle vent doors. A module initially
executes commands to open these doors and then
passes. A second pass checks to see if the doors actually
did open. A third pass checks how much time has elapsed
or whether it has received a signal to close the
doors again, etc. Important status is maintained in the
module between passes.
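[Again for illustration only, a rough C sketch of such a multi-pass module. The phases, names, and signals below are hypothetical; the point is simply that status carried between passes must be reinitialized before the module is reused.]

    #include <stdbool.h>
    #include <stdio.h>

    /* Phase of the vent-door module, carried between passes. */
    enum vent_phase { CMD_OPEN, VERIFY_OPEN, MONITOR };

    static enum vent_phase phase = CMD_OPEN;
    static int open_passes = 0;            /* how long the doors have been open */

    void vent_door_pass(bool doors_open, bool close_signal)
    {
        switch (phase) {
        case CMD_OPEN:                     /* first pass: command the doors open */
            printf("commanding vent doors open\n");
            phase = VERIFY_OPEN;
            break;
        case VERIFY_OPEN:                  /* second pass: did they actually open? */
            if (doors_open)
                phase = MONITOR;
            break;
        case MONITOR:                      /* later passes: track time, watch for close */
            open_passes++;
            if (close_signal) {
                printf("commanding vent doors closed\n");
                phase = CMD_OPEN;          /* this reinitialization is exactly the  */
                open_passes = 0;           /* step that is easy to miss when such a */
            }                              /* module is interrupted and restarted   */
            break;
        }
    }

    int main(void)
    {
        vent_door_pass(false, false);      /* pass 1: open command issued  */
        vent_door_pass(true,  false);      /* pass 2: doors confirmed open */
        vent_door_pass(true,  true);       /* pass 3: close signal handled */
        return 0;
    }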
AS. Isn't flight control multipass?
Clemons. Yes, in a broad sense. But every pass
through flight control looks like every other. We go in
and sample data, and based on that data, we make
some decision and take action. We don't wait for any
set number of passes through flight control to occur.
For the STS-2 problem, we took three of our people,
all relatively fresh from school, gave them these discrepancy
reports (DRs) from similar problems, and
asked for help. We were looking for a way to analyze
modules that had these multiple-pass characteristics
systematically. After working for about a week and a
half, they developed a list of seven questions that they
felt would have a high probability of trapping these
kinds of problems. To test the questions, we constructed
a simple experiment: We asked a random
group of analysts and programmers to analyze a handful
of modules, some with this type of discrepancy,
some without. They found every one of the problems
and gave us several false alarms into the bargain. We
were confident they had found everything.
We then called everybody in our organization together
and presented these results. We asked them to
use these seven questions to "debug" all of our modules,
and ended up finding about 35 more potential
problems, which we turned into potential DRs. In many
instances, we had to go outside IBM to find out whether
these discrepancies could really occur. The final result
was a total of 17 real discrepancy reports. Of those,
only one would have had a serious effect.
It turned out that this one problem originated during
a sequence of events that occurred during countdown.
A process was invoked that could be interrupted if
there was a launch hold. The only way it would be
reset to its correct initialization values was if a signal
was sent from the ground when the launch process was
restarted. We incorrectly assumed that this signal was
always sent. Had we not found this problem, we would
have lost safety checking on the solid rocket boosters
during ascent. We patched this one for STS-2 right
away.
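[One more illustrative C sketch, with an invented mechanism, of how a reset that depends on an external signal can silently fail to happen; the actual countdown sequencing was of course far more involved.]

    #include <stdio.h>

    /* Work already completed in this count; the correct initial value is
       restored only by a restart signal from the ground after a hold. */
    static int steps_completed = 0;

    void ground_restart_signal(void)    /* the reset we wrongly assumed was always sent */
    {
        steps_completed = 0;
    }

    void countdown_pass(void)
    {
        if (steps_completed == 0) {
            /* Initialization-time work, including arming the SRB safety
               checking, happens only when starting from a clean state. */
            printf("arming SRB safety checking\n");
        }
        steps_completed++;
        /* ... remaining countdown sequencing ... */
    }

    int main(void)
    {
        countdown_pass();               /* original count: clean start, checks armed */
        /* a launch hold occurs here, and the count is later restarted
           without ground_restart_signal() ever being called */
        countdown_pass();               /* stale state: the arming step is skipped */
        return 0;
    }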
In retrospect, we took a very bad situation and
turned it into something of a success story. We felt very
good about it. This was the first time we'd been able to
analyze this kind of error systematically. It's one thing
to find logic errors, but in a system as complex as this,
there are a lot of things that are difficult to test for.
Despite a veritable ocean of test cases, the combination
of requirements acting in concert under certain specific
conditions is very difficult to identify, let alone test.
There's a need for more appropriate kinds of analysis.