Software Testing - when a computer glitch kills - race condition error example

Written by Otek, 3 years ago

Hello :)

I'm glad that you enjoyed my series about Software Testing :) If you haven't read the first part, or want to know why I am qualified to write about testing, check that post: https://read.cash/@Otek/software-testing-part-01-it-is-worth-becoming-a-software-tester-2f2c0bc9.

This time I want to cover a type of software/hardware error called a race condition, using one of the most tragic software errors in history as the example: the Therac-25 case.

Therac-25

The Therac-25 was a radiotherapy machine for cancer treatment, used in the 1980s and produced by Atomic Energy of Canada Limited (AECL). It was the successor to the Therac-6 and Therac-20 (those two were produced in cooperation with the French company CGR).

Between 1985 and 1987 there were at least six cases in which patients were given massive overdoses of radiation. At least three of those patients died as a result. What happened?

AECL denies


The first accident happened in 1985. The patient lost her breast and the feeling in her hand, and it turned out that the machine had administered about 100 times more radiation than it should have. However, AECL said it was impossible for the fault to be on the machine's side, so no action was taken.

In the same year, another Therac-25 malfunctioned and delivered much more radiation than the machine operator had ordered. Three months later, the patient who underwent the procedure died due to complications of irradiation.

AECL denied responsibility for a long time, insisting that there was no possibility that the Therac-25 would get the doses wrong, despite evidence to the contrary. Meanwhile, a few more people were burned, and the case was brought to court.

Investigation

What was the result of the court investigation?

The main issue was a race condition, also known as a race hazard. It happens when the correct result depends on some steps being carried out in a particular order and, for some reason (usually timing), that order is not kept.

Let's take a simple example of refueling a car. We have 3 simple steps:

  1. Open fuel flap

  2. Pour fuel into the tank

  3. Close fuel flap

What happens when you try to perform step 3 before step 2 has finished? You will break your fuel flap or spill fuel around the car. That can occur if step 3 is set up to start 60 seconds after step 2 starts: if for some reason step 2 takes longer, step 3 will start anyway, even though step 2 isn't finished.
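
Here is a small Python sketch of the refueling example, just to make the idea concrete. It is only an illustration on a simulated clock: the function names (`refuel_timeline`, `is_safe`) and the numbers are my own assumptions. It compares closing the flap on a fixed timer with closing it only when pouring reports that it has finished.

```python
def refuel_timeline(pour_seconds, close_timer=None):
    """Return the order of refueling events on a simulated clock.

    close_timer=60 models the fragile setup: close the flap a fixed number
    of seconds after pouring starts, no matter what.
    close_timer=None models the safe setup: close the flap only when
    pouring has actually finished.
    """
    events = [
        (0, "open flap"),
        (0, "start pouring"),
        (pour_seconds, "finish pouring"),
    ]
    close_at = pour_seconds if close_timer is None else close_timer
    events.append((close_at, "close flap"))
    # sorted() is stable, so events with the same timestamp keep list order.
    return [name for _, name in sorted(events, key=lambda event: event[0])]


def is_safe(timeline):
    """The flap may only close after pouring has finished."""
    return timeline.index("finish pouring") < timeline.index("close flap")


fast = refuel_timeline(pour_seconds=45, close_timer=60)   # timer happens to work
slow = refuel_timeline(pour_seconds=90, close_timer=60)   # timer fires too early
waits = refuel_timeline(pour_seconds=90)                  # waits for completion

print(is_safe(fast), is_safe(slow), is_safe(waits))  # True False True
```

The timer version is not wrong every time, and that is what makes this kind of bug hard to catch: it only fails when pouring is slower than someone assumed.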

Something similar happened in the Therac-25 case. The machine operators were so experienced in operating this device that they clicked through and set up the next steps before the machine had saved the earlier ones. We can say that in some cases the human was faster than the machine.
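
To show what "the human was faster than the machine" can look like in code, here is a toy model. It is NOT the real Therac-25 software: the class `ToyMachine`, its methods and the "mode A" / "mode B" values are all made up for this sketch. The assumption is that the machine copies the settings when a slow setup begins, so an edit made during the setup changes the screen but never reaches the hardware.

```python
class ToyMachine:
    """Toy model of 'operator faster than machine'. Not real Therac-25 code."""

    def __init__(self):
        self.screen = {}        # settings the operator sees and edits
        self.hardware = {}      # settings the device is physically set to
        self._pending = None    # copy taken when a slow setup begins

    def enter(self, mode, dose):
        """Operator types settings; only the screen changes."""
        self.screen = {"mode": mode, "dose": dose}

    def begin_setup(self):
        """Machine starts a slow setup from a copy of the current screen."""
        self._pending = dict(self.screen)

    def finish_setup(self):
        """Setup completes with whatever was copied when it began."""
        self.hardware = self._pending
        self._pending = None

    def fire_unchecked(self):
        """Buggy: trusts that the hardware matches the screen."""
        return self.hardware

    def fire_checked(self):
        """Safer: refuse to fire during setup or if screen and hardware differ."""
        if self._pending is not None or self.hardware != self.screen:
            raise RuntimeError("settings changed during setup, run setup again")
        return self.hardware


machine = ToyMachine()
machine.enter("mode A", 200)
machine.begin_setup()            # slow step starts
machine.enter("mode B", 200)     # fast operator corrects the entry meanwhile
machine.finish_setup()           # setup ends using the stale copy

print(machine.screen["mode"])             # mode B
print(machine.fire_unchecked()["mode"])   # mode A

try:
    machine.fire_checked()
except RuntimeError as err:
    print("blocked:", err)
```

The screen and the hardware now disagree, and only the checked version notices. A slower operator would never trigger this, which is probably why this kind of bug can survive a long time.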

Why did it happen?

Of course, errors in software aren't anything new, but why wasn't this one found before the Therac-25 was put into use in hospitals?

Researchers who investigated the accidents found that the code was of poor quality, and that AECL made some really bad decisions and didn't care much about quality:

  • AECL had never tested the Therac-25 as a combination of software and hardware until it was put together at the hospital.

  • The software for a machine worth over $1 million was written by one person.

  • The hardware provided no way for the software to verify that sensors were working correctly.

  • AECL did not evaluate the software design while evaluating how the machine may deliver the expected results and what failure modes might exist, instead focusing solely on the hardware and claiming that the software was bug-free.
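
The point about sensors can be sketched in a few lines. This is only my illustration of a cross-check, not how any real machine does it: the function `beam_allowed`, the two-sensor setup and the "in" / "out" values are assumptions. The idea is that the software should not trust a single reading; it allows the beam only if two independent sensors agree with each other and with what the software commanded.

```python
def beam_allowed(commanded, sensor_a, sensor_b):
    """Permit the beam only if two independent sensors agree with each
    other and with the position the software commanded.

    All three arguments are plain strings such as "in" or "out"; the names
    and the two-sensor setup are assumptions for this sketch.
    """
    return sensor_a == sensor_b == commanded


print(beam_allowed("in", "in", "in"))    # True: everything agrees
print(beam_allowed("in", "in", "out"))   # False: sensors disagree, one may be broken
print(beam_allowed("in", "out", "out"))  # False: hardware is not where software asked
```

When anything disagrees, the safe answer is "no beam", even if that means an annoyed operator.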

Those are just a few examples of how bad AECL's approach to quality control was. In a sad way, they also show how important software testing is.

I hope this example of a race condition error was interesting :) See you in the next episodes :)
