MSL Sol-200 Anomaly - we nearly lost Curiosity

The Rocketry Forum

Winston

This was a MUCH more serious problem than I had previously thought based on NASA's claims at the time. The nail-biting severity was revealed to me in a recently aired Mars documentary. The text below does not mention the most nail-biting part: because of the fault, they weren't at all sure that Curiosity could even be commanded manually to switch computers, and the response indicating that it had switched came 3.5 minutes later than it normally would have.

https://llis.nasa.gov/lesson/11201

Excerpt:

Abstract

Six months after landing on Mars, uncorrectable errors in the NAND flash memory led to an inability of the Mars Science Laboratory (MSL) prime computer to turn off for its normal recharge session. Ground controllers commanded a swap to the backup computer, leaving the MSL rover with single-string avionics of questionable reliability prior to a Mars solar conjunction. Recovery from the anomaly was possible because of system and mission design features; nine recommendations are provided to mitigate the risk to future missions.

Driving Event

MSL has two lithium ion batteries that are recharged several times per day. These batteries enable the Curiosity rover’s power subsystem to meet the peak power demands of Rover activities when the demand temporarily exceeds the onboard multi-mission radioisotope thermoelectric generator (MMRTG) steady output level of ~100 watts. The flight computers (labeled “RCEs” in Figure 1) are always shut down prior to these recharge cycles.

Six months after landing on Mars (i.e., Mars sol-200), telemetry reported uncorrectable errors in the NAND flash memory (Reference (1)). Analysis revealed that several flight software (FSW) tasks had hung up, leading to an inability of the prime computer (Rover Compute Element (RCE) ‘A’) to turn off for its normal recharge session. Normally, fault protection would intervene: a watchdog timer would count down to zero and trigger a computer reboot. Instead, the watchdog timers were being reset. This could inexorably lead to a power brown-out of the Rover in three to six days: such loss of commandability that leaves the Rover discharging is a potentially mission-catastrophic event.

Within 16 hours of the initial error message, mission controllers at the NASA/Caltech Jet Propulsion Laboratory (JPL) bypassed the FSW and commanded a swap from RCE-A to the backup computer (RCE-B). Tones (i.e., “signals”) were subsequently received from the Rover confirming that the “backup” string had become “prime” and had entered safe mode. Information that the new prime computer gathered from the failed computer indicated that errors in the FSW had exacerbated a hardware fault in the flash memory. The MSL Flight Team faced a situation where the Rover was effectively left with “single string” avionics only 35 days prior to the Mars solar conjunction (when the spacecraft would not be commandable for 25 sols). Also, since the same FSW and flash memory were present in the new prime computer (RCE-B), the remaining string was of questionable reliability.

Failure investigation indicated that a single chip in the flash memory array was generating errors during erase cycles, likely due to a connectivity problem on the circuit board, or due to infant mortality of the commercial part. (Pre-flight testing had only erased the NAND ~12 times, as compared to an additional 38 erases (Reference (2)) after launch. The NAND part should have a life of 100,000 cycles.) Spacecraft functionality was recovered by segregating the bad flash memory; a direct hardware reset then rebooted RCE-A to operate with a half-size flash file system. (Because the data storage volume was sized with substantial margin, the loss of half the memory does not impact the mission.) Also, an additional (maximum up-time) watchdog timer was added to the flight software to strengthen fault protection. However, JPL would have been unable to diagnose the problem were it not for an avionics architecture that allowed the non-prime computer to be powered and providing telemetry on its “health” even when the (RAM-based) FSW was not running on it.
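To illustrate why an ordinary watchdog did not help here and what a "maximum up-time" watchdog adds, here is a minimal Python sketch. It is not MSL flight software: the class names, the `simulate` helper, and the timeout values (5 and 20 ticks) are all hypothetical, and it assumes the fault looks like "some still-running code keeps resetting the ordinary watchdog while the computer never shuts down," which is only my reading of the excerpt.

```python
class Watchdog:
    """Ordinary watchdog: forces a reboot unless it is reset ("kicked") in time."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.remaining = timeout

    def kick(self):
        # Any code that is still running can reset the countdown.
        self.remaining = self.timeout

    def tick(self):
        self.remaining -= 1
        return self.remaining <= 0  # True means "force a reboot"


class MaxUptimeWatchdog:
    """Hypothetical max up-time watchdog: has no kick(), so it always expires."""

    def __init__(self, max_uptime):
        self.remaining = max_uptime

    def tick(self):
        self.remaining -= 1
        return self.remaining <= 0


def simulate(steps, use_max_uptime):
    """Return the step at which a reboot is forced, or None if it never is.

    Assumed fault model: the watchdog keeps being reset every step even
    though the computer is stuck and never shuts down on its own.
    """
    wd = Watchdog(timeout=5)
    max_wd = MaxUptimeWatchdog(max_uptime=20) if use_max_uptime else None
    for step in range(1, steps + 1):
        wd.kick()  # the countdown is reset despite the hang
        if wd.tick():
            return step
        if max_wd is not None and max_wd.tick():
            return step
    return None


print(simulate(100, use_max_uptime=False))  # ordinary watchdog alone
print(simulate(100, use_max_uptime=True))   # with the up-time limit added
```

In this toy model the ordinary watchdog never fires because it is reset on every step, while the up-time limit forces a reboot regardless of what the software is doing. The real design trade-offs (how long the limit should be, what happens during critical activities) are not covered by the excerpt.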
 
I look at some of the known faults that have occurred in various probes and I sometimes wonder if the folks who program and test these devices are competent.
 
Can one be competent without experience?

I guess that depends on how the word "competent" is applied to the "experience" a given situation calls for.

One can be the best airplane test pilot in the world. Drop him or her into a craft unlike anything they have flown before, say a jet pilot into a helicopter, or an F-16 pilot into a Harrier, and any accidents or mishaps that ensue would far more likely be due to their lack of experience flying THAT kind of craft than to not being a competent pilot.

Of course in that situation, to split hairs, lack of experience in flying such a craft would imply lack of competency in flying such a craft...... even when the pilot is highly competent in other vehicles.
 
I look at some of the known faults that have occurred in various probes and I sometimes wonder if the folks who program and test these devices are competent.
These missions are so INSANELY complex that I'm absolutely amazed that they can accomplish what they do. The sky crane landing technique and the entire guided EDL sequence, the "7 minutes of terror," had me on the edge of my seat and, when successful, had me just as ebullient as the staff at JPL. I never tire of watching the landing replay.

[video=youtube;Ki_Af_o9Q9s]https://www.youtube.com/watch?v=Ki_Af_o9Q9s[/video]

[video=youtube;wnG-rFFpP8A]https://www.youtube.com/watch?v=wnG-rFFpP8A[/video]

[video=youtube;gZX5GRPnd4U]https://www.youtube.com/watch?v=gZX5GRPnd4U[/video]
 
I am not arguing that these folks are not in general very intelligent, nor am I saying space flight is in any way, shape, or form a simple thing.

Still, some things that have gone wrong amaze me. Let's think along the lines of doing calculations for orbital corrections in the wrong units.

Let's think about the original flaw in Hubble's main mirror.
 
These missions are so INSANELY complex that I'm absolutely amazed that they can accomplish what they do. The sky crane landing technique and the entire guided EDL sequence, the "7 minutes of terror," had me on the edge of my seat and, when successful, had me just as ebullient as the staff at JPL. I never tire of watching the landing replay....
My wife and I watched it live, streaming on the computer connected to our TV. The consensus was that they would lose signal during the descent and would not really know what happened until after it was over. (Which was true anyway due to the radio lag.) I couldn't help but record it with my iPhone by pointing it at the screen. As the replay shows, they had telemetry all the way down. When they announce that "skycrane has started" you can hear me gasp, since I thought it had such a small chance of succeeding. My wife and I were jumping up and down as much as the folks from NASA were.

Really one of the great successes in the American space program.


Tony
 