Winston
This was a MUCH more serious problem than I had previously thought based on NASA's claims at the time. The nail-biting severity was revealed to me in a recently aired Mars documentary. The text below does not mention the most nail-biting part: because of the fault, they weren't at all sure that Curiosity could even be commanded manually to switch computers, and the response indicating that it had switched came 3.5 minutes later than it normally would have.
https://llis.nasa.gov/lesson/11201
Excerpt:
Abstract
Six months after landing on Mars, uncorrectable errors in the NAND flash memory led to an inability of the Mars Science Laboratory (MSL) prime computer to turn off for its normal recharge session. Ground controllers commanded a swap to the backup computer, leaving the MSL rover with single-string avionics of questionable reliability prior to a Mars solar conjunction. Recovery from the anomaly was possible because of system and mission design features; nine recommendations are provided to mitigate the risk to future missions.
Driving Event
MSL has two lithium ion batteries that are recharged several times per day. These batteries enable the Curiosity rover’s power subsystem to meet the peak power demands of Rover activities when the demand temporarily exceeds the onboard multi-mission radioisotope thermoelectric generator (MMRTG) steady output level of ~100 watts. The flight computers (labeled “RCEs” in Figure 1) are always shut down prior to these recharge cycles.
Six months after landing on Mars (i.e., Mars sol-200), telemetry reported uncorrectable errors in the NAND flash memory (Reference (1)). Analysis revealed that several flight software (FSW) tasks had hung up, leading to an inability of the prime computer (Rover Compute Element (RCE) ‘A’) to turn off for its normal recharge session. Normally, fault protection would intervene: a watchdog timer would count down to zero and trigger a computer reboot. Instead, the watchdog timers were being reset. This could inexorably lead to a power brown-out of the Rover in three to six days: such loss of commandability that leaves the Rover discharging is a potentially mission-catastrophic event.
Within 16 hours of the initial error message, mission controllers at the NASA/Caltech Jet Propulsion Laboratory (JPL) bypassed the FSW and commanded a swap from RCE-A to the backup computer (RCE-B). Tones (i.e., “signals”) were subsequently received from the Rover confirming that the “backup” string had become “prime” and had entered safe mode. Information that the new prime computer gathered from the failed computer indicated that errors in the FSW had exacerbated a hardware fault in the flash memory. The MSL Flight Team faced a situation where the Rover was effectively left with “single string” avionics only 35 days prior to the Mars solar conjunction (when the spacecraft would not be commandable for 25 sols). Also, since the same FSW and flash memory were present in the new prime computer (RCE-B), the remaining string was of questionable reliability.
Failure investigation indicated that a single chip in the flash memory array was generating errors during erase cycles, likely due to a connectivity problem on the circuit board, or due to infant mortality of the commercial part. (Pre-flight testing had only erased the NAND ~12 times, as compared to an additional 38 erases (Reference (2)) after launch. The NAND part should have a life of 100,000 cycles.) Spacecraft functionality was recovered by segregating the bad flash memory; a direct hardware reset then rebooted RCE-A to operate with a half-size flash file system. (Because the data storage volume was sized with substantial margin, the loss of half the memory does not impact the mission.) Also, an additional (maximum up-time) watchdog timer was added to the flight software to strengthen fault protection. However, JPL would have been unable to diagnose the problem were it not for an avionics architecture that allowed the non-prime computer to be powered and providing telemetry on its “health” even when the (RAM-based) FSW was not running on it.
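For anyone wondering why a "maximum up-time" watchdog helps when the ordinary one didn't, here is a toy sketch in Python. It is only an illustration of the general idea, not the actual MSL flight software: the class names, the tick-based timing, and the numbers are all made up for the example. It assumes the failure mode described in the excerpt, where something keeps resetting the ordinary watchdog even though the real work has hung, so that timer never reaches zero. A second timer that nothing can reset fires anyway once the computer has been up too long.

```python
class Watchdog:
    """Ordinary countdown watchdog: expires unless it is kicked in time."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.remaining = timeout

    def kick(self):
        # Resetting the countdown is what healthy software does periodically.
        self.remaining = self.timeout

    def tick(self):
        # Advance one time step; return True if the timer has expired.
        self.remaining -= 1
        return self.remaining <= 0


class MaxUptimeWatchdog:
    """Hypothetical max-up-time timer: has no kick(), so it cannot be reset."""

    def __init__(self, max_uptime):
        self.max_uptime = max_uptime
        self.uptime = 0

    def tick(self):
        # Advance one time step; return True once the up-time limit is hit.
        self.uptime += 1
        return self.uptime >= self.max_uptime


def simulate(steps, timeout, max_uptime=None):
    """Simulate hung software that still kicks the ordinary watchdog.

    Returns (name_of_timer_that_fired, step), or (None, steps) if no
    timer fired within the simulated window.
    """
    wd = Watchdog(timeout)
    cap = MaxUptimeWatchdog(max_uptime) if max_uptime is not None else None
    for t in range(1, steps + 1):
        wd.kick()  # assumed fault: the kick keeps arriving despite the hang
        if wd.tick():
            return ("watchdog", t)
        if cap is not None and cap.tick():
            return ("max_uptime", t)
    return (None, steps)


if __name__ == "__main__":
    print(simulate(100, timeout=5))                 # ordinary watchdog only
    print(simulate(100, timeout=5, max_uptime=20))  # with the up-time cap
```

With only the ordinary watchdog, the simulated computer never reboots; with the up-time cap added, a reset is forced at the limit no matter how often the first timer gets kicked.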