Bits of corruption

The Q supercomputer at Los Alamos National Laboratory. Q was once the world’s second-fastest supercomputer and was tested for “fatal soft errors,” memory faults that led to node crashes.

“Silent data corruption,” or SDC, sounds like a menacing malady. SDC, Los Alamos National Laboratory (LANL) reports, “occurs when a computing system produces incorrect result(s) without issuing a warning or error message.” Root causes can include temperature and voltage fluctuations, manufacturing residues, oxide breakdowns, electrostatic discharges and even subatomic particles such as cosmic rays.

Could key supercomputers at LANL and other Department of Energy national laboratories be afflicted with SDC? That question prompted a five-year LANL study, which indeed found some evidence for SDC. But, fortunately, not much.

SDC triggers none of the usual alarms or crashes, so it’s potentially the ultimate thief in the night for computer users. Certain circumstances can cause bits to change from zeros to ones, or from ones to zeros, a process known as “flipping.” If undetected, flipping can wreck a computation.
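To make the stakes of a single flipped bit concrete, here is a minimal Python sketch. It is an illustration only, not code from the LANL study; the helper name `flip_bit` is hypothetical, and the example assumes numbers stored as standard 64-bit IEEE 754 doubles. Depending on which bit flips, the same starting value changes imperceptibly, changes sign, or becomes infinite, and no error is raised in any case.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its 64-bit IEEE 754 encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the float's 8 bytes as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR with a one-bit mask turns a zero into a one, or a one into a zero.
    flipped = as_int ^ (1 << bit)
    # Reinterpret the modified integer as a float again.
    (result,) = struct.unpack("<d", struct.pack("<Q", flipped))
    return result


original = 1.0
tiny_change = flip_bit(original, 0)    # lowest fraction bit: 1.0000000000000002
sign_change = flip_bit(original, 63)   # sign bit: -1.0
huge_change = flip_bit(original, 62)   # top exponent bit: inf
```

The first result would be hard to spot in any output, which is what makes this kind of corruption "silent"; the last two would at least stand out to a careful reader of the results.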

Sarah Michalak, a staff member with LANL’s Statistical Science Group and lead author of a report on the SDC study, says SDC isn’t just a supercomputer problem but “has the potential to affect all computing resources, including cellphones, tablets, laptops, desktops, large-scale clusters and storage systems.”

Often, she says, SDC is “imperceptible or merely a nuisance if it is detected. For example, in graphical applications such as images, games, or movies a single incorrect pixel in an image is likely imperceptible to the viewer.”

But “SDC is of particular concern for high-performance computing platforms used for scientific computation,” she wrote in the report released at last year’s SC14 supercomputing conference. The study involved 12 different LANL high-performance computing (HPC) platforms.

Bearing massive numbers of computing nodes and complicated architectures, HPC platforms at national laboratories “may be used to support policy decisions,” Michalak says. “Thus, there is the possibility of a scientifically plausible incorrect result leading to a policy decision that is different from the one that would be made given a correct result.”

As far back as 1976, when Los Alamos received a six-month trial of a Cray-1 – an early supercomputer that was primitive by today’s standards – careful surveillance revealed a total of 152 bit flips in that device’s memory unit, a LANL report says.

SDC surfaced in the trade press in 1994 after a professor at Virginia’s Lynchburg College discovered incorrect data emanating from the Intel P5 processor’s floating-point unit.

Researchers have also detected it in a number of other settings, including a NASA long-term global observation satellite system, an Amazon cloud data storage system, and at CERN, the European Organization for Nuclear Research, which uses accelerators and massive data collection to probe the universe’s fundamental structure.

Study on Q

The first study Michalak contributed to in this area was a 2005 investigation of “higher-than-expected” numbers of single-node failures during early operations of the Q supercomputer, then rated as the world’s second fastest with a calculation rate of nearly 14 trillion operations per second.

Located at LANL and shared with Lawrence Livermore and Sandia national laboratories, Q was installed to help support stewardship of the nation’s nuclear weapons stockpile as advanced HPC simulations replaced live tests.

Q’s node failures were not silent. Instead, they were blamed on “fatal soft errors” – bit flips in its memory system that led to node crashes.

In a separate paper, Michalak and coauthors from Los Alamos and Hewlett-Packard – the system vendor – explained how large calculations in Q were divided up and routed to different processors. “When a single node fails, the entire job must be restarted from its last known state,” they reported. “Consequently, single-node failures can increase the runtimes of large calculations.”

The LANL-HP team localized failures to a memory cache – a component that stores data for quick processing – on the motherboard. The bit flips were detected as parity errors, meaning that the count of one bits in a stored word came out odd when it should have been even, or vice versa. The investigators hypothesized that the errors were most likely caused by cosmic ray-induced neutrons. Exposing Q’s circuitry to neutron beams generated at Los Alamos’ ICE House buttressed their hunch. (ICE stands for “irradiation of chips and electronics.”)
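The parity check described above can be sketched in a few lines of Python. This is a simplified illustration of even parity on an 8-bit word, not a model of Q's actual cache hardware; the function names and the sample word are hypothetical. It shows why a single flipped bit is caught and why a second flip in the same word would slip through.

```python
def parity(word: int) -> int:
    """Return 0 if a non-negative word has an even number of one bits, else 1."""
    return bin(word).count("1") % 2


def passes_check(word: int, stored_parity: int) -> bool:
    """Recompute the parity on read and compare it with the stored parity bit."""
    return parity(word) == stored_parity


# Write: store the word together with its parity bit.
stored_word = 0b10110010              # four one bits
stored_parity = parity(stored_word)   # 0

# Read after one bit has flipped: the count of ones is now odd.
one_flip = stored_word ^ 0b00000100
single_detected = not passes_check(one_flip, stored_parity)   # True

# Read after two bits have flipped: the count is even again.
two_flips = stored_word ^ 0b00001100
double_detected = not passes_check(two_flips, stored_parity)  # False
```

A parity bit can only report that something changed; it cannot say which bit flipped, so the data cannot be repaired in place. That is consistent with the errors on Q being "fatal": they were detected, and the node crashed rather than continuing with bad data.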

In this case, elevation was a factor, because Q is housed at about 7,500 feet, “where the cosmic-ray-induced neutrons that can lead to soft errors are roughly 6.4 times more prevalent compared to at sea level,” their paper said. Those results allowed managers to pre-plan for crash rates from fatal soft errors in Q and its successor supercomputers.

Michalak says Los Alamos was the right place to look for SDC. “LANL is among the few institutions that have both the large-scale systems required for field testing and the willingness to publicly share the resulting data to raise awareness of issues like SDC.”

So a large-scale search was performed between October 2007 and October 2012 by a team that included two Los Alamos computational scientists, two computer engineers, and Michalak, who served as lead investigator and statistician.

They reviewed more than a thousand node-years of computations and roughly 264 petabytes of data transfers performed by a dozen active or decommissioned LANL HPC platforms. At the end, they found a total of 2,541 SDC-type errors, almost all occurring repeatedly on a small number of nodes.

“A very low rate of incorrect results was detected by the test codes we ran on Los Alamos supercomputers,” Michalak says. “Similar testing would likely reveal correspondingly low rates on comparable systems elsewhere.”

But 2,541 was not zero. “There is strong awareness of the potential for SDC in the large-scale scientific computing community,” Michalak notes. “So most people performing such computations know they need to consider their results carefully.”