
Multitalented metric

Sandia National Laboratories’ (SNL) Sky Bridge, a Cray CS300-LC supercomputer that was ranked No. 161 on the latest TOP500 list. (Photo: SNL.)

Winning a marathon requires athletic ability. The same applies to a triathlon, but that event also rewards versatility.

Since the early 1990s, the supercomputing industry’s standard performance metric has been akin to tracking a marathoner’s performance. Used in the TOP500 rankings of computing systems, that standard, the High Performance Linpack (HPL), tallies floating-point operations per second (flops): how fast a computer can use one particular method to solve a dense system of millions of linear equations and report the results.
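As a rough illustration of how a flops rating is derived, the sketch below divides an operation count by wall-clock time. It assumes the count HPL credits for solving N dense equations, (2/3)N³ + (3/2)N², which depends only on N and not on the work a given machine actually performs. The function names are hypothetical, and the problem size and run time in the usage line are made-up inputs, not a measured result.

```python
def hpl_operation_count(n: int) -> float:
    """Operations credited for solving an n-by-n dense system: (2/3)n^3 + (3/2)n^2."""
    return (2.0 / 3.0) * n**3 + 1.5 * n**2


def flops_rate(n: int, seconds: float) -> float:
    """Floating-point operations per second for a solve of size n that took `seconds`."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return hpl_operation_count(n) / seconds


# Hypothetical run: one million equations solved in one hour.
print(f"{flops_rate(1_000_000, 3600.0) / 1e12:.1f} teraflops")  # prints "185.2 teraflops"
```

Because the credited count is fixed by N, two machines that finish the same solve in the same time receive the same rating, regardless of how their memory systems differ.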

In recent years, many high-performance computing (HPC) researchers have become concerned that HPL’s design overwhelmingly favors computers that perform floating-point operations at high rates. Some computer builders may overprovision their machines with processors, or design an architecture around a good HPL rating, without providing the memory system performance that other kinds of computation need.

A different, broad set of science and engineering applications, including models and simulations of automobile crashes, aerodynamics and oil recovery, depends on how well a computer orchestrates data movement from memory to the processor and back. These efforts all rely on solving differential equations, often in massive numbers.

In fewer than three years, a new computing metric has emerged to broaden the performance rubric and capture these nuances.

The HPL metric doesn’t heavily stress data movement and memory system performance, says Michael Heroux of Sandia National Laboratories’ Center for Computing Research. Heroux, working with Jack Dongarra (an original Linpack author) and Piotr Luszczek of the University of Tennessee, has developed a supercomputing benchmark based on a teaching code he created about 10 years ago for his students at St. John’s University in Minnesota. The High Performance Conjugate Gradients (HPCG) benchmark executes an algorithm distinct from the one HPL uses. Heroux wrote the original code to show students how parallel programming divides a problem and distributes the parts to individual processors, reducing the time to solution. It evolved into a proxy for much larger application programs and eventually into a new benchmark that has since run on 63 HPC systems worldwide.
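The idea behind that teaching code, splitting one problem into pieces that separate processors handle at the same time, can be sketched in a few lines. This is not Heroux’s code: the function names are hypothetical, and Python threads stand in for processors only to show how the work is divided (production HPC codes typically use MPI, and CPython threads will not actually speed up this arithmetic). Each worker computes a dot product over its own contiguous block of indices, and the partial results are then summed.

```python
from concurrent.futures import ThreadPoolExecutor


def block_ranges(n: int, p: int) -> list[tuple[int, int]]:
    """Split indices 0..n-1 into p contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n, p)
    ranges, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def partial_dot(x: list[float], y: list[float], lo: int, hi: int) -> float:
    """Work done by one 'processor': the dot product over its own block."""
    return sum(x[i] * y[i] for i in range(lo, hi))


def parallel_dot(x: list[float], y: list[float], p: int) -> float:
    """Hand each of p workers one block, then combine (reduce) the partial sums."""
    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = pool.map(lambda r: partial_dot(x, y, *r), block_ranges(len(x), p))
        return sum(parts)


print(block_ranges(10, 3))                      # [(0, 4), (4, 7), (7, 10)]
print(parallel_dot([1.0] * 10, [2.0] * 10, 3))  # 20.0
```

Contiguous blocks keep each worker’s data together in memory, which is the same reason real parallel codes favor this kind of decomposition.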

“If you have two computers that both can compute at a 10 petaflops per second rate,” Heroux says, referring to 10 quadrillion floating-point operations per second, “and one had a memory system that was twice as fast as the other one, HPL wouldn’t necessarily show that better performance because it doesn’t really concern itself with memory system performance.

“But HPCG would show that the machine with twice the memory system would also be twice as fast. HPCG is also sensitive to emerging types of concurrency on modern parallel processors. We’re running different kinds of computations, ones that are more sensitive to how the memory system performs. HPCG gives a nod to the triathlete over the marathon runner.”
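Heroux’s two-machine comparison can be reproduced with a simple roofline-style estimate, a standard performance model that is an assumption here, not part of HPCG itself. A kernel’s run time is taken as the larger of its compute time and its memory-transfer time. Both hypothetical machines peak at 10 petaflops and one has twice the memory bandwidth; the bandwidths, the workload size and the flops-per-byte intensities (100 for a dense, HPL-like kernel and 0.25 for a sparse, HPCG-like one) are illustrative round numbers.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    peak_flops: float     # floating-point operations per second
    mem_bandwidth: float  # bytes per second between memory and processor


def run_time(m: Machine, flops: float, flops_per_byte: float) -> float:
    """Roofline estimate: limited by compute or by memory traffic, whichever is slower."""
    bytes_moved = flops / flops_per_byte
    return max(flops / m.peak_flops, bytes_moved / m.mem_bandwidth)


slow_mem = Machine("A", peak_flops=10e15, mem_bandwidth=1e15)
fast_mem = Machine("B", peak_flops=10e15, mem_bandwidth=2e15)

work = 1e18  # total floating-point operations in the hypothetical job
for label, intensity in [("dense, HPL-like", 100.0), ("sparse, HPCG-like", 0.25)]:
    ta, tb = run_time(slow_mem, work, intensity), run_time(fast_mem, work, intensity)
    print(f"{label}: A {ta:.0f} s, B {tb:.0f} s, speedup {ta / tb:.1f}x")
# dense, HPL-like: A 100 s, B 100 s, speedup 1.0x
# sparse, HPCG-like: A 4000 s, B 2000 s, speedup 2.0x
```

The dense kernel is compute-bound on both machines, so doubling bandwidth changes nothing; the sparse kernel is memory-bound, so it runs twice as fast on the machine with the faster memory system.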

Like HPL, HPCG solves a large system of linear equations. But where HPL uses Gaussian elimination, which requires a fixed number of steps, HPCG uses an iterative approach: it starts with an initial guess at the problem’s answer and refines it.
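A small serial version of Gaussian elimination shows what “a fixed number of steps” means: for n equations there are always n - 1 elimination stages followed by back substitution, whatever the numbers in the matrix. HPL itself runs a blocked, distributed LU factorization, so treat this as a textbook sketch only; the function name is hypothetical and the 3-by-3 system in the usage line is an arbitrary example.

```python
def gaussian_solve(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a x = b directly by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy [A | b]
    for k in range(n - 1):  # exactly n - 1 elimination stages for any input
        pivot = max(range(k, n), key=lambda i: abs(m[i][k]))  # largest entry in column k
        m[k], m[pivot] = m[pivot], m[k]
        if m[k][k] == 0.0:
            raise ValueError("matrix is singular")
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution on the upper-triangular system
        if m[i][i] == 0.0:
            raise ValueError("matrix is singular")
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


print(gaussian_solve([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]], [8.0, -11.0, -3.0]))
# approximately [2.0, 3.0, -1.0]
```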

“For certain matrices, HPCG represents an algorithm that guarantees your next guess after the first one is closer to the right answer than the one you started with,” Heroux says. “Each iteration gets you closer to the right answer.”

It all happens in seconds, and the number of iterations HPCG needs varies with the problem’s dimension, or size. Even so, the algorithm can solve a huge system of up to a billion equations in just tens of iterations.
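The guarantee Heroux describes is a property of the conjugate gradient (CG) method: for symmetric positive-definite matrices, each iteration reduces the error measured in the matrix’s own “energy” norm, ||e||_A = sqrt(eᵀAe). The sketch below is plain, unpreconditioned textbook CG on a small tridiagonal test matrix chosen for illustration. HPCG’s actual solver adds preconditioning and runs on a far larger sparse 3D problem, and without a preconditioner the iteration count grows as the problem gets bigger. The known solution `x_true` is used only to measure the error, never to steer the solver.

```python
def matvec(a: list[list[float]], x: list[float]) -> list[float]:
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u: list[float], v: list[float]) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a, b, x_true, tol=1e-10, max_iter=500):
    """Unpreconditioned CG for symmetric positive-definite `a`.

    Returns the final iterate and the A-norm error after each iteration."""
    x = [0.0] * len(b)          # initial guess
    r = b[:]                    # residual b - A x for the zero guess
    p = r[:]                    # first search direction
    rr = dot(r, r)
    errors = []
    for _ in range(max_iter):
        if rr ** 0.5 < tol:     # residual small enough: stop
            break
        ap = matvec(a, p)
        alpha = rr / dot(p, ap)  # step that minimizes the A-norm error along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        e = [xi - ti for xi, ti in zip(x, x_true)]
        errors.append(max(dot(e, matvec(a, e)), 0.0) ** 0.5)  # max() guards round-off
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]  # next A-conjugate direction
        rr = rr_new
    return x, errors


n = 50
# Test matrix: 2 on the diagonal, -1 beside it (symmetric positive-definite).
lap = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
x_true = [1.0] * n
x, errs = conjugate_gradient(lap, matvec(lap, x_true), x_true)
print(len(errs), "iterations; error falls from", f"{errs[0]:.2e}", "to", f"{errs[-1]:.2e}")
```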

Benchmarks need to address two general problem patterns, which Heroux calls Type 1 and Type 2. Type 1, the only pattern HPL exercises, depends mainly on how fast a machine multiplies dense matrices. Type 2 depends far more on how fast data move from memory to the processor. HPCG addresses both types.
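One common way to quantify the difference between the two patterns is arithmetic intensity: floating-point operations performed per byte moved between memory and processor. The estimate below rests on idealized assumptions (8-byte values, 4-byte indices, each value moved only once, caches ignored, and a compressed sparse row layout with 27 nonzeros per row, the count for a 3D 27-point stencil); the function names are hypothetical. Under those assumptions, the dense multiply’s intensity grows with matrix size while the sparse matrix-vector product’s stays at a small constant, which is why the latter is bound by memory speed.

```python
BYTES_PER_DOUBLE = 8
BYTES_PER_INDEX = 4  # assumed 32-bit column indices and row pointers


def dense_matmul_intensity(n: int) -> float:
    """Type 1: C = A B with n-by-n dense matrices; 2n^3 flops over at least 3n^2 values moved."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * BYTES_PER_DOUBLE
    return flops / bytes_moved


def sparse_matvec_intensity(rows: int, nnz_per_row: int) -> float:
    """Type 2: y = A x with A in CSR form; each nonzero is used once for 2 flops."""
    nnz = rows * nnz_per_row
    flops = 2 * nnz
    bytes_moved = (nnz * (BYTES_PER_DOUBLE + BYTES_PER_INDEX)          # values + column indices
                   + rows * (2 * BYTES_PER_DOUBLE + BYTES_PER_INDEX))  # x, y, row pointers
    return flops / bytes_moved


for n in (100, 1_000, 10_000):
    print(f"dense n={n}: {dense_matmul_intensity(n):.1f} flops/byte")
print(f"sparse, 27 nonzeros/row: {sparse_matvec_intensity(1_000_000, 27):.3f} flops/byte")
```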

“What we’ve come up with doesn’t replace HPL but complements it,” he says. “HPL can still be a good indicator for certain systems’ performance in solving certain problems – in materials science, for instance. It is a relevant metric in some areas. I like to think of them as bookends on a performance spectrum.”

Before joining Sandia, Heroux worked for Cray Research from 1988 to 1998, and he has maintained many ties with the supercomputing industry over the years. Preserving such community-vendor relationships is a key part of the HPCG project. He has collaborated with Intel, IBM and other companies that have produced optimized versions of the code.

HPCG is now at version 3.0 and should stay that way for a long time, Heroux says.

“Starting with HPCG 1.0 just two-and-a-half years ago, we kept adding features based on excellent community feedback,” he adds. Version 2.4 was the first “where we started producing a list of machines that ran it and posted their results. We hope to run a lot more in time for the International Supercomputing Conference this June in Frankfurt, Germany.”

Tony Fitzpatrick