Turbocharging data

The Scientific Computing and Data Center at Brookhaven National Laboratory. (Photo: Brookhaven National Laboratory.)

High-performance computing (HPC) and concurrent software improvements are supercharging data processing, just as giant experimental facilities such as CERN’s Large Hadron Collider (LHC) particle accelerator and Brookhaven National Laboratory’s National Synchrotron Light Source II (NSLS-II) seek ways to sift mountains of information.

LHC’s ATLAS, which observes the highest-energy proton-proton collisions like those in the Higgs boson discovery, and imaging experiments at the NSLS-II generate “anywhere from several gigabytes to hundreds of petabytes in terms of data volume,” says Meifeng Lin, who leads the High Performance Computing Group for Brookhaven’s Computational Science Initiative (CSI).

These data typically must go through multiple processing stages to help scientists find meaningful physical results.

“It’s a perfect marriage between HPC and these big experiments – in the sense that HPC can dramatically improve the speed of data processing,” Lin says.

Today’s supercomputing systems have massively parallel architectures that simultaneously employ hundreds of thousands of multicore central processing units (CPUs), many-core CPUs and graphics processing units (GPUs), Lin says. “These architectures offer a lot of processing power, which was initially needed for large-scale simulations but is now also used” to quickly analyze large amounts of data.

But “there are many different HPC architectures, and for both CPUs and GPUs there are several levels of parallelism,” Lin says. “The complicated part is to harness the power of all of these different levels.”

GPUs offer thousands of processing units. Intricacies such as how data are stored in memory can affect their performance, Lin says.

“So far, GPUs don’t function stand-alone,” she says. “They’re attached to CPUs as accelerators. So a GPU program will launch from the CPU side and data initially get stored on CPU memory. For GPUs to process the data, you need to first transfer the data from CPU memory to GPU memory, although there are newer technologies that can circumvent this step.”

To get maximum performance, researchers must figure out how to move data – and how much to move – from CPU memory to GPUs. “Moving these data,” Lin notes, “incurs some computational overhead.” If the data transfer is large, “it requires quite a bit of time compared to the computational speed GPUs can offer.” Multiple issues “affect how you program your software and how much computational speedup you can get.”
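
A back-of-the-envelope model makes this trade-off concrete. The sketch below is illustrative only: the transfer and arithmetic rates are made-up round numbers, not measurements from any Brookhaven system, and the function names are hypothetical.

```python
def gpu_total_seconds(data_bytes, operations, link_bytes_per_s, gpu_ops_per_s):
    """Time to copy the data to GPU memory, then compute on it there."""
    transfer = data_bytes / link_bytes_per_s
    compute = operations / gpu_ops_per_s
    return transfer + compute


def cpu_total_seconds(operations, cpu_ops_per_s):
    """Time to compute on the CPU, where the data already sit."""
    return operations / cpu_ops_per_s


# Assumed, illustrative hardware figures (not measured values).
LINK = 10e9   # CPU-to-GPU transfer rate, bytes per second
GPU = 1e12    # GPU arithmetic rate, operations per second
CPU = 5e10    # CPU arithmetic rate, operations per second

data_bytes = 8e9  # 8 GB of input data

# Light workload: about 1 operation per byte moved.
light_ops = 1 * data_bytes
light_gpu = gpu_total_seconds(data_bytes, light_ops, LINK, GPU)
light_cpu = cpu_total_seconds(light_ops, CPU)

# Heavy workload: about 1,000 operations per byte moved.
heavy_ops = 1000 * data_bytes
heavy_gpu = gpu_total_seconds(data_bytes, heavy_ops, LINK, GPU)
heavy_cpu = cpu_total_seconds(heavy_ops, CPU)

print(f"light: GPU {light_gpu:.3f} s vs CPU {light_cpu:.3f} s")
print(f"heavy: GPU {heavy_gpu:.3f} s vs CPU {heavy_cpu:.3f} s")
```

With these assumed numbers, the light workload is slower on the GPU because the 0.8-second copy dominates, while the heavy workload runs roughly 18 times faster there. The more arithmetic performed per byte transferred, the better the GPU's speed offsets the cost of moving the data.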

Large-scale experiments such as ATLAS and NSLS-II have their own special data challenges that HPC can help solve.

Particle physics experiments can run a long time, accumulating loads of data. Algorithms must remove background noise from these troves so scientists can find the signals they seek. “To do this, they have a very large and complex software stack with millions of lines of code,” Lin says. “The signals the scientists want are usually very weak compared to the background noise they need to ignore. So their software undergoes a very rigorous validation and verification process, and changing it is a long and complicated process.”

In NSLS-II imaging experiments, scientists shine powerful X-rays through materials to obtain structural or physical information. “The software is relatively stand-alone and compact in size, so it’s easier for us to adapt it to make use of the HPC platform,” Lin says.

Lin cites ptychography, an X-ray microscopy technique that employs computation, as an example of how HPC can accelerate data processing. A computer program reconstructs physical sample information from X-ray scattering data gathered at the NSLS-II’s Hard X-ray Nanoprobe beamline. The code “did the job, but it took a very long time to reconstruct even one sample – up to 10 hours or days, depending on the imaging technique or algorithm used.”

Lin helped Brookhaven physicist Xiaojing Huang parallelize the computation-heavy ptychography reconstruction algorithm over multiple GPUs. “This,” Huang explains, “increases the calculation speed by more than a thousand times, which shortens the image recovery process from days to minutes. The real-time image processing capability is critical for providing online feedback and optimizing our scientific productivity.”

To achieve this speedup, Lin and her CSI and NSLS-II colleagues reprogrammed the code to make it run on multiple GPUs. It wasn’t easy. The original software was written for serial processing – running on a single CPU core.

“We want to be able to break the calculation into several independent pieces part of the time,” Lin says. “And we have to combine them so we’re solving the original big problem, not just individual pieces.”
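
The split-then-combine pattern Lin describes can be sketched in a few lines. This is a generic illustration that uses Python’s standard library and a simple sum of squares, not the ptychography code itself; the function names and the number of pieces are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, n_pieces):
    """Cut a list into n_pieces contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), n_pieces)
    chunks, start = [], 0
    for i in range(n_pieces):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(values[start:stop])
        start = stop
    return chunks


def partial_sum_of_squares(chunk):
    """Independent piece: needs only its own chunk of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_pieces=4):
    """Run the pieces concurrently, then combine the partial results."""
    chunks = split(values, n_pieces)
    with ThreadPoolExecutor(max_workers=n_pieces) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)  # the combine step restores the full answer


data = list(range(1000))
result = parallel_sum_of_squares(data)
serial = sum(x * x for x in data)
print(result == serial)
```

Threads keep the sketch short. In standard CPython they show the structure of the approach rather than deliver a real speedup for this kind of arithmetic; that gain comes from hardware such as GPUs running the independent pieces at the same time.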

One of the most difficult parts was finding the right model to let scientists write the GPU code without going into low-level programming descriptions. “A common programming model for GPUs is CUDA, which has a learning curve,” Lin says. “We wanted to provide the scientists with an interface that’s independent of architecture-specific programming and easy to maintain.”

The team tapped CuPy, an open-source project that lets users program GPUs in a language they already know and use in production. CuPy’s interface is compatible with that of NumPy, the core scientific computing library for Python, the language in which the code is written.
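
Because CuPy mirrors much of NumPy’s interface, the same function can often run with either library. The sketch below assumes that pattern and is not the group’s reconstruction code: `diffraction_intensity` is a hypothetical stand-in built on a two-dimensional Fourier transform, and the script falls back to NumPy when CuPy or a usable GPU is not available.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
    xp = cp  # arrays live in GPU memory
except Exception:
    xp = np  # CPU fallback that accepts the same calls


def diffraction_intensity(field, xp):
    """Squared magnitude of the 2D Fourier transform of a complex field.

    Hypothetical example function; `xp` is either numpy or cupy.
    """
    spectrum = xp.fft.fft2(field)
    return xp.abs(spectrum) ** 2


# A small uniform test field, created directly by the chosen library.
field = xp.ones((4, 4), dtype=xp.complex64)
intensity = diffraction_intensity(field, xp)

# float() copies a single value back to the host if it lives on the GPU.
peak = float(intensity[0, 0])
total = float(intensity.sum())
print(xp.__name__, peak, total)
```

Passing the array library in as `xp` keeps the scientific code free of vendor-specific calls, so the choice of hardware is made once, at the top of the script.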

One leading CuPy contributor is Leo Fang, a CSI assistant computational scientist who maintains the group’s ptychography software. CuPy and other open-source projects, he says, build “a critical, scalable infrastructure for our scientists and their applications.” The work “benefits the entire open-source community. With this capability, we shift the bottleneck of our end-to-end workflow elsewhere.”

Lin and her colleagues are making the software portable across various HPC architectures, such as GPUs from different vendors, to expand the user base. So far, they’ve run their code on NVIDIA and AMD GPUs as well as multicore CPUs. Fang hopes “our users can easily run the code, either on a workstation at home or in a cluster, without being tied to a specific hardware vendor. It’s cool to be able to do this with little to no refactoring in our code.”

Lin next wants to bring portable HPC solutions to particle physics experiments such as those at the LHC. The software for those experiments poses special challenges, and she and her colleagues are eager to take them on.

Sally Johnson

The author is science media manager at the Krell Institute.