Super-connected HPC

NERSC has been working with scientists and staff at the Linac Coherent Light Source (LCLS), its experimental halls seen here, on real-time data analysis for two LCLS experiments looking at the structure of the SARS-CoV-2 virus. (Photo: SLAC National Accelerator Laboratory.)

High-performance computing (HPC) is only as valuable as the science it produces.

To that end, a National Energy Research Scientific Computing Center (NERSC) project at Lawrence Berkeley National Laboratory has been expanding its reach through a superfacility – “an experimental facility connected to a networking facility connected to an HPC facility,” says Debbie Bard, group lead for the center’s Data Science Engagement Group and NERSC’s superfacility project.

But “simply transferring the data and then trying to run your code is not sufficient for a team to get their science done,” Bard explains. A superfacility must also provide analytical tools, databases, data portals and more. “The superfacility project is designed to address the hardware and all these other pieces of infrastructure, tools, technologies and policy decisions that must all come together to make this kind of workflow easy for a scientist.”

One of the superfacility project’s early triumphs involved work with the Linac Coherent Light Source at the SLAC National Accelerator Laboratory. “They send the data to us,” Bard says. “That data is analyzed on our compute systems and then the results are displayed on a web interface with very short turnaround – just a couple minutes. It allows the scientists to make real-time decisions about what to do next in their experiments.”

The first superfacility trial SLAC scientists ran focused on the structure of SARS-CoV-2, the virus behind the COVID-19 pandemic. “We are ever so proud of this,” Bard says. “They’re able to do really useful and important science using this infrastructure that we’ve set up.”

Today’s research pushes the capabilities of single installations, says Cory Snavely, NERSC Infrastructure Services Group lead and superfacility project deputy. “Historically, a lot of the computational science work or the instrumental and observational work was at a scale that could be done or at a complexity that could be done within the scope of one facility.” That’s not always possible with the size of today’s research projects.

“One-off, piecemeal support doesn’t scale, so we need to do something that’s coordinated,” Bard notes. That necessitates finding common ways to manage data from disparate projects. Bard and Snavely work with eight diverse science teams, each of which needs different things from the superfacility model. “Their computing is different,” Bard says. “Their science is different. Their problems are different.” Nonetheless, the superfacility project’s goal is to build one toolset that meets all the teams’ needs and more.

Snavely says such packaging is possible because similar patterns emerge across different disciplines and types of facilities. One of those patterns is large project size and data-driven science. “That implies a number of things, like the need for high-performance data transfer, petascale compute capabilities, real-time job execution and automation.”

Bard notes that the teams are “working with pretty much every kind of experiment and observational facility,” from astronomical observations to specialized detectors to genomics. Building a system that helps such diverse users depends on finding similarities. “The actual motif of their workflows can be quite similar, even if the science they’re doing is very, very different. Their needs from us have something in common.”

Now two years into the three-year project, the team sees more clearly than ever how much timing matters, especially real-time turnaround.

“A lot of these instruments operate on schedules,” Snavely says, which means a research team needs everything to work during its allocated time. “The team’s campaign is probably not going to just be, for example, one shot of a light source and one observation. They’ll need multiple iterations, and they’ll need to tune observational parameters.” Thus an experiment might entail dozens or hundreds of runs, all to be completed in a fixed time.

So the superfacility model must be resilient. The computational and support pieces must all be ready – nearly all of the time. That’s a lot of equipment and software to maintain.

But it will be worth it, Bard says. “Our aim is that our science engagements will be able to demonstrate automated data-analysis pipelines, taking data from a remote facility and being able to analyze them on our systems at NERSC at large scale – without routine human intervention.” She adds that “the goal of our project is to be able to demonstrate automated data analysis pipelines across facilities.”

Accomplishing the Berkeley superfacility project’s goals requires scaling. The project must automatically handle growing demand on NERSC services.

To build in the required resiliency, “both facility and system architecture improvements are needed to help keep data and compute systems more available for more of the time and to have maintenance be less disruptive,” Snavely says.

Those improvements not only increase resiliency but also provide a foundation for a range of capabilities, including data transfer, discovery and sharing. Here, the project team increased throughput and improved the ability to manage networks programmatically. Snavely describes the latter as “more flexible plumbing.”

To address automation requirements, the project’s API, or application programming interface, will let users submit jobs, check on their status, transfer data and more. The API’s purpose, Snavely says, is “to give the researchers who are writing software for their project the ability to interact with the center.”
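To make the idea concrete, here is a minimal Python sketch of what a researcher's script might do with such an interface: transfer a file, submit a job and check on its status until it finishes. It does not show the actual NERSC API. The `FakeCenter` class, its method names and the file names are all hypothetical, and the "center" is an in-memory stand-in, so the example runs without any network access.

```python
from dataclasses import dataclass, field
from enum import Enum
import itertools


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class FakeCenter:
    """In-memory stand-in for a computing center's API (hypothetical)."""
    files: set = field(default_factory=set)
    jobs: dict = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def transfer(self, filename: str) -> None:
        # Pretend the file has arrived on the center's storage.
        self.files.add(filename)

    def submit_job(self, script: str, input_file: str) -> int:
        # Refuse to run analysis on data that has not been transferred.
        if input_file not in self.files:
            raise FileNotFoundError(input_file)
        job_id = next(self._ids)
        self.jobs[job_id] = JobState.PENDING
        return job_id

    def status(self, job_id: int) -> JobState:
        # Report the current state, then advance the fake job one step
        # so that repeated checks simulate progress.
        state = self.jobs[job_id]
        if state is JobState.PENDING:
            self.jobs[job_id] = JobState.RUNNING
        elif state is JobState.RUNNING:
            self.jobs[job_id] = JobState.COMPLETED
        return state


def run_pipeline(center: FakeCenter, filename: str, script: str,
                 max_polls: int = 10) -> list:
    """Transfer data, submit a job and poll until it completes."""
    center.transfer(filename)
    job_id = center.submit_job(script, filename)
    history = []
    for _ in range(max_polls):
        state = center.status(job_id)
        history.append(state)
        if state is JobState.COMPLETED:
            return history
    raise TimeoutError(f"job {job_id} did not finish in {max_polls} polls")


if __name__ == "__main__":
    states = run_pipeline(FakeCenter(), "run_001.h5", "analyze.sh")
    print([s.value for s in states])
```

In a real deployment the three calls would go over the network to the center, and the polling loop would wait between checks. The shape of the script, with no person in the loop between transfer and result, is the point of the sketch.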