Krishnamoorthy and his team worked on porting or optimizing the codes to run on Jaguar, Oak Ridge National Laboratory’s Cray XT5 system. Now the PNNL scientists are doing something similar on Jaguar’s successor, Titan.

Recently, the research team demonstrated that the most computationally intensive portion of the calculation can be run on 210,000 processor cores of Titan, a Cray XK7 at Oak Ridge’s Leadership Computing Facility, achieving more than 80 percent parallel efficiency.

“When I joined PNNL we were still looking at codes that run on a few thousand cores,” Krishnamoorthy says. When Chinook, the lab’s computational chemistry machine, arrived, “we were immediately jumping up to 18,000 cores.” Doing one calculation at a time and constantly shuttling data to disk drives “was not going to suffice.”

Tackling the jump to highly parallel computing and innovating in how the system deals with workload and faults earned Krishnamoorthy a DOE Early Career Research Program award. It grants him $2.5 million over five years to explore ways to extend his ideas to exascale computing. Shortly thereafter, Krishnamoorthy learned he had also won PNNL’s 2013 Ronald L. Brodzinski Award for Early Career Exceptional Achievement.

He’s now broadening the methods developed for computational chemistry to apply to any algorithm that has load-imbalance issues.

“You want a dynamic environment where the execution keeps on going and the user doesn’t have to worry about statically scheduling everything – the run-time engine just automatically adapts to changes in the machine and in the problem itself,” he says.

The current framework is called the Task Scheduling Library (TASCEL) for Load Balancing and Fault Tolerance.

“We are now trying to adapt this method to the new codes as they are developed,” Krishnamoorthy says. He wants to automate the process so that moving a program from one version to the next generation is seamless.

“You have to write the program in terms of collections of independent work or tasks and the relationships among them in terms of who depends on who and what data they access,” he says. “As long as it is written this way, the run-time can take over and do this load balancing, communication management and fault management automatically for you.”
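The flavor of that idea can be shown with a small, hedged sketch of dynamic task scheduling; it is not TASCEL’s actual API. The computation is expressed as independent tasks of uneven, unpredictable cost, and a runtime hands each task to whichever worker is free instead of fixing the assignment in advance. The task sizes, worker count and thread-pool “runtime” below are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

def run_task(work_units):
    # One independent piece of work; its cost is not known in advance.
    time.sleep(0.001 * work_units)  # stand-in for an uneven block of computation
    return work_units

# Tasks of very different sizes -- the source of load imbalance.
tasks = [random.randint(1, 20) for _ in range(200)]

# Dynamic scheduling: each of eight workers pulls the next task the moment it is idle,
# so no worker sits waiting while another still holds a long backlog of work.
with ThreadPoolExecutor(max_workers=8) as runtime:
    futures = [runtime.submit(run_task, t) for t in tasks]
    completed = sum(f.result() for f in as_completed(futures))

print("work units completed:", completed)
```

A static schedule would instead split the 200 tasks into eight fixed chunks up front; with uneven task costs, the unluckiest worker finishes last while the others idle, which is exactly the imbalance a dynamic runtime avoids.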

His methods address two of the most daunting challenges facing exascale computing: load imbalance and fault tolerance. He’s contemplating not what computers will look like in the next two to three years but instead what challenges there’ll be with applications running on exascale computers eight to 10 years from now.

The cost of failure can be steep, but Krishnamoorthy is making it less so every day.

 
