Creating and managing massive numbers of records has turned out to be a long-standing challenge for computer file-system researchers. Handling millions or billions of files is extremely difficult, let alone trillions.
But that’s what a team of researchers from Los Alamos National Laboratory (LANL) and Carnegie Mellon University is doing, satisfying a demand for managing truly huge quantities of files. Their solution, DeltaFS, is a distributed file system for high-performance computing (HPC) that’s distributed via GitHub.
“We tried to go back to the first principles and really examine how to build a scalable metadata service,” says Brad Settlemyer, a LANL computer scientist and DeltaFS project leader. In file-system research, metadata operations are those that create files, manage their attributes and remove them.
In the era of massive data-generation, file systems focus on storage, says George Amvrosiadis, a Carnegie Mellon assistant research professor collaborating with LANL. “DeltaFS pools together the cluster’s compute and networking resources, which typically wait on storage” during calculations, “and uses them to perform work that significantly speeds-up data analysis. In practice, this helps scientists answer questions and make discoveries faster.”
DeltaFS is software-defined and transient, meaning file-system design and provisioning can be decoupled from HPC platforms – effectively enabling control of metadata performance and the ability to process 100 times more file-creation operations than other approaches.
DeltaFS differs from traditional HPC file systems in a few important ways. For starters, it doesn’t use dedicated metadata servers. DeltaFS also avoids unnecessarily synchronizing the file system by having temporary/local namespaces – which ensure that all names used within a program are unique – rather than a single global namespace, or an enterprise-wide abstraction of all file information. “By using temporary/local namespaces, DeltaFS avoids the problem of existing metadata workloads where most of the input/output (I/O) operations are very small reads and writes to metadata servers,” says Settlemyer. “This allows DeltaFS to do large bulk changes to the metadata as entire namespaces are checked in/out.”
As the team made progress building a scalable file system metadata layer, it quickly became clear they had a method “capable of handling truly massive numbers of files,” Settlemyer says. “So we began to look at some of the problems scientists at LANL were having with large amounts of data to see how we would rethink those with a truly scalable metadata service.”
When the team started working with scientists who had data problems, they realized that DeltaFS offered an opportunity to accelerate scientific analysis by searching for extremely small amounts of information within large data sets. “In technical parlance,” Settlemyer says, “we call these ‘highly selective queries,’ but the easiest analogy is to call these needle-in-the-haystack problems.”
Amvrosiadis is working with LANL to modify vector particle-in-cell (VPIC), a popular particle-in-cell simulation framework, “to provide a turnkey solution for scientists who want to benefit from DeltaFS – without understanding its intricacies,” he says. VPIC models kinetic plasmas in one, two or three spatial dimensions.
One plasma physicist, for example, wanted to use VPIC to explore trajectories for only the highest-energy particles in simulations that tracked a trillion of them. And he only would know which particles have the highest energies when the simulation ended.
“We were quickly able to reformulate this from a massive sorting and searching problem into a simple metadata problem,” Settlemyer says. “Rather than storing all of the data within a single file and doing a massive search and sort across a trillion entries in each file, we simply created a file for every particle.”
When the simulation ended, the plasma physicist knew which of the trillion particles had the highest energy, and he could simply look within a thousand or so files of interest among the trillion for specific particles. “It was approximately a hundred to a thousand times faster than the existing sorting-and-searching-based approach,” Settlemyer says.
Now that the scientists are a year into working with DeltaFS, they want to make it easier to use. Running large-scale scientific simulations requires a lot of practice and skill, and tiny configuration errors produce millions of messages simply because projects run on millions of processing cores.
“It can be difficult to figure out what went wrong when you’re inundated with status and error messages,” Settlemyer notes. “Further, every simulation is in some way a snowflake because every scientist is trying to explore a unique piece of some physical phenomenon.”
To make DeltaFS easier to use, the team is pushing a set of patches into the VPIC code repository, Settlemyer says. “We really want to make these needle-in-the-haystack type of queries –which, in VPIC’s case, is called particle tracing – straightforward and to reduce the time to discovery for scientists.”
Ramping up to extreme scale has been “shockingly difficult,” Settlemyer says, and the team found the limits of their algorithms multiple times when they tried to create and write to a trillion particle files.
They’re working on an algorithm called “the shuffle phase,” in which all of a million application processes talk to all of a million metadata servers. A basic approach to this, in which any of the application processes are allowed to talk to any of the servers, works at small scale.
“But if we allow each application process to maintain a communication buffer for one million servers, we don’t have enough memory in all of Trinity (LANL’s Cray supercomputer) – which has 2 petabytes of RAM – to support those buffers,” Settlemyer says. “And it has always been a goal of DeltaFS to use no more than 3 percent of system memory.”
To scale to a million processes, the team created a new all-to-all communication algorithm called “the three-hop,” which means each application can only send to a small number of servers. But those servers can forward the message to another server if needed.
“In this way, every application can send to every server at the cost of taking three hops to do so, and every process only needs to be able to communicate with a much smaller number of total processors,” he says. “This is a classic scalability tradeoff: We make the common case of sending a message a little slower and a little more complicated. But in return we can scale this technique up to a million processes.”
Despite a lot of foundational work showing why metadata systems were never going to manage trillions of files or operate on thousands of compute nodes, in DeltaFS Settlemyer and colleagues designed and built a scalable metadata service that can indeed achieve that scale.
“Now that an exemplar exists that shows you can create one trillion files, the storage research community is going to explore the tradeoff space available,” he says. “I look forward to the time when a team builds a file system that can create and manage a quadrillion files. It may happen a lot sooner than we all expect.”
More broadly, DeltaFS shows how to build astonishingly fast systems at the intersection of HPC and distributed-systems research, Settlemyer says. “Interaction between these research communities is a future direction with a lot of potential.”