Keeping AI honest

Incorporating scientific constraints (solid lines) can generate datasets that don’t provide a full mathematical distribution (left). By perturbing the distribution (center), model training is easier. After training, researchers can recover the initial distribution (right). Illustration: Katie Keegan.

From astrophysics to drug discovery, generative AI has enormous scientific potential. But there’s a caveat: Scientists must have confidence in their generative AI results.

“The moment that you want to use these AI systems for scientific purposes, you have to take a step back,” says Katie Keegan, an Emory University computational mathematics Ph.D. student. For generative AI to be useful, she says, scientists must trust the math behind it.

Most commercial generative AI models — like ChatGPT for language and DALL-E for images — are trained on vast amounts of real-world data and then use probability to infer patterns in the data. When a user prompts the model with a question, it generates content based on statistical predictions alone.

For science, however, researchers often combine training data with further constraints — for example, the chemical laws that govern the bond angles between atoms in a molecule.

Adding such qualifiers “changes the size of the training data distribution” in unexpected ways, Keegan says. Probabilities, for instance, generate a bell curve of possibilities; adding scientific laws carves out parts of the bell curve, sometimes leaving just a small fraction of its original range.

“It’s nonintuitive,” says Keegan, a Department of Energy Computational Science Graduate Fellowship (DOE CSGF) recipient. “It took me several months to mathematically wrap my head around it.”

The unrealistically paper-thin distribution of the training data in turn skews the distribution of the generative AI results. This is a fundamental Achilles heel because scientists want to generate results that are consistent with the scope of possible experimental, real-world results.

To solve this constraint problem, Keegan has turned it on its head by adding further mathematical methods that perturb, or shift, the training data’s distribution but in specific directions informed by the constraint.

‘The generated distribution is faithful to that of the training data.’
Author Name

With her Ph.D. thesis advisor at Emory University, applied mathematician Lars Ruthotto, she “perturbed the original data distribution in a mathematically constraint-aware way.” The method “thickens its distribution and makes it easier for the generative AI model to learn, and when we’ve done all our training and sampling, we undo the perturbation to make the data scientifically realistic again,” says Keegan. Her research paper on the work has been accepted as a spotlight paper, one of the top 2 percent, in July for the 2026 International Conference on Machine Learning.

Keegan tested her technique with both theoretical analyses and computational experiments run with open-source image and protein data sets. “We found that perturbing the data in a constraint-aware way addresses both the sample constraint adherence and distribution quality,” she says. Her method offers “mathematical guarantees” that existing methods don’t provide — that the generated distribution is faithful to that of the training data.

Keegan’s research draws on a deep passion for mathematics. “I just fell in love with math” in my preteen years, says Keegan. At 14, she was admitted to Mary Baldwin University, one of the few American colleges that accepts precocious learners, earning a bachelor’s degree in applied math.

Her AI work builds on her earlier research with Elizabeth Newman, who is now at Tufts University. In that work Keegan developed ways to mathematically optimize the computational analysis of data from 3D to even higher dimensions, such as hyperspectral satellite imagery. To analyze such complex data computationally, researchers use a mathematical tool called a tensor singular value decomposition (tensor SVD). Though highly effective, tensor SVD algorithms significantly increase the costs of storing data inhigh-performance computing memory.

Keegan helped demonstrate that it was possible to finesse the core algorithm in Newman’s novel tensor SVD and “dramatically reduce memory usage without losing performance,” she says.

She tackled another memory-cost challenge during her 2024 practicum at Lawrence Berkeley National Laboratory. With Aydın Buluç, she demonstrated how to speed up the training of a graph neural network, a machine-learning tool commonly used in chemistry, biology, and social-network analysis. Instead of training the model on the whole data set at once, she used parallel sub-sets of data thus reducing computational memory costs, all while maintaining accuracy.

Keegan’s varied research has given her extensive experience in hands-on high-performance computing. “My life is just logging into a DOE supercomputer every day,” she says. She notes that the research relies heavily on her DOE CSGF allocation for the Perlmutter supercomputer at Berkeley Lab’s National Energy Research Scientific Computing Center.

As she works toward completing her Ph.D. in 2027, Keegan plans on “incorporating more complicated kinds of constraints” into scientific generative AI models, including the partial differential equations that are central to many DOE physics simulations. “That would be really interesting.”

About the Author

Jacob Berkowitz is a science writer and author. His latest book is The Stardust Revolution: The New Story of Our Origin in the Stars.

About the Author

Share This Article

DEIXIS Newsletter