Mining cause and effect

Respiratory illness levels in the counties surrounding California’s 2018 Camp Fire (location marked) rose above what would have been expected if the fire hadn’t occurred. Darker reds indicate higher levels of illness, and dashed lines mark sparsely populated areas, where estimates carry high uncertainty. (Map: Miruna Oprescu.)

Solving problems often starts with understanding their causes. But that’s easier said than done. Statistics might show that people living near wildfires suffer from more respiratory illnesses in summer months, but that doesn’t tell researchers why people got sick. Did the smoke, or something unrelated, such as higher pollen counts, make it harder to breathe?

Computational methods often make predictions that rely on statistics. “A lot of what machine learning normally does is find correlations,” says Miruna Oprescu, a Ph.D. candidate at Cornell University and a Department of Energy Computational Science Graduate Fellow. Extracting causation requires careful study design beyond finding these patterns. With her advisor, Nathan Kallus, she studies how machine learning algorithms can not only improve predictions but also extract causation from complex real-world data.

Oprescu decided to pursue a Ph.D. after working as a data scientist and software engineer at Microsoft Research. Finding causation in problems using machine learning algorithms had an appeal and a huge range of applications from health research to natural disasters. “I wanted my research to have an impact on society or on science,” she says.

Machine learning algorithms run into some common problems with real-world data. First, researchers can’t test every scenario, a problem that pops up often in weather research. Wind direction may drive a storm’s path, for instance, but other variables, such as landmasses and temperature gradients, come into play.

In 2024, Oprescu worked with Shinjae Yoo at Brookhaven National Laboratory on causal inference, examining the health impacts of wildfires and air pollution, events shaped by weather and atmospheric conditions. Such events lend themselves naturally to causal questions. “What would happen if, for example, this wind were going this direction instead of this direction?” Oprescu says. “You see one Earth, you see what happened. You don’t know what would have happened on a parallel Earth.”

She developed a method for spatial-temporal causal inference — understanding the causes of events from data that vary over space and time — and showed that 2018 wildfires in California led to an increase in respiratory illnesses in people living nearby. Oprescu presented this work at NeurIPS’25 in December.

Limited data is also a problem in medical studies. In a clinical trial, it would be ideal if researchers could study how a single person responded to both medicine A and medicine B. But the two scenarios cannot be tested simultaneously, so researchers instead randomly assign test subjects to group A or group B. This strategy aims to balance out other possible variables, such as differences in economic status, environment or stress.
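That assignment step can be sketched in a few lines of Python. This is a toy illustration, not any trial’s actual protocol; the subject labels and seed are made up:

```python
import random

def randomize(participants, seed=42):
    """Assign each participant to arm 'A' or 'B' at random.

    Random assignment balances unmeasured variables (economic
    status, environment, stress) across the arms in expectation,
    so outcome differences can be attributed to the treatment.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {p: rng.choice(["A", "B"]) for p in participants}

groups = randomize([f"subject_{i}" for i in range(10)])
```

In a real trial, randomization schemes are more elaborate (blocking, stratification), but the principle is the same: chance, not the researcher, decides who gets which treatment.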

“A randomized control trial is the gold standard here,” Oprescu explains. “So now I have the same kind of population in both settings.”

Working with human populations also can introduce situations that make the data more complicated to interpret. One situation that comes up is noncompliance. For example, in a randomized clinical trial to study how well drug A treats headaches compared with drug B, participants might not take their drugs regularly or as prescribed. This noncompliance can interfere with extracting useful results. “There’s a lot of these hidden biases that you have to account for,” Oprescu says.

If one drug works exceptionally well or another leads to severe side effects, trials should be stopped early. But researchers want to do so in a way that does not invalidate a trial’s results.

To overcome these challenges, Oprescu uses adaptive experimentation, a strategy that lets researchers adjust a study’s design midstream so that they gather enough data to learn something useful. As the experiment proceeds, for example, researchers may uncover pockets of noncompliance.

In the headache study example of drug A versus drug B, suppose fewer participants randomized to the drug A group follow through on taking that medication than those in group B. Researchers might then lack the data to determine which drug works better. The adaptive experimentation algorithm would suggest ways clinicians could improve the study design in later rounds, Oprescu says. In this hypothetical scenario, they might recommend that more patients, perhaps 60 percent rather than 50, be asked to take drug A to ensure sufficient data on the drugs’ efficacy.
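One way to turn that reasoning into a rule, sketched here as a toy heuristic of my own construction rather than Oprescu’s published algorithm, is to allocate new participants in inverse proportion to each arm’s observed compliance rate, so both arms accumulate compliant observations at a similar pace:

```python
def next_round_share_a(assigned_a, complied_a, assigned_b, complied_b):
    """Suggest the fraction of next-round participants to assign to drug A.

    Toy heuristic (illustrative only): weight each arm by the inverse
    of its observed compliance rate, then normalize. The arm with more
    noncompliance receives a larger share of new assignments.
    """
    rate_a = complied_a / assigned_a  # observed compliance in arm A
    rate_b = complied_b / assigned_b  # observed compliance in arm B
    weight_a, weight_b = 1 / rate_a, 1 / rate_b
    return weight_a / (weight_a + weight_b)

# Round 1: 50 assigned per arm; 30 complied in arm A, 45 in arm B.
share = next_round_share_a(50, 30, 50, 45)  # -> 0.6, i.e., 60 percent to arm A
```

With 60 percent compliance in arm A and 90 percent in arm B, this heuristic shifts the next round to a 60/40 split, echoing the hypothetical numbers above; real adaptive designs must also preserve the statistical validity of the final analysis.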

Oprescu worked on this project with Kallus and Brian Cho and also presented these findings at NeurIPS’25, a result that she says received “a lot of interest” from academia, the medical community and tech companies.

Elizabeth Fernandez

