Variable pursuits

DNA and the patterns in genetic data. *Illustration: The Digital Artist via Pixabay.*

Separating signal from noise is essential across disciplines. This process helps researchers determine if atmospheric data patterns are real, if fraudulent activity has occurred in banking or if a new drug has succeeded in a clinical trial.

Statisticians call this variable selection: identifying which variables, or features, are most strongly associated with an outcome of interest. As Ashlynn Crisp, a mathematical sciences Ph.D. student at Portland State University, has learned, there’s no universally agreed-upon method for doing so.

Crisp, now a Department of Energy Computational Science Graduate Fellowship (DOE CSGF) recipient, first encountered this lack of consensus as a community college student. The issue resurfaced two years later in a senior-level statistics course. Her professor told the class that once a dataset exceeds twenty variables, researchers struggle to determine which ones are most strongly linked with the outcome. Because each variable can be either included or left out, thirty variables would create over a billion potential variable combinations. The professor’s explanation suggested that’s just the way it is, but the conundrum stayed with Crisp.
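The billion-plus figure follows from simple counting: if every variable is either in a model or out of it, p variables give 2^p possible subsets. The short Python sketch below illustrates that arithmetic; the function name is a hypothetical one chosen for illustration and is not taken from Crisp's work.

```python
def count_candidate_models(num_variables: int) -> int:
    """Count the subsets of variables when each one is either included or excluded."""
    return 2 ** num_variables


# Illustrative variable counts only; these are not datasets from the article.
for p in (10, 20, 30):
    print(f"{p} variables -> {count_candidate_models(p):,} candidate models")
```

Each additional variable doubles the count, which is why checking every combination stops being practical well before a dataset reaches the hundreds or thousands of variables discussed later in the article.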

As she considered graduate research, Crisp leaned toward foundational projects that could be meaningful in many fields. “Variable selection is such a wide-reaching problem. Anyone working in data science, machine learning or statistics runs into it.”

Crisp quickly learned that researchers have many approaches for selecting variables, and the methods’ effectiveness depends on context. “Some methods work better than others, but you don’t know when they’re applicable,” she says. This uncertainty can prevent researchers from identifying key variables or lead them to misjudge variables’ importance.

Existing methods rely on techniques such as Markov chain Monte Carlo (MCMC), which uses iterative, partially random steps to identify variables of interest. Though powerful, MCMC can become computationally prohibitive as the number of variables grows. Faster alternatives often sacrifice the ability to quantify uncertainty about a variable’s significance.

Working with her advisor Daniel Taylor-Rodriguez, Crisp is developing a new approach that uses a Bayesian framework designed to preserve uncertainty quantification while avoiding MCMC’s long, unpredictable run times. Her method was inspired by a small, easily overlooked proof that she found in a 2011 statistics paper.

Performing variable selection in the Bayesian framework currently comes with a high computational cost. MCMC-based algorithms spend time evaluating variables whose probabilities are low enough to be ignored. Crisp’s method uses a data augmentation technique to identify the important variables and avoid squandering computational power on those that ultimately don’t need to be considered.

To test her method, Crisp simulates datasets of 200 to 2,000 variables that cover the typical range of scenarios for which researchers now use MCMC. The simulations offer insight into scenarios where MCMC evaluates too many or too few models and can help arrive at the minimum number of models necessary.

Crisp hopes soon to apply her methodology to Alzheimer’s disease. Although many genes have been linked to this type of dementia, researchers still don’t know how each contributes to its progression. “My method would allow researchers to get answers on what genes to look into further. It can hopefully do so quickly, more accurately and with better uncertainty quantification than current methods can give them.”

During her 2024 Lawrence Livermore National Laboratory practicum, Crisp saw an opportunity to expand her horizons. “Quantum is getting a lot of momentum right now, and I see opportunities there for statistical problems.”

She worked on adapting the statistical methods from her PSU research to run on quantum hardware. In the limited time available, she couldn’t get the methods to run successfully, but the experience gave her an idea of the problems that could be solved on quantum computers and taught her to communicate with interdisciplinary teams. She “left with more appreciation for how science in a bigger community works and how everyone comes at their disciplines a little bit differently.”

She hopes those lessons will help her follow her enthusiasm for bringing her methods to other fields. “I don’t really have one application that I’ve fallen in love with yet. One of the things that’s drawn me to statistics is that it’s a tool for everyone in every discipline to use.”

About the Author

Wudan Yan is a Seattle-based freelance science writer.
