I grew up in Wuhan, split by the Yangtze River, and earned my PhD in Pittsburgh, where the Allegheny and Monongahela meet to form the Ohio. In both cities, bridges shape daily life. They do more than link two banks; they keep people, goods, and information moving.

I see my work in biostatistics in the same way. Modern biomedicine has two shores. On one side are statistics and artificial intelligence, where we develop models and algorithms. On the other side is clinical medicine, where clinicians must act for real patients under time, safety, and cost constraints. Our research builds the bridge between them. We turn complex data into tools that clinicians can understand, test, and use.

At the center of this work is an idea I call precision medicine for all. Precision medicine should not serve only a narrow group of patients who are easy to study or well represented in trials. A bridge that carries only one kind of traffic has failed; in the same way, a model that works only in one ancestry group, one hospital, or one dataset does not meet our standard. We design methods that travel across populations, centers, and platforms and still perform in a predictable way.

Across projects, we keep a simple checklist: state assumptions clearly, handle batch effects and confounding, measure uncertainty, and test generalization. We release our work as reproducible software. In this way, the bridge from data to decision is visible and can be inspected and improved by others.


Research Interests

The “Cell-Line-Like-Me” Framework

PLoS Comput Biol 2024; Front Genet 2022

Many cancer drugs fail because the preclinical models that support them do not reflect the tumors seen in patients. Cell lines, organoids, and animal models each capture only parts of human disease, and mismatches at this stage can propagate to late-stage trials. We frame this translational gap as a problem of statistical congruence between models and human tumors across species, platforms, and studies.

We develop three parallel lines of work to close this gap. Direct harmonization quantifies global molecular similarity and basic compatibility between models and tumors. Latent factor and mixture models recover shared biological programs, such as pathways or regulatory modules, and match models and patients at that level rather than gene by gene. Flexible representation learning, including deep learning, aligns feature spaces while absorbing technical and species differences. Together, these approaches define “cell-line-like-me” avatars that guide model selection, drug prioritization, and study design.
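As a rough illustration of the first line of work, global molecular similarity can be pictured as a rank correlation between a tumor's expression profile and each candidate model's profile over a shared gene set. The sketch below is a toy example rather than the published method: the function names, the cell-line labels, and the expression values are all hypothetical, and Spearman correlation is chosen only because it is insensitive to platform-specific scaling.

```python
# Illustrative sketch only: toy values and hypothetical names throughout.
# Spearman correlation stands in for "global molecular similarity".

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_models(tumor, models):
    """Sort candidate models from most to least similar to the tumor."""
    scores = {name: spearman(tumor, profile) for name, profile in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical expression values over the same five genes.
tumor_profile = [5.0, 1.0, 3.0, 4.0, 2.0]
model_profiles = {
    "line_A": [50.0, 10.0, 30.0, 40.0, 20.0],  # same gene ordering, different scale
    "line_B": [1.0, 5.0, 3.0, 2.0, 4.0],       # reversed gene ordering
    "line_C": [4.0, 1.0, 3.0, 5.0, 2.0],       # two genes swapped
}

ranking = rank_models(tumor_profile, model_profiles)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy setting, the model whose genes are ordered like the tumor's ranks first even though its values sit on a different scale, which is the property that makes rank-based similarity a reasonable first screen across platforms.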

Machine Learning for Tumor Metastasis

Bioinformatics 2024

Most cancer deaths are caused by metastasis rather than the primary tumor. Metastasis reflects a chain of events involving tumor cells, microenvironment, and host response, which no single dataset measures fully. Bulk and single-cell studies often give partial and sometimes conflicting pictures. The central challenge is to extract stable metastatic programs that hold across cohorts and platforms.

We organize this work into three directions. First, we build multimodal risk prediction models that combine multi-omics and clinical data to stratify metastatic risk in a way that can move across cohorts and centers. Second, we study intra-tumor heterogeneity and the tumor microenvironment, including metabolite-constrained modules and cell–cell interactions that may drive metastatic escape and colonization. Third, we develop quantitative models of the metastatic cascade, linking risk scores and molecular modules to stage-to-stage transitions and potential intervention points. Together, these efforts aim to turn fragmented data into coherent, clinically useful descriptions of metastasis.

Statistical Modeling of Metabolic Syndromes

Metabolic syndromes are major contributors to cardiovascular disease, but in practice they are often handled as a simple yes/no label. We instead view metabolic health as the state of a connected biological system that changes over time, shaped by interacting organs, pathways, and environments.

Our work follows three directions. First, we use large cohorts and routine clinical indices to evaluate how different definitions and combinations of metabolic abnormalities predict long-term outcomes. Second, we integrate multi-omics data with network analysis to identify molecular subtypes and modules that underlie distinct clinical patterns and risks. Third, we build dynamic models of metabolic trajectories, describing how people move between states and exploring when and how interventions might reverse or stabilize these paths. Together, these directions aim to move from static labeling to time- and system-aware management of metabolic health.
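One simple way to picture the third direction is a first-order Markov chain over discrete metabolic states, with transition probabilities estimated from repeated clinic visits. The sketch below assumes that simplification for illustration only; the state labels, the trajectories, and the function name are hypothetical and are not drawn from any cohort or from our published models.

```python
from collections import Counter, defaultdict

# Illustrative sketch only: hypothetical states and made-up trajectories.
# Assumes a first-order, discrete-time Markov chain over metabolic states.


def estimate_transitions(trajectories):
    """Estimate transition probabilities from observed state sequences."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        # Count each consecutive (current, next) pair of visits.
        for current, following in zip(trajectory, trajectory[1:]):
            counts[current][following] += 1

    probabilities = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        probabilities[state] = {
            nxt: count / total for nxt, count in next_counts.items()
        }
    return probabilities


# Each list is one person's state at successive visits (hypothetical data).
trajectories = [
    ["healthy", "at_risk", "syndrome", "syndrome"],
    ["healthy", "healthy", "at_risk", "healthy"],
    ["at_risk", "syndrome", "at_risk", "at_risk"],
]

transitions = estimate_transitions(trajectories)
for state, row in transitions.items():
    print(state, {nxt: round(p, 2) for nxt, p in row.items()})
```

Transitions that move back toward a healthier state, such as from "syndrome" to "at_risk" in this toy data, are the ones of interest when asking whether an intervention can reverse or stabilize a trajectory.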


Awards