Clustering and Association Analysis for High-Dimensional Omics Studies

Advisor & Committee Chair: George Tseng, ScD

Committee Members:

  • Wei Chen, PhD, Department of Pediatrics
  • Ying Ding, PhD, Department of Biostatistics
  • Lu Tang, PhD, Department of Biostatistics 

 

Abstract:

With the rapid advancement of high-throughput technologies, a large amount of high-dimensional omics data has been generated in the public domain, which gives rise to various statistical and computational challenges in the cluster and association analysis of omics data. This dissertation addresses three problems in high-dimensional omics studies: estimation of tuning parameters in cluster analysis (Chapter 2), disease subtyping (Chapter 3), and association analysis between gene expression and multiple phenotypes (Chapter 4).

In Chapter 2, we propose a resampling framework called S4 for selecting tuning parameters in cluster analysis. Estimating the number of clusters (K) is a critical and often difficult task. Many methods have been proposed to estimate K, and S4 is the best performer among them in our evaluation. S4 measures the similarity (i.e., stability) between the clustering results of the whole data and of subsampled data and selects the K with the highest stability score, based on the belief that the true underlying K yields stable clustering results when the data are perturbed by subsampling. When clustering high-dimensional omics data, many irrelevant features exist and may interfere with detection of the true cluster structure, so feature selection is often needed to improve performance and interpretation. Witten and Tibshirani (2010) proposed a sparse K-means approach with lasso regularization on feature-specific weights to tackle this problem, where both the number of clusters K and the sparsity parameter λ must be estimated in advance. To the best of our knowledge, simultaneous estimation of these two parameters has received little study. We extend S4 to bridge this gap, and it shows superior performance in extensive simulations and nine real applications.
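
The stability idea described above can be illustrated with a small sketch. This is not the S4 method itself: it assumes plain K-means with a farthest-first initialization, the adjusted Rand index as the similarity measure, and 80% subsampling, all of which are stand-ins chosen for illustration. The function names (kmeans, adjusted_rand, stability_scores) and the toy data are hypothetical.

```python
import random
from collections import Counter
from math import comb

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, n_iter=100):
    """Lloyd's algorithm; farthest-first initialization is an assumption of this sketch."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def adjusted_rand(a, b):
    """Adjusted Rand index between two labelings of the same items."""
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def stability_scores(points, k_values, n_subsamples=10, frac=0.8, seed=0):
    """Mean agreement between whole-data and subsample clusterings for each K."""
    rng = random.Random(seed)
    n = len(points)
    scores = {}
    for k in k_values:
        full = kmeans(points, k, rng)
        total = 0.0
        for _ in range(n_subsamples):
            idx = rng.sample(range(n), int(frac * n))
            sub = kmeans([points[i] for i in idx], k, rng)
            # Compare the whole-data labels, restricted to the subsample,
            # with the labels obtained from clustering the subsample alone.
            total += adjusted_rand([full[i] for i in idx], sub)
        scores[k] = total / n_subsamples
    return scores

# Toy data: three well-separated two-dimensional blobs (simulated, not omics data).
data_rng = random.Random(1)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

scores = stability_scores(points, range(2, 6))
best_k = max(scores, key=scores.get)  # K with the highest stability score
print(scores, best_k)
```

Extending a sketch like this to also choose the sparsity parameter λ would mean scoring a grid of (K, λ) pairs instead of K alone. How S4 actually defines and combines its stability scores is not specified in this abstract.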

In Chapter 3, we propose a novel outcome-guided disease subtyping framework based on a weighted joint likelihood approach (named ogClustWJL). Traditionally, conventional cluster analysis (e.g., K-means) is used to identify subgroups of patients with similar expression patterns, without considering outcome information. As a result, the identified subgroups can be irrelevant to the clinical outcome of interest. Liu et al. (2020) proposed to incorporate outcome information into cluster analysis through a unified generative model (named ogClustGM). However, ogClustGM lacks the flexibility to tune the relative contributions of outcome association and gene clustering separation. In practice, the identified clusters are often dominated by outcome association, and the resulting disease subtyping model may perform poorly on independent validation data, a sign of overfitting. Our proposed ogClustWJL takes a user-defined weight as input to control the relative contributions of outcome association and gene clustering separation; by carefully tuning this weight, potential overfitting can be avoided.
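
As a rough illustration only, one generic way for a user-defined weight to trade off two likelihood components is a convex combination of their log-likelihoods. The notation below (w, θ, and the two component log-likelihoods) is assumed for illustration and is not taken from the dissertation, whose actual objective may differ.

```latex
\ell_w(\theta) = w \, \ell_{\text{outcome}}(\theta) + (1 - w) \, \ell_{\text{expression}}(\theta),
\qquad 0 \le w \le 1
```

Under this assumed form, w close to 1 lets outcome association dominate the fitted clusters, while w close to 0 reduces the fit to clustering on expression alone.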

In Chapter 4, we study the association between gene expression and multiple phenotypes. In complex diseases where multiple correlated phenotypes are available, analyzing and interpreting associated genes for each phenotype separately may not only reduce statistical power when the association with any single phenotype is weak, but also hinder interpretation because the correlation between phenotypes is ignored. We extend two P-value combination methods, the adaptive weighted Fisher's method (AFp) and the adaptive Fisher's method (AFz), to tackle this problem. Based on extensive evaluation, we recommend AFp, which achieves higher power in determining gene-phenotype association, detects heterogeneity of phenotypes with higher accuracy, and clusters genes based on gene-phenotype association patterns. A real omics application with transcriptomic and clinical data from complex lung diseases demonstrates that AFp yields insightful biological findings.
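
To make the idea of P-value combination concrete, the sketch below implements the classical Fisher's method and a simple search over 0/1 weights. It is a simplified stand-in, not the AFp or AFz procedures of the dissertation. It assumes independent P-values for the chi-square reference distribution, which does not hold for correlated phenotypes, and the minimum P-value it reports is a test statistic that would still need calibration (for example by permutation) before being read as a P-value. The function names and example P-values are hypothetical.

```python
from math import exp, log

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    half, m = x / 2.0, df // 2
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= half / i
        total += term
    return exp(-half) * total

def fisher_combine(pvalues):
    """Classical Fisher's method: -2 * sum(log p) is chi-square with 2k df under the null."""
    stat = -2.0 * sum(log(p) for p in pvalues)
    return chi2_sf_even(stat, 2 * len(pvalues))

def adaptive_binary_weights(pvalues):
    """Search 0/1 weights that minimize the Fisher P-value of the selected subset.

    For a fixed subset size m, the best subset is the m smallest P-values,
    so scanning the sorted P-values covers all candidate subsets.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    best_p, best_m = float("inf"), 0
    for m in range(1, len(order) + 1):
        p = fisher_combine([pvalues[i] for i in order[:m]])
        if p < best_p:
            best_p, best_m = p, m
    weights = [0] * len(pvalues)
    for i in order[:best_m]:
        weights[i] = 1  # this phenotype contributes to the combined signal
    return best_p, weights

# Hypothetical P-values for one gene tested against four phenotypes.
gene_pvalues = [1e-4, 0.8, 2e-3, 0.6]
min_p, weights = adaptive_binary_weights(gene_pvalues)
print(min_p, weights)
```

In a sketch like this, the selected 0/1 weights indicate which phenotypes drive the signal for a gene, and genes could then be grouped by their weight patterns.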

Contribution to public health:

The methods proposed in Chapter 2 are important for cluster analysis of high-dimensional omics data because K and λ are tuning parameters supplied by the user and can strongly affect the results and interpretation of cluster analysis. Chapter 3 proposes a novel outcome-guided cluster analysis framework for disease subtyping. Chapter 4 provides a practical framework for analyzing the association patterns between multiple phenotypes and gene expression in complex diseases. The methods proposed in this dissertation are tools for uncovering disease mechanisms and developing effective treatments in support of precision medicine.

Event Details

Please let us know if you require an accommodation in order to participate in this event. Accommodations may include live captioning, ASL interpreters, and/or captioned media and accessible documents from recorded events. We recommend requesting accommodations at least 5 days in advance.


Please contact Yujia Li for the Zoom link.

University of Pittsburgh