Yulan Liang has published more than 60 peer-reviewed research articles in the biostatistics, bioinformatics and biomedical science fields. She has been funded as a PI by the NSF, and has served as Biostatistics Core Director of NIH P30, P20 and U01 center grants and as Co-Investigator on multiple NIH-funded grants. She has also served as a Co-Investigator and Biostatistician on projects funded by health associations, private organizations and other funding agencies, including the Robert Wood Johnson Foundation, the Alzheimer's Association and the National Council of State Boards of Nursing. She has broad independent research and collaborative experience in developing, implementing and evaluating novel integrative statistical and computational methodologies for big biomedical data (i.e., omics and translational research), responsive to new quantitative and scientific problems arising from advances in medicine, biology and health care systems.
Identification of disease signatures or protein biomarkers is crucial for medical diagnosis, prognosis, and drug target selection in complex diseases such as cancer. Statistical models with single feature selection carry a multiple-testing burden and have low power when the sample size is limited. High correlations among the markers, along with small to moderate effects, often lead to unstable selections and cause reproducibility issues. Machine learning with ensemble feature selection (EFS) can alleviate and compensate for these drawbacks. Proteogenomics is an integrated, systematic approach that allows assembly of the molecular puzzle from DNA (genomics) to RNA (transcriptomics) and protein (proteomics). Mass spectrometry (MS) based proteomic technologies have enabled global expression profiling at the protein level to examine the linkages among proteins, mRNA, genes, and cancer status, and to help determine which protein markers are linked to ovarian cancer subtypes and treatment heterogeneity. In this work we investigate the relationships of MS-measured global proteins to ovarian cancer status (defined by gene mutation status) and mRNA classes. MS proteomic ovarian cancer data were obtained from the Clinical Proteomic Tumor Analysis Consortium and The Cancer Genome Atlas (TCGA), which include genomic and transcriptomic characterization of ovarian high-grade serous carcinoma and 9606 global proteins measured by MS. We develop a three-stage homogeneous ensemble feature selection (HEFS) approach for both identifying biomarkers and improving prediction accuracy for binary cancer outcomes (putative homologous recombination deficiency positive or negative) and multiple mRNA classes (differentiated, proliferative, immunoreactive, mesenchymal, unknown).
We further applied and compared various EFS methods in machine learning models such as random forests, support vector machines, and neural networks for predicting both binary and multi-class outcomes. Despite the different prediction accuracies of the various machine learning models, the EFS methods identify consistent and reproducible sets of protein biomarkers linked to the outcomes.
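
As a rough illustration of the general idea behind homogeneous ensemble feature selection, the sketch below applies one base selector (here a Welch t-statistic ranking) to many bootstrap resamples and keeps the features that are selected most often. The abstract does not describe the three stages of the HEFS approach, so this is not the authors' method. The function names, the choice of base selector, the selection-frequency threshold, and the simulated data are all assumptions made for illustration.

```python
import random
import statistics


def welch_t(a, b):
    """Absolute Welch t-statistic between two lists of numbers."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    denom = (va + vb) ** 0.5
    if denom == 0:
        return 0.0
    return abs(statistics.mean(a) - statistics.mean(b)) / denom


def select_top_k(X, y, k):
    """Base selector (assumed): rank features by |t| and keep the top k indices."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((welch_t(pos, neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def homogeneous_efs(X, y, k=2, n_boot=50, threshold=0.6, seed=0):
    """Run the same selector on bootstrap resamples and aggregate by frequency.

    Returns (stable, freq): the indices selected in at least `threshold` of
    the resamples, and the per-feature selection frequency.
    """
    rng = random.Random(seed)
    n = len(X)
    counts = [0] * len(X[0])
    done = 0
    while done < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        # Skip resamples with fewer than two samples in either class,
        # because the variance in welch_t would be undefined.
        if yb.count(1) < 2 or yb.count(0) < 2:
            continue
        Xb = [X[i] for i in idx]
        for j in select_top_k(Xb, yb, k):
            counts[j] += 1
        done += 1
    freq = [c / n_boot for c in counts]
    stable = [j for j, f in enumerate(freq) if f >= threshold]
    return stable, freq


def make_toy_data(n=40, n_features=6, seed=1):
    """Simulated data (not the CPTAC/TCGA data): features 0 and 1 are informative."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 3.0 * label
        row[1] += 3.0 * label
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    stable, freq = homogeneous_efs(X, y)
    print("stable features:", stable)
    print("selection frequencies:", freq)
```

Aggregating by selection frequency is one common way to address the instability mentioned above: a marker that is picked only in a few resamples is likely an artifact of sampling noise or of correlation with another marker, whereas a marker picked in most resamples is a more reproducible candidate.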