Date of Award

9-1-2021

Degree Name

Master of Science

Department

Computer Science

First Advisor

Huang, Chun-Hsi

Abstract

Genotype data, consisting of large numbers of markers, are used in demographic and association studies to identify genes related to specific traits or diseases. Handling these datasets for population structure inference usually takes a significant amount of time. We therefore propose applying PCA to genotype data and then using clustering algorithms to assign individuals to their subpopulations. We collected both real and simulated datasets for this study. We applied PCA and selected the significant features, then applied five different clustering techniques to obtain better results. Furthermore, we studied three different methods for predicting the optimal number of subpopulations in a dataset. Results on four simulated datasets and two real human genotype datasets show that our approach performs well in the inference of population structure. NbClust proved more effective for inferring the subpopulations in a population. We also show that centroid-based clustering algorithms, such as k-means and PAM, perform better than model-based, spectral, and hierarchical clustering algorithms. The approach also has the benefit of being fast and flexible in the inference of population structure.
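
The general idea of the abstract (reduce genotype data with PCA, then cluster individuals in the reduced space) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the toy simulation, the 0/1/2 genotype coding, the power-iteration PCA, the farthest-first seeding, and all function names are hypothetical. It uses only the Python standard library and plain k-means, one of the centroid-based methods the abstract mentions.

```python
import math
import random


def simulate_genotypes(n_per_pop=30, n_markers=200, n_pops=2, seed=0):
    """Toy data (assumption): each subpopulation gets its own allele
    frequencies; genotypes are allele counts coded 0, 1 or 2."""
    rng = random.Random(seed)
    rows, labels = [], []
    for pop in range(n_pops):
        freqs = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
        for _ in range(n_per_pop):
            # Two independent allele draws per marker give a count in {0, 1, 2}.
            rows.append([(rng.random() < f) + (rng.random() < f) for f in freqs])
            labels.append(pop)
    return rows, labels


def pca_scores(rows, n_components=2, iters=50, seed=0):
    """Approximate the top principal-component scores by power iteration
    with deflation on the column-centered genotype matrix."""
    rng = random.Random(seed)
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    X = [[x - m for x, m in zip(row, means)] for row in rows]
    scores = [[] for _ in range(n)]
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in means]
        for _ in range(iters):
            s = [sum(a * b for a, b in zip(row, v)) for row in X]        # X v
            v = [sum(a * b for a, b in zip(s, col)) for col in zip(*X)]  # X^T (X v)
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        s = [sum(a * b for a, b in zip(row, v)) for row in X]
        for i in range(n):
            scores[i].append(s[i])
            # Deflate: remove this component so the next pass finds another one.
            X[i] = [x - s[i] * a for x, a in zip(X[i], v)]
    return scores


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    assign = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == assign:
            break
        assign = new
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return assign


genotypes, truth = simulate_genotypes()
clusters = kmeans(pca_scores(genotypes), k=2)
```

In this sketch the number of clusters is fixed at two by hand; the thesis instead studies methods, including NbClust, for estimating the number of subpopulations, and that step is not reproduced here.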

Access

This thesis is only available for download to the SIUC community. Current SIUC affiliates may also access this paper off campus by searching Dissertations & Theses @ Southern Illinois University Carbondale from ProQuest. Others should contact the interlibrary loan department of their local library or ProQuest's Dissertation Express service.