Date of Award
Master of Science
Current similarity measures include the normal Euclidean distance, the Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient, the Z-score (standard score), Spearman's footrule distance, the Kendall tau rank correlation coefficient, the Jaccard similarity coefficient, Cayley's distance, and the Hamming distance. None of these can capture the similarity between genes whose expression shows arbitrary time-delay and time-gap behavior. A novel algorithm that enables time-delay alignment and time-gap alignment is therefore proposed and integrated with those existing approaches that are local comparisons, so that they fit the underlying biological context. Time-delay behavior occurs when a gene's expression triggers a delayed expression in its co-regulated or anti-co-regulated peers. In addition, an arbitrary time lag may appear because of experimental or measurement error. If the data for a gene exhibit one or both of these conditions, similarity measurement with traditional methods will either underestimate the similarity or miss the relationship completely. To align the gene data, an alignment algorithm was developed that both aligns time delays and removes time gaps. Because the normal Euclidean distance and the Pearson product-moment correlation coefficient are both local comparisons, the algorithm could be integrated into these two approaches to accommodate time-delay and time-gap behavior. All implementations use parallel programming with the Message Passing Interface (MPI) in C, splitting the workload dynamically from a master server to many slave servers to speed up the computation. Synthetic and real microarray data are used to demonstrate the superiority of the proposed method. The experimental results show that the normal Euclidean distance and the Pearson product-moment correlation coefficient, combined with the alignment algorithm, capture the similarity of more co-regulated or anti-co-regulated gene pairs.
Several improvements are proposed, including isolation of experimental conditions, weighted averages, and statistical analysis for threshold setting. Because time-delay behavior in gene expression patterns is not unusual and usually plays an important role in the cell system, the new approach will help scientists discover important knowledge that would otherwise not be revealed. The approach is sensitive to a wide spectrum of expression patterns that traditional methods tend to ignore. Global comparison algorithms usually include a preprocessing step that normalizes the entire data set to achieve better results; the implemented time-delay and time-gap alignment algorithm is therefore not effective on such normalized data. To cope with the intensive computing needs of large-scale microarray data, parallel code using the Message Passing Interface in C was developed with a dynamic workload-balancing strategy and executed on a Linux cluster.
This thesis is only available for download to the SIUC community. Others should
contact the interlibrary loan department of their local library.