Date of Award


Degree Name

Doctor of Philosophy


Plant Biology

First Advisor

Geisler, Matt


AN ABSTRACT OF THE DISSERTATION OF BELAN M. KHALIL, for the Doctor of Philosophy degree in Plant Biology, presented July 11, 2018, at Southern Illinois University Carbondale.
TITLE: ANALYSIS OF THE CIS-REGULATORY ELEMENT LEXICON IN UPSTREAM GENE PROMOTERS OF ARABIDOPSIS THALIANA AND ORYZA SATIVA.
MAJOR PROFESSOR: Dr. Matt Geisler
Gene expression in plants is partly regulated through the interaction of trans-acting factors with the promoter region of a gene. Trans-acting factor binding sites consist of short nucleotide sequences, most often located in the upstream promoter region. These binding sites, the cis-regulatory elements (CREs), vary in structure, complexity, and function. By binding trans-acting factors, CREs connect genes to signalling and regulatory pathways that affect plant growth, development, and response to the environment. Like words in a language, CREs, and thus promoters, can be analyzed by looking for spellings (patterns of nucleotides) associated with meanings (functions). This kind of analysis offers an opportunity for a comprehensive understanding of promoter language. Identifying and characterizing CREs is challenging both experimentally and bioinformatically, and has previously been accomplished by discovering degenerate words that contain ambiguous nucleotides. Such a result implicitly hypothesizes that the binding of a specific trans-acting factor is somewhat promiscuous (or sloppy) and that all words represented by a degenerate pattern bind equally well. In this study, we unpack this “degeneracy hypothesis” by systematically considering each combination of letters independently for CRE function. Our results demonstrate that not all degenerate combinations of published CREs have the same effect on gene expression.
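To make the "degeneracy hypothesis" concrete, the following minimal sketch expands a degenerate pattern into every concrete word it represents, so that each word could then be evaluated on its own. It uses the standard IUPAC ambiguity codes; the function name `expand_degenerate` and the example pattern are illustrative assumptions and are not taken from the dissertation.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand_degenerate(motif):
    """Return every concrete word represented by a degenerate motif."""
    choices = [IUPAC[base] for base in motif.upper()]
    return ["".join(word) for word in product(*choices)]

# Hypothetical 8 bp pattern, chosen only for illustration:
# R (2 options) x Y (2 options) x N (4 options) = 16 concrete words.
words = expand_degenerate("ACGTRYGN")
print(len(words))
```

Under the degeneracy hypothesis all of these words would bind equally well; testing each word separately is what allows that assumption to be checked.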
All 65,536 possible 8 bp CRE words were systematically searched for and compared in the 500 bp upstream promoters of all Arabidopsis thaliana genes and the 1000 bp upstream promoters of all Oryza sativa genes. The function of each CRE was evaluated by statistically comparing the presence or absence of the element in a promoter with that gene's response (induction or suppression) to stimuli in 1691 publicly available transcriptomes of differential gene expression data. Arabidopsis, a model dicot, had a much larger number of such datasets than rice; rice was chosen for comparison because it had the largest number of datasets for a monocot, the most distantly related plant group with sufficient data available. A comprehensive list of 8 bp words associated with differential gene expression, known linguistically as a lexicon, was retrieved for each species by establishing that the presence of a CRE significantly increased the likelihood of differential expression in response to at least one stimulus. The lexicons comprised 641 CREs in Arabidopsis and 856 in rice, and only 78 CREs were shared between the two. The lexicons were then characterized by strength and breadth of response, occurrence frequency, sequence complexity, and sequence conservation between the two species. In Arabidopsis, the evening element (EE) showed the strongest response, to a cold stress transcriptome (p-value 10^-99). In rice, the element AAACCCTA showed the strongest response, to a tissue-specific transcriptome (p-value 10^-79). The breadth of response differed between the two species because different numbers of transcriptomes were used for each. The elements AAACCCTA and GCGGCGGA were significantly correlated with 197 and 58 transcriptomes in Arabidopsis and rice, respectively. At the other end of the breadth scale, many CREs had a very restricted response: 291 CREs in Arabidopsis and 258 in rice were significantly correlated with a single stimulus.
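The search described above can be sketched as follows. The abstract does not name the statistical test, so this example assumes a one-sided hypergeometric (Fisher-type) enrichment test; the function names, gene identifiers, and toy promoter sequences are hypothetical, and only one strand is scanned.

```python
from itertools import product
from math import comb

K = 8
# Every possible 8 bp word over the DNA alphabet: 4**8 = 65,536.
ALL_WORDS = ["".join(p) for p in product("ACGT", repeat=K)]

def enrichment_p(promoters, responsive, word):
    """One-sided hypergeometric p-value (an assumed test, not confirmed by
    the source) that genes whose promoter contains `word` are
    over-represented among the responsive genes.

    promoters:  dict mapping gene id -> upstream promoter sequence
    responsive: set of gene ids differentially expressed under one stimulus
    """
    total = len(promoters)
    with_word = {g for g, seq in promoters.items() if word in seq}
    n_word = len(with_word)
    n_resp = sum(1 for g in promoters if g in responsive)
    overlap = len(with_word & responsive)
    # P(X >= overlap) when n_word genes are drawn from `total` genes,
    # n_resp of which are responsive.
    upper = min(n_word, n_resp)
    tail = sum(
        comb(n_resp, x) * comb(total - n_resp, n_word - x)
        for x in range(overlap, upper + 1)
    )
    return tail / comb(total, n_word)

# Toy data: three of six promoters carry AAACCCTA, and exactly those
# three genes respond to the (hypothetical) stimulus.
toy_promoters = {
    "g1": "TTAAACCCTAGG",
    "g2": "AAACCCTACCGG",
    "g3": "GGGGAAACCCTA",
    "g4": "TTTTGGGGCCCC",
    "g5": "ACGTACGTACGT",
    "g6": "CCCCCCCCCCCC",
}
toy_responsive = {"g1", "g2", "g3"}
p_value = enrichment_p(toy_promoters, toy_responsive, "AAACCCTA")
print(p_value)
```

With 65,536 words tested against many transcriptomes, a real analysis would also need a multiple-testing correction; the abstract does not state which correction, if any, was applied.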
Occurrence frequency analysis revealed that the most abundant CREs in Arabidopsis and rice genes were the TATA box and TATA-box-like CREs. The structure of the CREs in the lexicon also varied. CREs were distributed across seven levels of complexity: level one comprised CREs with eight copies of the same nucleotide, and level seven comprised CREs with two copies of the same nucleotide. In Arabidopsis, 314 of the 641 CREs were of level 6 complexity, meaning three copies of the same nucleotide. In rice, the largest share of the lexicon, 263 CREs, was of level 5 complexity, meaning four copies of the same nucleotide. Each CRE in the lexicon was correlated with at least one experimental condition in the differential gene expression data, but many were correlated with multiple, often related, conditions such as drought, temperature, and salinity. Each CRE was therefore assigned a “meaning”, i.e. its associated stimuli, providing a kind of CRE function dictionary in addition to the lexicon itself. Many CREs possessed several different meanings (termed homographs in language), and in many cases the meanings of different CREs overlapped, like synonyms in a language. Shared meanings (synonymy) occurred often, but not always, among CREs with strong sequence similarity (homonyms or homophones). Treating these as linguistic properties, CRE homonymy and synonymy were used to explore the hypothesis that “all CRE synonyms are also homonyms and all CRE homonyms are also synonyms.” To this end, each CRE was compared with all possible CREs differing from it by a single letter; such CREs were considered homonyms. The meanings of the CREs were converted to a matrix of stimuli to generate clusters of synonyms, which were then analyzed for similarity of spelling (sequence). This analysis showed that not all homonyms are synonyms, but most synonyms are homonyms.
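The complexity levels and the one-letter-mismatch definition of homonyms can be illustrated with a short sketch. It assumes that "copies of the same nucleotide" refers to the count of the most frequent nucleotide in the word, which is consistent with the levels quoted above (level 6 for three copies, level 5 for four); the function names are hypothetical.

```python
from collections import Counter

def complexity_level(word):
    """Complexity level of a word under the assumed reading: for an 8 bp
    word, level 1 means one nucleotide repeated 8 times and level 7 means
    no nucleotide occurs more than twice."""
    return len(word) + 1 - max(Counter(word).values())

def one_mismatch_neighbors(word, alphabet="ACGT"):
    """All words that differ from `word` at exactly one position
    (3 substitutions x 8 positions = 24 candidates for an 8 bp word)."""
    return [
        word[:i] + base + word[i + 1:]
        for i in range(len(word))
        for base in alphabet
        if base != word[i]
    ]

level = complexity_level("AAACCCTA")           # A occurs four times
homonyms = one_mismatch_neighbors("AAACCCTA")  # candidate homonyms
print(level, len(homonyms))
```

Whether a candidate homonym is also a synonym would then depend on comparing the stimuli associated with each word, which this sketch does not attempt.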
Furthermore, despite a search of all one-letter mismatches among homonyms, many of the functional homonyms shared a smaller 4-5 bp core sequence and varied only at the flanks. That synonyms are homonyms in the language of promoters raises a question: how did this evolve? Duplication of transcription factors in the genome generated transcription factor families in which each member shares the same core domain, usually a DNA recognition site. We propose here that CREs also duplicate during gene duplication, building CRE families in parallel. Members of a CRE family may show different connectivity and affinity to individual members of a transcription factor family. In environmental sensing and developmental decision-making, this association of two families of interacting factors is called a dense overlapping region (DOR) and is a highly overrepresented network topology in biological systems. This also explains the degeneracy of initially discovered CREs. Only a portion of the nucleotide combinations implied by a degenerate CRE is bioactive; a degenerate CRE represents an overlap of different members of a CRE family, which arises as the CRE family expands and diversifies through compensatory mutations while the transcription factor family itself expands and diversifies. We also extensively studied CREs involved in abiotic stress and identified elements shared among abiotic stresses as well as stress-specific CREs. Furthermore, CREs follow a time-sensitive response rule: some CREs participate in regulating gene expression only during a certain period of exposure to the abiotic stress.




This dissertation is Open Access and may be downloaded by anyone.