Abstract
Keywords
Introduction
The ability to uncover and decipher transcriptional regulation systems constitutes an invaluable tool in molecular biology1,2 and represents a major challenge in bioinformatics.3–5 Transcriptional regulation is mediated mainly by a subset of proteins known as transcription factors (TFs) that bind DNA and can either hinder (repressors) or promote (activators) the formation of an open complex by the RNA-polymerase holoenzyme. 6 Transcription factors recognize a relatively small set of sites collectively known as the binding or sequence motif, 7 often represented using sequence logos. 8 The semi-specific recognition of binding sites by their cognate transcription factors makes it possible to implement computational tools for the discovery and detection of TF-binding motifs and sites. 1 Motif discovery methods focus on the identification of overrepresented patterns in groups of sequences to generate a motif description.3,9 Conversely, site search algorithms take in a motif description and use pattern matching techniques to search for putative sites on DNA sequences.9,10 Scanning genomic sequences in this way leads to a noisy but informative reconstruction of transcriptional regulatory networks, which can be later validated by in vitro and in vivo methods,11–13 or linked to other sources of information in order to reconstruct regulatory networks.1,14–16
In the past, numerous studies have exploited comparative genomics approaches to analyze the composition and conservation of transcriptional regulatory networks or regulons. These studies can yield important insights into several facets of transcriptional regulatory networks that are difficult to approach experimentally. By assessing the spread of a given regulatory signal, for instance, one can infer the ancestry and biological relevance of a regulatory mechanism.17,18 Similarly, the analysis of the genetic makeup of a regulon across different species can shed light on its core evolutionarily conserved components and reveal previously unidentified regulon members.11,12,19 Comparative genomics approaches to regulatory network analysis rely on the basic notion that regulons are composite entities that aggregate four different elements: the regulatory TF, a biological function, the network of regulated genes and the motif recognized by the TF. Most comparative genomics approaches to regulon analysis make the implicit assumption that both the regulatory TF and its biological function are preserved. Motif discovery techniques relying on comparative genomics typically also assume that both the network of genes and the TF-binding motif are preserved in order to apply phylogenetic footprinting techniques to enhance motif discovery.20,21 In contrast, regulon analysis techniques based on site search assume only conservation in the TF-binding motif and seek to elucidate variations in regulon composition.1,11,12,17–19,22
Forfeiting the requirement of network conservation makes site search-based analyses of regulatory networks by comparative genomics implicitly dependent on an initial description of the TF-binding motif. This description is typically based on a model organism in which a substantial number of sites12,17,19 or regulated genes17,18 is known from previous experimental work. In the latter case, a motif discovery tool is applied to gene promoter regions to generate a candidate motif to start the multi-genome search. More recently, mutational analysis of a single site has been proposed to construct a viable TF-binding motif. 13 Still, these strategies require the generation or availability of previous experimental knowledge in a model organism. This is inconvenient because this model organism can sometimes be relatively distant from the clade of interest, casting doubts on the underlying hypothesis of TF-binding motif conservation.
Many prokaryotic transcriptional networks can be described in terms of the single-input module (SIM) paradigm or as variations and elaborations of this basic configuration.23,24 In this connection paradigm, a single regulator controls the temporal activation of several cis-regulated genes. 24 It has been observed previously that the master transcription factor of a SIM is often self-regulated, 23 and that self-regulation is even more prevalent in repressor-based SIMs. 25 The bacterial SOS response is a well-known example of a self-regulated SIM regulatory network. 26 The SOS response regulates a variable number of genes that are under direct transcriptional control of the LexA repressor. 27
In
The LexA protein has been shown to recognize an unusually large repertoire of binding motifs across the Bacteria domain, with more than 15 distinct motifs described to date. 27 This variety in binding motifs is associated with substantial diversity in regulon composition, which has been mapped in some bacterial classes through comparative genomics approaches.11,22 The evidence compiled thus far through experimental and in silico techniques suggests that there is a small set of genes persistently regulated by LexA in most bacteria. 27
This core LexA regulon comprises the
The ever-growing abundance of genomic and meta-genomic data ensures that, within a given phylogenetic group, many sequences encoding orthologs of the same transcription factor will be readily available. By coupling the assumption of self-regulation to that of motif conservation, one can theoretically forgo the need for experimental knowledge in a model organism. Motif discovery algorithms can be applied to the upstream region of the orthologous genes encoding the transcription factor of interest in order to generate a candidate motif to conduct site search-based analysis of a simple transcriptional network. Taking advantage of the recent availability of experimental data on the SOS transcriptional network in several Gram-positive bacteria, here we provide proof of concept for this approach and we compare it against the conventional method based on extension from a single experimental model organism. Our results reveal that this approach is powerful enough to generate de novo transcriptional network maps, which can be used for functional annotation. Furthermore, we show that the use of a phylogenetically-broad sampling base for motif discovery can yield robust motifs for site search, generating more consistent results than the conventional methodology. We also show that the necessary steps of this approach can be extended to other transcriptional repressors and tool suites. Finally, we use this approach to generate for the first time a systematic mapping of the core LexA regulon in Gram-positive bacteria. This map reveals distinct patterns of LexA regulon composition between Firmicutes and Actinobacteria and supports the notion that cell-division inhibition is persistently regulated by the SOS response in these bacterial groups.
Algorithms and Datasets
Identification of Transcription Factor Homologs
Homologs of the master transcription factor of the genetic system under analysis were identified as best bidirectional BLAST hits36,37 on a balanced set of genomes from the clade of interest. This set of genomes was generated by selecting at least one, and no more than two, species for every major genus within the clade under study. The intent of this strategy was to maximize coverage while avoiding biases in representation due to the uneven distribution of sequencing projects among genera, which could distort the ensuing motif discovery process. Species were selected using the Integrated Microbial Genomes (IMG) system of the Joint Genome Institute, 38 and the BLAST searches were run on the IMG BLASTP service (http://img.jgi.doe.gov/) using the protein sequences of well-established homologs for each phylum/class analyzed as queries and an e-value cutoff of 10^-20.
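The best bidirectional hit criterion can be illustrated with a minimal Python sketch. It assumes that the best BLASTP hit of every query has already been parsed into a dictionary; the table layout, the function name and the gene identifiers are hypothetical, and the default cutoff reads the e-value threshold quoted above as 10^-20.

```python
from typing import Dict, Tuple

# Each table maps a query protein ID to (best subject ID, e-value).
# Parsing BLASTP output into this layout is assumed and not shown.
HitTable = Dict[str, Tuple[str, float]]


def best_bidirectional_hits(a_vs_b: HitTable, b_vs_a: HitTable,
                            max_evalue: float = 1e-20) -> Dict[str, str]:
    """Return {a_id: b_id} where each protein is the other's best hit."""
    pairs = {}
    for a_id, (b_id, evalue_ab) in a_vs_b.items():
        if evalue_ab > max_evalue:
            continue  # forward hit is not significant enough
        back = b_vs_a.get(b_id)
        if back is None:
            continue  # no reverse search result for this subject
        back_id, evalue_ba = back
        if back_id == a_id and evalue_ba <= max_evalue:
            pairs[a_id] = b_id
    return pairs


# Toy data with hypothetical identifiers.
forward = {"lexA_ref": ("geneX", 1e-45), "recA_ref": ("geneY", 1e-5)}
reverse = {"geneX": ("lexA_ref", 1e-44), "geneY": ("recA_ref", 1e-6)}
bbh = best_bidirectional_hits(forward, reverse)
```

In this toy example only the first pair is retained, since the second pair is reciprocal but fails the e-value cutoff.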
Motif Discovery
For each of the identified transcription factor gene homologues, the region 250 bp upstream of the predicted translation start site, which is known to harbor most promoter elements in bacteria, 39 was extracted using the IMG export service. These 250 bp regions were then fed into the MEME service of the National Biomedical Computation Resource (http://meme.nbcr.net/) using any number of occurrences for a single motif, a predefined 10–20 bp motif length and otherwise standard parameters. Previous work has shown that most bacterial transcription factors target motifs in the 10–20 bp range 40 and a variable number of occurrences is required to factor in the multiplicity of binding sites described for many bacterial promoters. 41 Whenever the best motif identified by MEME was found to be a palindrome, motif discovery was repeated on a single strand with the palindrome-only option set to refine the model. Alternatively, motif discovery on these same regions was performed using the PhyloGibbs Online service (http://www.phylogibbs.unibas.ch/) with default parameters. A basic phylogeny was estimated with the WUR CLUSTALW server (http://www.bioinformatics.nl/tools/clustalw.html) using the protein sequences of the transcription factor for the CLUSTALW alignment20,42 and provided to the PhyloGibbs Online service as a Newick-formatted tree file. Because PhyloGibbs does not allow a variable motif width, we conservatively set motif width to the maximum value used for the MEME experiments (20 bp). For the purposes of the comparative genomics analysis, the model was further refined by using the MEME-inferred motif to search the upstream regions again with FITOM (see below) for additional binding sites. Search results scoring above a cutoff of two standard deviations below the mean score of the MEME-inferred collection were considered putative binding sites and added to an expanded collection used as the standard in subsequent genomic searches.
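The collection-expansion step described at the end of this paragraph amounts to a simple score cutoff. The following Python sketch shows one possible reading of it; the function names, the toy scores and the choice of the sample standard deviation are assumptions made for illustration, and the scores themselves would come from the site search program.

```python
from statistics import mean, stdev
from typing import List, Tuple


def expansion_cutoff(motif_site_scores: List[float], n_sd: float = 2.0) -> float:
    """Cutoff placed n_sd standard deviations below the mean site score."""
    # Assumption: sample standard deviation; the source does not specify which.
    return mean(motif_site_scores) - n_sd * stdev(motif_site_scores)


def expand_collection(motif_site_scores: List[float],
                      candidates: List[Tuple[str, float]],
                      n_sd: float = 2.0) -> List[str]:
    """Keep candidate sites whose score is greater than the cutoff."""
    cutoff = expansion_cutoff(motif_site_scores, n_sd)
    return [site for site, score in candidates if score > cutoff]


# Toy scores for the sites in a discovered motif (hypothetical values).
scores = [18.0, 20.0, 22.0]
cutoff = expansion_cutoff(scores)
added = expand_collection(scores, [("siteA", 17.0), ("siteB", 15.5)])
```

With these toy values the mean is 20 and the standard deviation is 2, so the cutoff is 16 and only the first candidate is added.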
Binding Site Search
In silico searches of putative binding sites were performed with FITOM 10 and xFITOM 43 (http://compbio.umbc.edu/software). These programs take in a collection of known sites, from which they derive a Position-Specific Frequency Matrix (PSFM). Different scoring methods based on information theory can then be applied to search a given target sequence and the programs annotate results based on the proximity of candidate sites to gene regulatory regions. The searches reported here were all conducted using the sequence information content (
Benchmarking of the site search process was performed using collections of experimentally validated sites32–35 as reference for different genomes. For any given genome, searches were run using different collections of sites to define the search motif. Receiver Operating Characteristic (ROC) curves were then generated by plotting the percentage of experimentally validated sites (true positives) with respect to the percentage of non-experimentally validated sites (false positives) detected by the search process when using different thresholds. ROC-curves are shown only for the high-specificity thresholds typically used in site search.
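The construction of such ROC curves can be sketched in a few lines of Python. The example assumes that each site reported by the search carries a score and a flag stating whether it has been experimentally validated; the function name, data layout and toy values are hypothetical.

```python
from typing import List, Tuple


def roc_points(scored_sites: List[Tuple[float, bool]],
               thresholds: List[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) for each threshold."""
    n_pos = sum(1 for _, validated in scored_sites if validated)
    n_neg = len(scored_sites) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both validated and non-validated sites")
    points = []
    for t in thresholds:
        # A site is detected when its score reaches the threshold.
        tp = sum(1 for s, v in scored_sites if v and s >= t)
        fp = sum(1 for s, v in scored_sites if not v and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points


# (score, experimentally validated?) for the sites found in one genome.
sites = [(20.0, True), (18.0, True), (17.0, False), (15.0, True), (12.0, False)]
points = roc_points(sites, thresholds=[19.0, 16.0, 10.0])
```

Restricting the thresholds to high values reproduces the high-specificity region of the curve, which is the region of interest for site search.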
Genome Sequences
A set of representative species from a phylogenetic group of interest was selected to perform the comparative genomics analyses. Representative species were chosen to include those in which binding sites for the transcription factor of interest had been experimentally reported as well as species with available reference sequences in the NCBI RefSeq database comprising all major orders within the group, while including a relatively low total number of species to allow detailed analyses of results. Genome sequences for all the selected species were downloaded from the NCBI GenBank database.
Experimental Datasets
Collections of LexA-binding sites for
Comparative Genomics Analysis
To perform the comparative genomics analysis, searches with the inferred binding motif were carried out using xFITOM on all genome sequences selected for analysis. Search results were then parsed sequentially, going systematically from high- to low-scoring sites across all genome results files. A threshold of two standard deviations below the average score of the inferred binding motif was applied initially to select candidate sites. For each candidate site, homologues of the putatively regulated genes were identified as best reciprocal BLAST hits against the other selected bacterial genomes. Identified homologues were then mapped back to the corresponding results file. Whenever a gene was found to be putatively regulated in a new genome, the threshold for that particular gene was lowered by one additional standard deviation, down to a limit of six standard deviations below the mean when putative evidence of regulation had been established in five or more species. This lower threshold was chosen because it identified binding sites in 95% of the gene upstream regions used for LexA motif discovery. A gene was considered to be putatively regulated if binding sites meeting the above criteria were located upstream of its orthologs in at least two different species. 46
The complete process for parsing search results files, identifying gene homologues as best reciprocal BLAST hits and assessing putative regulation (Fig. 1) was automated using custom Perl scripts. The validity of this comparative genomics approach was qualitatively assessed using the RegPredict Regulon Inference service (http://regpredict.lbl.gov/) with default parameters. 47
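The gene-specific threshold relaxation can be illustrated with a short Python sketch. It is not the Perl implementation used in this work. It assumes one reading of the rule: the first species with evidence of regulation is found at the initial two-deviation threshold, and each further species lowers the threshold by one deviation, reaching six deviations at five species. All names and values are hypothetical.

```python
from typing import Set


def threshold_sd_multiplier(n_species_with_evidence: int,
                            base_sd: int = 2, max_sd: int = 6) -> int:
    """Standard deviations below the mean used as the cutoff for one gene."""
    # Assumption: the first species does not relax the threshold.
    extra = max(0, n_species_with_evidence - 1)
    return min(max_sd, base_sd + extra)


def gene_score_cutoff(motif_mean: float, motif_sd: float,
                      n_species_with_evidence: int) -> float:
    """Score cutoff applied to sites upstream of orthologs of one gene."""
    return motif_mean - threshold_sd_multiplier(n_species_with_evidence) * motif_sd


def is_putatively_regulated(species_with_sites: Set[str]) -> bool:
    """A gene qualifies when sites are found in at least two species."""
    return len(species_with_sites) >= 2


multipliers = [threshold_sd_multiplier(n) for n in range(8)]
cutoff = gene_score_cutoff(motif_mean=20.0, motif_sd=2.0, n_species_with_evidence=3)
```

Under this reading, a gene with evidence in three species is searched at four standard deviations below the mean, which gives a cutoff of 12 for the toy motif above.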
Schematic representation of the comparative genomics approach used in this work. (A) Motif discovery. Self-regulation is assumed for the transcription factor of interest, which is identified univocally in a particular genome. A uniform sample of genome sequences from a given phylogenetic group is selected and multiple homologues of the transcription factor are identified in these genome sequences as best-bidirectional BLAST hits using the TF of interest as the starting query. The upstream sequences of the genes coding for the TF homologues are retrieved and a motif discovery algorithm is applied to them. The resulting best motif model is refined by exploiting its palindromic nature and the collection of sites that compose it is expanded by re-searching the gene upstream regions for additional sites. (B) Comparative genomics.
Results and Discussion
Generation of the Gram-Positive LexA-binding Motif
This work explores the feasibility of applying two simple assumptions regarding a predicted transcription factor (self-regulation and motif conservation) to take advantage of the availability of genomic and meta-genomic data in order to yield a first-order map of its transcriptional network. As a test case, we use the LexA protein of Gram-positive bacteria, for which there is comprehensive experimental data available on several organisms. An obvious first step in pursuing this goal is to generate a valid candidate TF-binding motif to initiate the search procedure in target genomes. Based on the above assumptions, a candidate TF-binding motif can be obtained by applying a motif discovery algorithm to the promoter region of homologues of the transcription factor under analysis. Here we make extensive use of the standard motif discovery algorithm MEME, 45 but the same kind of analysis may be performed with motif discovery algorithms that incorporate phylogenetic information.20,21,48
To identify the LexA-binding motif, we selected a representative sample of 67 Gram-positive genomes (Supplementary data 1) and we identified LexA homologues as best-bidirectional BLAST hits using the
LexA-binding motifs. (A) Best-scoring motif reported by MEME based on 50 sites, (B) search-expanded motif encompassing 71 sites, (C) motif from experimental sites in
We assessed the dependency of the motif discovery approach on the number of sequences by randomly sampling the 58 promoter regions in groups of 48, 24, 12, 6 and 3 sequences and using either the MEME or PhyloGibbs motif discovery services. Even with the more restrictive parameter settings of PhyloGibbs, the LexA-binding motif was identified routinely using as few as 6 sequences, with motif discovery stabilizing fully at 48 sequences (Supplementary data 5 and Supplementary data 6). We also performed a qualitative analysis of the generality of the motif discovery approach by applying it to other self-regulated transcription factors in different phyla. We analyzed three additional transcriptional repressors (Rex, HrcA and TyrR) in, respectively, the Actinobacteria, the Firmicutes and the Gamma Proteobacteria, and we were able to infer the reported experimental motifs for all three,49–51 indicating that this type of analysis can be extended to other transcriptional systems (Supplementary data 7, Supplementary data 8, Supplementary data 9).
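The random subsampling used to test the dependency on sequence number can be sketched as follows in Python. The number of replicates per group size, the fixed seed and the promoter identifiers are assumptions made for illustration; each resulting subset would then be submitted to the motif discovery service.

```python
import random
from typing import Dict, List, Sequence


def subsample_promoters(promoters: Sequence[str], sizes: Sequence[int],
                        replicates: int, seed: int = 0) -> Dict[int, List[List[str]]]:
    """Draw random subsets, without replacement, for every group size."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    pool = list(promoters)
    return {size: [rng.sample(pool, size) for _ in range(replicates)]
            for size in sizes}


# Hypothetical identifiers standing in for the 58 promoter regions.
promoter_ids = [f"promoter_{i:02d}" for i in range(58)]
subsets = subsample_promoters(promoter_ids, sizes=[48, 24, 12, 6, 3], replicates=3)
```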
Search Performance of the LexA-binding Motif
Conventional approaches to regulatory network analysis by comparative genomics typically exploit experimental data, either in the form of binding site collections or known regulated genes, in a single reference species.11,12,17–19 A foreseeable problem with this approach is the progressive unreliability of the experimental motif as the phylogenetic distance between source and target species increases. Here we decided to evaluate the impact of this effect on a phylogenetically broad group of bacteria (Gram-positive bacteria) and we analyzed whether our approach, based on a single-gene multi-species derived motif, might also be subject to a similar effect.
We used published experimental results on the composition of the LexA regulatory network in four different Gram-positive bacterial species (the Firmicutes
ROC curves for search efficiency with experimentally-validated and MEME-derived collections on four different genomes corresponding to the Firmicutes (left) and Actinobacteria (right).
Differences in the specific shape of the Gram-positive LexA-binding motif have been noticed before,32,33,35 but their specific evolutionary relationship and their impact on search efficiency had not been assessed directly. Our results show that differences in the LexA-binding motif of Gram-positive bacteria stem mainly from the evolutionary split between Firmicutes and Actinobacteria. Furthermore, the ROC curves demonstrate that the differences observed among LexA-binding motifs have a definite impact on search efficiency. The sequence logos shown in Figure 2 illustrate how the MEME-inferred motif combines traits of both the Firmicutes (dominance of dyad central positions) and Actinobacteria (importance of spacer and adjacent positions) that allow it to perform well on both phyla. These results thus support the use of a phylogenetically broad sample for motif discovery when conducting comparative genomics analysis, as this leads to a generic motif that can achieve competitive search efficiencies in all target genomes.
Comparative Genomics of the Gram-Positive SOS Network
Composition of the LexA regulon in Gram-positive bacteria.
Only two genes,
The comparative analyses of search efficiency reveal a consistent phylogenetic split between Firmicutes and Actinobacteria at the LexA-binding motif level (Figs. 2 and 3). This phylogenetic divide is also clearly visible in the repertoire of repair genes regulated by the SOS response in both clades (Table 1). The Firmicutes, for instance, maintain LexA-regulation of the excision repair
In contrast with the Actinobacteria, the Firmicutes appear to lack regulation of another hallmark of the
Regulation of Cell Division by the SOS Response
In
Our analysis suggests that this trend towards regulation of cell division by the SOS response through divergent pairing of a cell division suppressor with
Schematic representation of the genomic region encompassing the
Conclusion
This work analyzes the feasibility of exploiting basic assumptions of simple transcriptional networks in order to infer regulatory motifs from the vast amount of genomic and meta-genomic data available, and to reconstruct regulatory networks through comparative genomics using the inferred motif. Our results provide proof of concept for this approach using the SOS regulatory network of Gram-positive bacteria as a test case, paving the way for the development of automated methods that make use of the overabundance of sequence data for de novo inference of simple regulatory networks. Furthermore, benchmarking against experimental data suggests that inferred motifs may yield more robust search patterns. The analysis of the SOS response in Gram-positive bacteria shows clear differences in the composition of the LexA regulon between the two main groups of Gram-positive bacteria and reinforces the notion that part of the DNA repair machinery of the Actinobacteria is regulated independently of LexA. Finally, our study suggests that regulation of cell-division by the SOS response is prevalent in Gram-positive bacteria, providing further evidence of convergent evolution of this trait and pointing to interesting candidates for experimental research.
Author Contributions
Conceived and designed the experiments: IE. Analysed the data: JPC, FM, JRT, IE. Wrote the first draft of the manuscript: IE. Contributed to the writing of the manuscript: JPC, IE. Agree with manuscript results and conclusions: JPC, FM, JRT, IE. Jointly developed the structure and arguments for the paper: JPC, JRT, IE. Made critical revisions and approved final version: JPC, IE. All authors reviewed and approved of the final manuscript.
Funding
This work was supported by the UMBC Office of Research through a Special Research Assistantship/Initiative Support (SRAIS) award.
Competing Interests
Author(s) disclose no potential conflicts of interest.
