Abstract
Introduction
In bioinformatics, an enduring and fundamental question is how best to use an organism's genome sequence as well as prior knowledge of the DNA sequence preferences of transcription factors (TFs) in order to determine which TFs are responsible for an observed pattern gene expression differences between sample groups, such as tissues at different stages of disease and cells cultured in the presence or absence of a chemical stimulus.1–3 The general approach of computationally analyzing noncoding DNA sequences within 5’ (upstream) regions of differentially expressed gene sets to identify statistically overrepresented TF binding site (TFBS) sequence matches - known as TFBS enrichment analysis4–8 - has proved useful for identifying the gene regulatory mechanisms from transcriptome data.9–14 Databases such as MatBase,
15
TRANSFAC,
16
JASPAR,
17
UniPROBE,
18
Factorbook,
19
and FootprintDB
20
are rapidly accumulating position-nucleotide frequency matrices (PFMs) that represent the sequence preferences of individual TFs. This rapid accumulation is driven by high-throughput assays such as ChIP-seq and protein-binding microarrays and through the use of improved
Reflecting the importance of this problem, multiple computational approaches have been proposed for PFM-guided detection of enrichment of TFBS within gene-vicinal sequences.7,25–27 For the purpose of specificity, I define
Despite its discovery power, TFBS enrichment analysis using prior TF binding pattern information in the form of PFMs has a fundamental challenge that PFMs are highly variable in terms of their specificity for nucleotide sequences and in terms of the uncertainty of the composition of the corresponding PPMs.30,31 Within databases of TFBS sequence patterns, the numbers of representative binding sites from which individual vertebrate TF PFMs have been compiled can vary by four orders of magnitude, from half a dozen to tens of thousands of representative oligomer sequences.15–17 For cases of TFs with highly specific nucleotide affinity and/or very low sampling of representative binding site sequences, PFM counts of zero pose a problem in the standard PPM-based approach and necessitate the use of ad hoc pseudocounts to enable the scoring of nucleotide sequences that do not perfectly match the TFBS consensus sequence.32,33 Furthermore, because the precision of the PPM that is associated with a PFM depends directly on the number of representative binding site sequences used to compile the PFM,
30
TFBS enrichment analysis using only the PPM (and not taking into account the uncertainty in the PPM's structure) can be a source of both type I and type II errors. Finally, in order to assess the significance of a finding that the frequency of PPM sequence matches for a TF is statistically overrepresented for 5’ upstream sequences for a gene set versus for a background set of genes, it is necessary to quantify the
In this article, I describe a Bayesian approach to PFM-guided TFBS enrichment analysis, which produces samples from the posterior distribution of the number of TFBS for a given PFM, within a given sequence. The method incorporates an empirical prior on the per base pair TFBS frequency that is informed by the analysis of human TFBS from the ENCODE project (as opposed to the geometric prior used in a previous study 30 ). Finally, because the method is developed from an explicit joint probability model of all of the observables and model parameters, the method could be readily extended to incorporate other types of regulatory potential scores.38,39 I show empirical results from applying the new prior to estimate the number of TFBS for a synthetic set of promoter sequences in which representative TFBS sequences are introduced. The empirical results show that the new prior improves accuracy when compared to a previously proposed prior on the per-promoter number of TFBS.
Mathematical Preliminaries and Notation
For the purpose of detecting TFs that may control a given cluster of coexpressed genes, it is simplest to consider a single TF at a time; I use “TFX” as a generic symbol for this TF. (Although this article is focused on single-TF enrichment analysis for simplicity, the pairwise TF enrichment analysis could in principle be accommodated by extensions of the general approach described herein.) Consistent with a Bayesian approach, I start by framing the problem of PFM-based TFBS enrichment analysis in terms of a set of random variables including observations, nuisance parameters, and a single parameter - the number of TFBSs within a given set of gene promoters - whose distribution (conditioned on the observations) will ultimately be sampled. To do so, I introduce a bit of mathematical notation needed to define these random variables. It is convenient to denote the set of natural DNA nucleotides by integers, ⅅ = {1,2,3,4}, corresponding to A, C, G, and T (so the complementary nucleotide for nucleotide
The TFX is assumed to have a set of representative binding site sequences (numbering
= (1,…,
r and for which no two binding sites are overlapping by any number of nucleotides (even if the two binding sites have opposite orientations) will be allowed. Thus, the range of B is not the entirety of {-1,0,1}L, but a subset
⊂ {-1,0,1}L defined by the above constraints. In the Bayesian approach to TFBS enrichment analysis that described below, B = ∈||

The distribution of values of c for a collection of 4,528 vertebrate transcription factors from the TRANSFAC Professional 2013.4 and JASPAR 5.0 databases. The sharp peak in the distribution at c = 100 is due to the inclusion of motif matrix information for which the original sequence alignments are not available (such as from high-throughput
Importantly, the probability distribution on the number of binding sites will depend on the length of the DNA sequence being analyzed (longer combined regulatory sequences will, in general, contain more binding sites of a given type), and thus, the probability distribution for ||
A Bayesian approach to analyzing whether binding sites for TFX are enriched within sequences
r = (1, …,
‘r = (1, …,
) has unit
In the application of PFM-guided TFBS enrichment analysis, the observations
that maps a configuration β of binding sites to the set of nucleotide positions within
r that the binding sites occupy. Thus, U(β) is the
:
within one of the binding sites will correspond to a specific binding site orientation (1 or -1), and this correspondence will be denoted by a mapping
,
Random variables in the full probability model for TFBS enrichment analysis.
In the case of a reverse-orientation binding site, the PPM ϕ will correspond to the reverse complement of the nucleotide sequence within the binding site, in which case it is convenient to define a conditional complementation function
by
and which complements
,
Finally, in order to be able to model the joint probability of
. I represent the position-nucleotide counts for the sequence within all TFBS by a ω × 4 matrix σ whose elements are
and
. Because of the physical constraint that binding sites for TFX cannot overlap, it follows that
. In the next section, I introduce the statistical approach by defining a joint probability model.
Bayesian Approach to TFBS Enrichment Analysis
Having defined random variables to represent all of the observed information (
denotes a probability density. Eq. 12 can be derived from first principles based on the following independence assumptions:
The independence structure of Eq. 12 can be summarized in graphical model notation, 45 as shown in Figure 2. To make the joint probability model explicit, each of the conditional probabilities in Eq. 12 will be specified below.

Graphical model diagram of the independence assumptions shown in Eqs. 13–18. Each arrow denotes a relationship between a parent variable and a child variable. Collectively, the variables and arrows indicate conditional independence as follows: each variable × is independent of other variables, jointly conditioned on all parents of X.
Modeling 
For PFM-guided computational recognition of TFBS, a fundamental assumption is that the probability model for the counts of nucleotides outside of TFBS is independent of the probability model for the counts of nucleotides within TFBS.32,46 This means that the conditional probability of
(the TFBS) and corresponding to
(outside the TFBS). Conditioned on Ψ, the nucleotide probabilities at positions outside of TFBS, which are denoted by the random variables {Rl}l∈Lr\u(β) and the random variables {Sl}l∈L’, are assumed to satisfy
r\
’ (where iid denotes independent and identically distributed). Conditioned on ϕ and β, the nucleotide sequence probabilities at locations within the
that are independent and distributed as follows:
r\Û(β). Because the length of
. Given the uniform prior assumption for Ψ and the definition in Eqs. 9 and 10 and the assumptions in Eqs. 19–22, the conditional probability of
” section).
Modeling 
To account for uncertainty in the PFM
due to sampling from a finite (and in many cases, very limited) number of representative binding site sequences, the PFM is represented by a random variable
), are independent and multinomial distributed with a fixed number of trials.47,48 Because in some cases, some representative binding site sequences will be outside the core portion of the multiple alignment from which the PFM is tabulated, the row sums of
may in some cases be less than the count
of representative binding site sequences. Thus, to accommodate such cases, I denote by
the sum of the elements of row
. In terms of the row-specific counts
(for g ∈
), the conditional distribution of
immediately follows
The most common approach for selecting the prior probability
for the PPM is to choose a uniform prior, in which case
is just the constant (3!)ω. Although other authors have pointed out the possibility of using an empirical prior on ϕ,
30
it is nontrivial to collapse Φ by analytic integration over all ϕ, in the case of a nonuniform
, so here I assume a uniform
.
Modeling 
At a given base pair location with no binding sites nearby (and with no sequence information), I model the probability that there is a binding site - in a specific orientation - as λ/2. In the absence of sequence information, intuition would suggest treating the occurrence or absence of a binding site for TFX at each position in DNA as independent and identically distributed Bernoulli trials. However, because of the physical constraint that two binding sites are not allowed to overlap, each binding site (ie, each nonzero entry of β) affects the probability of a binding site at nearby positions. Specifically, each binding site prevents the possibility of an overlapping binding site (in either orientation) at w - 1 bp positions, and for an additional 2(ω - 1) flanking positions, a binding site is only possible in one orientation. Thus, the probability model consistent with the physical constraints would be
is function that is implicitly defined by the law of total probability for
The prior distribution
reflects the range and relative probability of different Λ values for TFX, before the sequence
is important because for real-world applications, it can exert a significant effect on the distribution of Λ|r, c. For mammals, the prior
can be formulated empirically using binding site frequencies (per base pair of noncoding, gene-vicinal DNA sequence) for 620 human TF ChIP-seq experiments (comprising 119 distinct TFs) obtained from the ENCODE project.
43
For each ChIP-seq experiment, binding sites within regions of noncoding DNA within -1500 to +500 bp transcription start sites of VEGA transcripts (from Ensembl Release 75, GRCh37 assembly coordinates) were mapped, using ChIP-seq peak data that were peak-called using the SPP program
49
and for which the data files were downloaded from the ENCODE data access page at the European Bioinformatics Institute from the June 2012 release (http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_datajan2011/byDataType/peaks/june2012/spp/optimal/) in narrowPeak format. The counts of binding sites within these noncoding regions were fit to a Poisson model parameterized by a binding site frequency λ per base pair of DNA; for each ChIP-seq experiment, a λ estimate was obtained using maximum likelihood. The resulting histogram of λ estimates is well-described by a beta distribution, as shown in Figure 3, with parameters as given in Table 2. Thus, it is convenient to adopt a prior density.
where
can be approximated
where with our choice of λmax’ the second-order term can be neglected, resulting in a beta distribution-like dependence on λ,
for λ∈(0,λmax] From this equation, and integrating λ over the range from [0,λmax], we see that the probability model Eq. 26 corresponds to a prior

Distribution of frequencies of TFBS (per base pair of noncoding, gene-vicinal DNA sequence) for human transcription factors, based on the analysis of 620 ChIP-seq datasets from the ENCODE project. 43
Parameter estimate for the beta distribution model for the prior p(λ) on the binding site frequency per base pair, for human transcription factors.
In contrast to Eq. 31, Lähdesmäki and Shmulevich
50
used a geometric prior on, β

Plot of the change in log P(β) with the addition of a single binding site, as a function of β, for the incomplete beta function-based prior (Eq. 31) and the previously proposed geometric prior (Eq. 32). The △log
Obtaining the Distribution of B |r, s’,c
The second step in a Bayesian approach
44
is to condition on the observed data - in this case,
- and then obtain the conditional distribution of the parameter(s) of interest, in this case B. In order to be able to do so, starting from the joint probability model (Eq. 12), the nuisance parameters ϕ and λ must be either estimated or marginalized. A key advantage of the Bayesian approach is that we can take into account the probability distribution of
, in the process of eliminating Φ by marginalization. The parameter Ψ can be similarly marginalized, although the uncertainty in the background nucleotide frequency is generally very small for real-world applications in which
Given Eqs. 12, 23, and 25, the dependence of the integrand in Eq. 33 on ϕ and Ψ has the same algebraic form as the probability density function for independent Dirichlet random variables {Ψ,Φ1,…,Φω} as a consequence of the fact that the Dirichlet distribution is the conjugate prior for the multinomial distribution.
44
Thus, the two integrals in Eq. 33 can be evaluated analytically.
30
The desired joint conditional probability of Λ,
After performing the integrals in Eq. 33, using Eqs. 30 and 34, and then log-transforming, we have
is function that will not need to be evaluated. The parameter λ can then be marginalized by integration, yielding
, it will not be necessary to explicitly evaluate
. Now that we have an explicit formula for
up to additive terms that do not depend on β, it is possible to generate β samples from this distribution using Markov Chain Monte Carlo (MCMC) sampling.
MCMC approach
For sampling from
, the Metropolis-Hastings algorithm,
51
in which a probabilistic proposal generator g(β,β’) for a transition from β→β’ can be defined so as to optimize the acceptance rate for moves, is convenient. For the problem of TFBS enrichment detection, following the general approach used by Lahdesmaki et al for TFBS recognition, I use a two-stage proposal generator in which a base pair position l ∈
r is selected at random, and then, depending on the current state of β, binding site removal or addition (in the latter case, with a randomly selected value
r\[l] it is convenient to define some additional notation in order to make this conditional probability ratio explicit. Without loss of generality, let us assume that the current state for the
, a location l ∈
r such that βl = 0, and an orientation j ∈ {-1,1} such that the configuration β with βl = j would not violate the TFBS physical constraints. In order to simplify notation, I define a function H:
Lr ×
× {-1,1} ×
→
by
r I also define a function
by
Applying Eq. 35 to two different binding site configurations that differ by one binding site being present/absent at a specific location
r, and using the definitions of
and
, we obtain a closed-form expression for the log ratio of the conditional probability of there being a binding site at location
can be accomplished using the Metropolis-Hastings algorithm with the following proposal distribution:
, and where the weight exponent
, the entropy of the three conditional probabilities
and fixed l) is very small, and thus, from the standpoint of optimizing the acceptance rate, it is convenient to weight the generation of proposed moves toward moves with more even odds of move acceptance. Results from empirical testing with sequence lengths Lr = 2 × 104 suggest that a value q = 0.85 gives an acceptance rate of about 0.12, with increasing values of q increasing the acceptance rate. In this study, the Markov chain is initialized by iterating over l = 0 to l = Lr, for each l, setting βl to be the most probable configuration given Eq. 39, conditioned on β
r\{l} 0. Once the Markov chain has converged, at most positions l ∈
r, the entropy of the three conditional probabilities
and fixed l) is very small, and thus, from the standpoint of optimizing the acceptance rate, it is convenient to weight the generation of proposed moves toward moves with more even odds of move acceptance.
Empirical Results
Based on a direct comparison (Fig. 4) of the incomplete beta function prior (Eq. 31) and the previously proposed geometric prior (Eq. 32), it seems reasonable to suppose that, within the context of a Bayesian approach for PFM-based TFBS frequency estimation (as described in the “Obtaining the distribution of
” section) the incomplete beta function prior and the geometric prior might have different effects on the conditional distribution of the number of TFBS, ie, the samples of

Comparing the accuracies of two MCMC implementations of the Bayesian method for estimating the number of binding sites of a TF, based on the geometric prior (Eq. 32) and the incomplete beta functionbased prior (Eq. 31). For each combination of t (number of ground-truth sites) and type of prior, the bar denotes the median, the box denotes the interquartile range, and the whiskers are offset 1.5 interquartile range above or below the 75th and 25th percentiles.
Discussion
This study demonstrates the utility of incorporating an empirical prior on the TFBS frequency per base pair within the context of a Bayesian method for PFM-based TFBS enrichment analysis, but there are several aspects in which the work raises interesting questions that could be explored in future studies. First, in this work, a two-parameter parametric function has been fit to empirical data on the density distribution of frequencies of human TFBS per base pair of noncoding, gene-vicinal sequence. Thus, the results shown here do not reveal to what extent the estimated parameters for the distribution would generalize to TFBS frequencies in other species. At least for the mouse genome, available evidence from the modENCODE project suggests that overall, TF binding within promoter regions is highly conserved between human and mouse. 53 Moreover, for two TFs whose TFBS were assayed in five mammalian species by ChIP-seq, the numbers of genome-wide binding sites did not vary more than 2x between species. 22 Thus, it seems reasonable to expect that the A prior distribution (across TFs) would be similar, for gene-vicinal non-coding sequence. Nevertheless, in future work, it would be informative to estimate the hyperpriors a and V for human, mouse, fruit fly, and worm to enable a cross-species comparison. Second, it would be useful to characterize how the choice of q parameter affects the empirical performance of the MCMC approach used here, ie, the acceptance ratio, the number of steps required for burn-in, and the number of steps required between samples; it may be possible to significantly improve the speed of the proposed MCMC method through tuning q and the sampling parameters. Third, a key aspect to be explored is the extent to which the accuracy improvement with the incomplete beta function-dependent prior is associated with high-count versus low-count PFMs. Intuitively, it seems reasonable to suppose that for most TFs, an increase in the accuracy of the prior would be expected to have more of an effect on the posterior distribution of β when the PFM count is low, since a higher count PFM would be expected to have a much bigger likelihood ratio that would, in turn, be more likely to dominate over the prior on the number of TFBS.
Conclusions
This study presents a Bayesian approach to the bioinformatics problem of PFM-guided TFBS enrichment analysis. The method incorporates an empirical prior on the frequency distribution λ of binding sites for TFs that is based on genome location data from the ENCODE project. In addition, the method incorporates a probabilistic model for TFBS occurrence conditioned on the parameter λ that takes into account the finite width of the TFBS, in contrast to a previous approach in which the TFBS probability was assumed to have a geometric dependence with a fixed factor of 1/2. 30 The sampling equation for adding/removing a binding site (Eq. 39) could be easily extended to include other sources of information, such as a regulatory potential score derived from phylogenetic sequence conservation or from epigenetic measurements. The R software code implementing the MCMC method described in the “MCMC approach” section and the promoter analyses shown in Figure 5 is available at http://github.com/ramseylab/tfbsincbeta.
Author Contributions
Conceived and designed the experiments: SAR. Analyzed the data: SAR. Wrote the first draft of the manuscript: SAR. Contributed to the writing of the manuscript: SAR. Agree with manuscript results and conclusions: SAR. Jointly developed the structure and arguments for the paper: SAR. Made critical revisions and approved final version: SAR. The author reviewed and approved of the final manuscript.
