Abstract
Keywords
Introduction
Transcriptional regulation of gene expression is necessary for living cells to function, and therefore revealing the mechanisms behind this regulation is a central aim of systems biology. Such regulatory mechanisms are known to function via complex relationships among DNA, RNA and proteins, and to respond to intracellular signals and extracellular conditions to ensure proper gene expression. 1 One of the most common regulatory mechanisms involves protein-DNA interactions. The proteins that bind to a DNA sequence are known as transcription factors, and they are crucially involved in regulating their target genes. Detailed knowledge about how these transcription factors conduct this regulation is essential when inferring a gene regulatory network.
Investigations of gene regulatory systems or complex functional networks among DNA, RNA, proteins and other cellular components in a living cell conventionally follow a standard protocol. After a DNA sequence is completed, the mRNA level is measured by a cDNA microarray, to reveal the gene expression profiles under various conditions. With this information, the regulatory networks can be inferred by employing a number of approaches, including Boolean and Bayesian networks.2,3 I previously developed an approach based on a Graphical Gaussian Model (GGM) in combination with hierarchical clustering.4,5 Among the graphical models, GGM is the simplest in a mathematical sense, as the conditional independence between variables is estimated from the partial correlation coefficients. However, GGM infers only the undirected graph, whereas Boolean and Bayesian models infer the directed graph, which shows causality. Although all of these approaches are suitable for establishing relationships among genes, it is difficult to reveal the concrete interactions between proteins and genes, because of the insufficient information about proteins in the gene expression profiles. Thus, an alternative approach is needed when investigating transcriptional regulation that involves protein-DNA interactions.
One possible technique is Structural Equation Modeling (SEM). 6 SEM has been successfully used to elucidate causal relationships in disparate fields such as econometrics, sociology and psychology.7–9 In addition, it has been applied to Quantify Trait Loci (QTLs) for association and linkage mapping in biology,10,11 as well as to identify genetic networks from microarray data or SNP data.12–14 The significant features of SEM are the inclusion of latent variables into the constructed model and the ability to infer the network, including the cycle structure. Additionally, the linear relationships between the latent variables and the observed variables are assumed to minimize the differences between the fitted covariance matrix and the calculated sample covariance matrix.
To clarify the effects of latent variables in a transcriptional regulatory model, I selected a serial transcriptional regulatory system composed of GAL-related genes. This regulatory system in
Previous investigations have revealed that transcription factors recognize and bind to upstream DNA motif sequences to regulate the target genes.24,25 Some target genes have been empirically confirmed, and others were estimated by computational analyses.26,27 Analyses based on gene expression levels and motif sequences have indicated that various gene expression regulator mechanisms involve Mig1p and Gal4p. Although many genes have been estimated as the targets of those regulators, the regulation of other target genes remained unclear. Therefore, to characterize the entire mechanism of the serial transcriptional regulation, a network model that includes information on the relevant genes and proteins is necessary.
Here, I applied SEM to reveal a serial transcriptional regulation system that is mediated by transcription factors, by using information from numerous gene expression profiles. Since microarray data do not reflect protein expression, a network model that includes not only genes but also proteins should provide new information on the genetic regulatory architecture, assuming that the transcriptional regulation by the transcription factors involves physical interactions between the factors and the corresponding DNA. In this study, I used SEM to describe the transcription factors as latent variables and their effects. This method estimates not only the number of variables in the model, but also any significant interactions between them. The resulting gene expression profiles have clarified the complicated gene expression mechanism.
Methods
Outlier removal
The Grubb's test was used to identify and delete outlier gene expression data derived from the Gene Expression Omnibus (GEO). The Grubb's test was performed on a gene-by-gene basis for each experimental condition, to detect and delete the abnormal numerical data that are included as flags of experimental error. To quantitatively identify the outlier data, a
Here,
Data selection
A multilevel hierarchical regulation system between proteins and DNA is considered to be a good model for inferring protein-DNA interactions. To detect the hierarchical regulation in yeast, I utilized the TRANSFAC Database (http://biobase-international.com/index.php?id=transfac). TRNSFAC is one of the useful databases that include information about transcriptional regulation.
Among all registered yeast transcription factors in the TRANSFAC database, 108 genes were detected as encoding transcription factors, which regulate 252 other genes. Some of these regulated genes had empirical confirmation for their transcriptional regulation, but other regulated genes were estimated by a computational analysis of their 5′ UTR sequences. To compose the hierarchical regulation among the genes, I compiled 864 binomial relationships between the 108 transcription factor genes and the 252 regulated genes. The regulatory network with hierarchy was constructed from these binomial relationships. Among the constructed hierarchical structures, a 3 layered structure composed of 15 genes was the most complicated hierarchy. In this 3 layered hierarchical structure, many parts have been experimentally confirmed, but some parts still remain uncertain. The details of the components in this hierarchical structure are shown in Table 1.
Known binding sites and regulated genes of selected genes.
In the TRANSFAC database, the target genes of each transcription factor have been estimated from the binding site motif sequences. In addition, the database provides quality scores ranging from 1 to 6, which reflect the experimental reliability of a particular protein-DNA interaction. Table 1 lists the regulated genes and their respective quality scores for each transcription factor. Ten genes are known to be targets of Mig1p, but three of them have not been empirically confirmed as targets. Furthermore, for 5 genes among the ten genes, their binding locations and binding site sequences have not been identified. On the other hand, the five genes known to be targeted by Gal4p have experimentally confirmed binding sites. In the last stage of this serial transcriptional regulation, the transcription factor Gal80p targets two genes. Among the 14 regulated genes in Table 1, the quality scores of the two genes were 1, which represents experimentally confirmed transcriptional regulation. The quality scores of the putatively regulated genes were 2, meaning that the transcription factor protein has been confirmed to bind to the motif sequences of the regulated genes, but the transcriptional regulation has not been confirmed. Thus, the inference of the transcriptional regulation of these genes is informative for the functional confirmation of transcription factor binding.
I utilized an abundance of gene expression data for the 15 genes to reconstruct and clarify the 3-layered hierarchical structure composed of them. Since the genes within each layer are controlled by one transcription factor, this 3-layered structure is considered to represent serial transcriptional regulation passing through 3 stages, connected by transcription factors.
All gene expression profiles were obtained from
Factor analysis
A skeleton network structure of serial transcriptional regulations was constructed by using SEM. The framework of the network structure defined transcription factors as latent variables, and their coding genes and their target genes as observed variables. In this study, 15 genes were described as observed variables, but the number of latent variables was unknown. To find this number, factor analyses were used in each of the three stages of the serial transcriptional regulation.
Factor analysis is a statistical method for describing the variability among observed variables in terms of a potentially lower number of latent variables.
28
The initial assumption is that any observed variables may be related with any latent variables. Let us assume that there are
where
If there are
In the factor analysis model, the error terms ε are independent and multivariate normally distributed with a mean of zero, ε
The following assumptions are imposed on
The variance-covariance matrix between the observed variables ∑ is given by
The covariance matrix of the observation variables is structurized by the parameters. From this structurized matrix, the values of the partial regression weight matrix Λ and the variances of the “errors” ε are estimated.
To clarify the possible number of latent variables in the network, Exploratory Factor Analysis (EFA) was performed. EFA is utilized to reveal the latent structure by assuming that the observational data are a synthetic amount of a lower number of latent variables. The number of latent variables was suggested by a principal factor method with varimax rotation, which is a general method for rotating factors to fit a hypothesized structure of latent variables. To extract the number of latent variables as factors of observed variables, an eigen value >1 and a scree plot were applied, as usual. 28
The number of latent variables estimated by EFA may not correctly result in the network with latent variables. To determine the number of latent variables in each stage, I performed a Confirmatory Factor Analysis (CFA) for all numbers smaller than or equal to the number of estimated
Model assumptions
Regulatory network analysis by SEM consists of two parts: parameter fitting and structure fitting. Before parameter fitting, a network structure must be assumed. The framework of the network structure defined transcription factors as latent variables, and their coding genes and their target genes as observed variables. Based on their relationships, the transcription factors were considered as the predictor variables for the target genes, which in turn were represented as the criterion variables. Causality between transcription factors and their coding genes was treated inversely, with coding genes as the predictor variables and transcription factors as the criterion variables. Therefore, this SEM is a particular implementation of the Multiple Indicator Multiple Cause (MIMIC) model, with the latent variables and observed variables arranged alternately throughout the three transcription factor-regulated stages of the network.
Each gene was arranged as a variable in the network, according to the information registered in the TRANSFAC database. The three targeted transcription factors, Mig1p, Ga14p, and Ga180p, are known to be regulators of gene expression in the hierarchical transcription. I defined these proteins as latent variables. However, the number of latent variables in each stage and the details of the regulatory relationships between latent variables and observed variables were not initially known. Thus, I applied a modified four-step procedure, based on work by Mulaik and Millsap, which is known as a method for constructing models when there is no information about latent variables (Fig. 1). 29

Modified four-step procedure. Step 1, construction of unrestricted models for each stage of the serial transcription; Step 2, construction of measurement models for each stage; Step 3, construction of structural equation models for each stage; Step 4, stepwise modeling to connect the different structural equation models.
Step 1:
Step 2:
Step 3:
Step 4:
Structural Equation Modeling (SEM)
SEM is a comprehensive statistical model that includes two types of variables: observed and latent. These variables constitute the structural models that consider relationships between latent variables and the measurement models that consider relationships between the ob served variables and the latent variables. These relationships can be presented both algebraically, as a system of equations, and graphically, as path diagrams.
In this study, the target genes and the coding genes of the transcription factors in the transcriptional regulation were defined as observed variables, since their expression data were obtained from the GEO. Meanwhile, the transcription factors were defined as latent variables, owing to the lack of measured data in the expression profiles. All variables were classified as one of two types: exogenous variables and endogenous variables. Exogenous variables are those that are not regulated by other variables in the system. Endogenous variables, on the other hand, are. In my model, the mig1 gene is defined as an exogenous variable, while all other genes are defined as endogenous variables. All latent variables are endogenous variables, since they are regulated by coding genes. There are three possible relationships between the variables: relationships between latent variables, causal relationships between observed variables and latent variables, and regulatory relationships between latent variables and observed variables. The model is defined as follows:
Here, η is a vector of
From the above equations, I have
Each element of the covariance matrix model ∑(θ) is expressed as a function of the parameters that appear in the model. The unknown parameters were estimated, in order to minimize the difference between the model covariance matrix ∑(θ) and the sample covariance
The SEM software package SPSS AMOS 17.0 (IBM, USA) was used to fit the model to the data. The quality of the fit was estimated by four different model fitting scores: GFI, AGFI, CFI and RMSEA. These scores were considered to be useful to clarify the degree of model fitting in this study, since the model can be evaluated by a general threshold, rather than a huge experimental number.
Parameter estimation
To make the model covariance matrix ∑(θ) closer to the sample covariance matrix
Here, ∑(θ) is the estimated covariance matrix;
Results
Factor analysis in each stage
In the hypothetical network structure, 15 genes were described as observed variables, but the number of latent variables was unknown. To find this number, EFA was used in each of the three stages of the serial transcriptional regulation. After the number of transcription factors was determined, CFA was applied with a severer check. The check by CFA is required before the SEM analysis.28,29
EFA was applied to the ten genes regulated by Mig1p in the first stage, the five genes regulated by Gal4p in the second stage, and the two genes regulated by Gal80p in the last stage. In the first stage, four latent variables were extracted by EFA as the effective factors of ten regulated genes, and the parameters of the two models composed of one or two latent variables could be calculated by CFA. The remaining models composed of three or four latent variables cannot be calculated by the partial regression coefficients, even though the number of parameters was smaller than the number of equations. Thus, the limitation of the latent variables in the first stage was considered to be two. In the second stage, two latent variables were extracted by EFA, and the parameters in the model with all extracted latent variables were calculated. Hence, the limitation of the latent variables of the second stage was regarded as two. In contrast, only two regulated genes were included in the third stage. According to the restriction that the number of latent variables is smaller than the number of observed variables, the number of latent variables in the third stage was considered to be one. Actually, only one latent variable was extracted by EFA, and the parameters of the model composed of one latent variable could be calculated by CFA.
By a combination of EFA and CFA, the maximum number of latent variables in each stage was assumed to be two. Furthermore, the regulatory factors were expected to be encoded by only one gene in each stage of this hypothetical network structure. Thereby, the two factors were considered to be simple substances or complexes of multiple molecules. Although it is possible that the target genes are regulated by other factors, those factors are considered to be encoded by other genes that are outside of this hypothetical network in this study.
After the number of transcription factors was determined, the modified four-step procedure was applied in a stepwise manner. The structure of the resulting model was not simple. For example, the latent variables clearly had high loading for some observed variables, but low loading for others. Similarly, some observed variables were strongly affected by both latent variables, but others were influenced by only one.
Results of SEM
SEM was used to determine the relationships among the transcription factors and the corresponding coding genes in this serial transcriptional regulation. The target genes were considered to be physically bound by the transcription factors represented by latent variables. The regulatory and causal relationships between the latent and observed variables, and the relationships between latent variables were considered in each stage, but the relationships between observed variables were not. By connecting the observed variables, representing the coding genes, to the latent variables, representing the corresponding translated transcription factors, the stage separated models were combined to construct a model for the whole transcriptional regulation in the cell. To connect the different stages, all possible connections between one observed variable and the estimated latent variables were considered. This was done sequentially, with the first two stages being connected by SEM, followed by the third stage. Figure 2 shows the various network structure models between observed variables and latent variables, which molecularly describe a coding gene (gene) being transcribed and translated into one or two effective proteins (F1 and F2).

Possible relationships between variables. The relationships between one observed variable (gene) and two estimated latent variables (F1 and F2). Based on empirical studies, two possible causal relationship directions exist: the observed variable to the latent variables, and one latent variable to the other latent variable. (
All possible models were evaluated in terms of their goodness-of-fit scores by using the goodness-of-fit index (GFI), which measures the relative discrepancy between the empirical data and the modeled network, and the adjusted GFI (AGFI), which is a GFI modified according to the degrees of freedom. Furthermore, I used CFI and RMSEA as fitting scores to evaluate the model fitting. Since these indices have threshold values as criteria to decide whether the model is suited to obtain data independent of a huge sample number, they are considered to be useful to clarify the degree of model fitting in this study. By evaluating the attached models in terms of their goodness-of-fit scores and expanding the network model in a stepwise manner, I constructed a complete model of the serial transcriptional regulation composed of the GAL-related genes.
In the first and second stages, two latent variables were chosen as the regulators of the target genes. The associations from those latent variables to the observed variable had been estimated in the first two steps of the four-step-procedure, and thus the models for inferring the connected part were constructed while maintaining the estimated associations. According to Figure 2, I constructed 7 models while maintaining the association from the latent variables to the observed variables. There was no difference in the goodness-of-fit scores of the 7 constructed models in the first step, and all of the goodness-of-fit scores were low, such as GFI < 0.9 and RMSEA > 0.5. To select the best model, I modified the models as follows: (1) Deletion of the associations between the variables with
Figure 3 shows the estimated relationships between the genes and their corresponding transcription factors. The regression weights for each path are shown in Table 2. Table 2 shows all of the significant regression weights (
Estimated regression weights (

Inferred network model of the GAL regulatory system. (
The relative strength of each association is shown as a standardized regression weight in Table 2. Interestingly, the standardized regression weight of the association from F1 to F2 was negative, even though both F1 and F2 were considered as proteins. This is one of the features of the SEM analysis. The negative interactions between the observed variables are summarized as the existence of a negative interaction between an observed variable and a latent variable. Thus, the negative relationship between F1 and F2 does not indicate the negative association from F1 to F2, but the negative relationships between YGL035C and the target genes of F2.
In the first stage, 6 target genes were regulated by F1 and 5 other genes were regulated by F2, although YIL162W was regulated by both F1 and F2. Among the 6 genes that were regulated by F1, 3 genes have identified Mig1p binding sites in Table 1; the binding sites of YMR011W are −504 to −493, and −427 to −416; the binding sites of YLR044C are −505 to −483, and −451 to −426; and the binding sites of YIL162W are −505 to −483, and −451 to −426. In contrast, the known binding sites of the F2 regulated genes were closer to the transcription start points.
The latent variable F4 in Figure 3(A) shows that the latent variable in the third stage was combined with one latent variable in the second stage. The association from YML051W to one latent variable in the second stage was revealed by the MI scores during the model modification step. Furthermore, the probabilities of the associations between the third latent variable and the target genes were not significant (
Discussion
The scheme of the entire transcriptional regulation composed of GAL-related genes is shown in Figure 4. In the first two stages, the transcription factors are determined by two latent variables. The transcription factor in the last stage, Ga180p, appears as one of the two latent variables of Ga14p. This suggests that some latent variables represent protein complexes. This is no surprise, as Gal80p is known to form a complex with Gal4p.33–37 Similarly, Mig1p is also thought to form a complex with Hxk2p,38,39 although the details of this complex are unclear.

Biological interpretation of the factors. (
According to my inferred network, one of the two transcription factors in the first stage is Mig1p, which is encoded by the YGL035C gene. The known binding sites of F1's target genes were far from the transcription start point, such as around −500, even though there is no information about the Mig1p binding sites for the remaining 3 genes. In addition, the feature of the Mig1p binding position was not detected for F2's target genes. It is suspected the differences between the latent variables were caused by the Mig1p binding site. Similarly, the unknown binding sites of Mig1p in F1's target genes may be expected to be around −500.
The negative relationship between F1 and F2 can be regarded as the negative relationships between F2 and its target genes. The expression of all of F2's target genes is known to be induced by glucose under poor conditions, even though the relationship with glucose for some of F1's target genes was not revealed. Actually, Mig1p is known to be active and bound to other genes under glucose-rich conditions, to repress its target genes.40–43 Thus, the associations between F2 and its target genes may represent repressive regulation by Mig1p. Furthermore, the relationship between the two latent variables in this stage appears causal. The second factor is considered to be a complex involving Mig1p, since Mig1p is known to form a complex when binding to F2's target genes. The estimated regression values between YGL035C and the two latent variables are shown in Figure 4B. Only YIL162W was regulated by both F1 and F2. This may be occurred by the difference between the meaning of arrows from F1 and those from F2. The arrows from F1 indicated the position of physical interaction, on the contrast the meaning of arrows from F2 meant binding regulation by Mig1p complex. The expression of YIL162W is known as to be repressed by Mig1p binding, but its expression may be controlled by the balanced condition in cell.
In the second stage, the observed variable representing the ga14 gene, YPL248C, has a causal relationship with the two latent variables. One of these latent variables is also regulated in the third stage by the ga180 gene, YML051W. The details of this regulation are shown in Figure 4C. The latent variable F4 is regulated by the two genes, and is therefore considered to be a complex of Gal4p and Gal80p. Such a complex has been observed empirically.33–37,44 The model aptly reflects the fact that the regulation of GAL 1, 2, and 7 is controlled by this complex. In general, the target genes in my model were divided into three types: those regulated by a single transcription factor, those regulated by a transcription factor complex, and those regulated by both. Overall, the model successfully describes the hierarchy of the serial transcriptional regulation among GAL-related genes.
Some relationships between the genes were indicated by the MI scores. The tentative relationships between genes have a reverse direction, such as from a gene in the second stage to a gene in the first stage. In this study, those relationships between genes were not included in the model, since they do not clarify the relationships between proteins and genes in serial transcriptional regulation. However, it is possible that the relationships with a reverse direction indicate feedback regulation. To clarify the feedback regulation in this model, all proteins encoded by all genes should be defined as latent variables in SEM, because it has no previous information about latent variables. This approach contradicts the assumption that the number of latent variables should be smaller than the number of observed variables, and the parameters are not estimated. Since a cyclic model can be analyzed by SEM, the feedback regulation will be clarified with the latent variable information in the future.
Although many factors are suspected to regulate gene expression, since their underlying mechanisms are unclear, these regulators can be viewed as little more than a black box. Here, I have shown that SEM is a powerful approach to estimate the gene expression network controlled by transcription factor binding, based on its gene expression profiles. As biological data accumulate, it is expected that SEM will be applicable to a wide number of gene networks to clarify gene-protein interactions.
Conclusions
In this study, one serial transcriptional regulation was reconstructed by a model that incorporated both genes and proteins from only the gene expression profiles, in the absence of protein information. Since the interactions between proteins and genes can be accurately inferred, this approach should be of great interest to systems biologists. The ability to identify expression profiles and the corresponding biological functions is expected to provide further possibilities for SEM in the inference of regulatory mechanisms in cells. As this approach can be applied to numerous systems and organisms beyond yeast, my findings should be of interest to a wide field of biologists.
Disclosures
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.
