Abstract
Introduction
Micronuclei (MNs) are small nuclear bodies that are formed in dividing cells but are not part of the nucleus. Therefore, MNs can only be found in cells that have undergone nuclear division at least once and appear as small extranuclear bodies. When two daughter nuclei are formed during cell division, these bodies are placed into a smaller nucleus that is not part of the main nuclei, hence the term “micronuclei.” 1 Once the MNs are formed, the cell has several different response options. MNs can remain within the cell, if they have functional DNA, as separate entities or be reabsorbed into the main nucleus. If the DNA is nonfunctional, the MNs may be expelled from the cell or the whole cell may be destroyed through apoptosis. Because MNs can be expelled from the cell, they can be used as a mechanism to remove extra chromosomes from the cell. 1
MNs can form spontaneously or they can be induced by mutagens. Some spontaneous MNs are actually beneficial to the organism. An example is in the mouse cerebral cortex, where in MN formation adds diversity to the nervous system. 1 However, a large majority of MNs are caused by mutagens and may play a role in carcinogenesis. Depending on the fate of the MN, the result could be a variety of different DNA and chromosome cell contents. This variety could result in an accumulation of DNA changes and instability that could result in cancer. 1 Several studies have shown that higher MN counts result in a higher risk of cancer in the future. 1 Thus, using the cytokinesis-block micronucleus (CBMN) assay as a risk assessment tool for cancer has potential clinical benefits. Further, combining CBMN with other high-throughput technologies such as gene expression and methylation analyses may help identify factors related to micronucleation.
Quantifying MNs in patient samples has been shown to be a good measure of genetic damage. MN scoring, ie, counting the number of MNs present in a sample, is a popular tool for testing genotoxicity mostly because of its simplicity, accuracy, applicability to different cell types, and ease of automation. Cancer cells show a loss of genetic control, which can be caused by DNA damage; so, they are good candidates for MN testing. The CBMN assay has successfully been used and validated to score MNs. The CBMN assay uses cytochalasin-B, which stops cells from performing cytokinesis but does not stop nuclear division, giving rise to cells that are binucleated.2,3 Furthermore, the Organisation for Economic Co-operation and Development has developed a set of guidelines for running the CBMN assay to obtain the most consistent and reliable results. 1
Guidelines for the process of scoring MNs have been presented by the HUman MicroNucleus (HUMN) project. This is an international collaborative project aimed at improving the application of the CBMN assay. One of the main goals of the HUMN project is to identify methodological variables in the scoring of the assay to minimize confounding effects. 4 The HUMN project compiled a list of 6583 subjects from 25 laboratories in 16 countries and looked at background MN frequency using the CBMN assay. The goal of the study was to identify variables that affect the background MN frequency. Scoring criteria were found to account for 47% of the observed variability; thus, standardized scoring criteria were developed and described by Fenech et al. 4 The guideline includes scoring 2000 cells to accurately estimate MN frequency.
Because these guidelines were developed for assay performance, they do not address how to statistically analyze the data generated by the assay. This has led to the application of various statistical methods that may render different interpretations and conclusions. In a review article examining analytical methods, Ceppi et al.
5
reviewed 63 studies that statistically analyzed MN data and developed recommendations for selecting an appropriate analytical method. The review included studies that applied both parametric and nonparametric tests. The nonparametric tests included Kruskal-Wallis, Friedman, Wilcoxon, and Mann-Whitney U-tests. Although these tests do not require an underlying distributional assumption, they are unable to adjust for confounding factors. There were a variety of parametric tests performed that assume normality, such as analysis of variance, analysis of covariance, and multivariable linear models, which can adjust for confounding factors. Other methods such as correlations and Student's
When trying to identify molecular features related to MN frequency, high-throughput genomic assays can be used. However, the previously described methods cannot be applied in settings where in there are more predictor variables than samples. Therefore, in this study, we extended the generalized monotone incremental forward stagewise (GMIFS) method to the Poisson regression setting and applied it to a cord blood study, the MN frequency of which we were interested in predicting using features from the Agilent 4 × 44k human oligonucleotide microarray.
Methods
Data
The cord blood data were collected as part of the Norwegian Mother and Child Cohort Study (MoBa). 6 The target population of MoBa comprised all women who gave birth in Norway. The overall goal of this study was to collect data on pregnant women and their children to estimate the association between exposures and diseases. Specifically, the data are taken from a subcohort called BraMat, which translates to “good food” in English. This subcohort concentrates on what effect a pregnant woman's diet has on her child. Umbilical cord blood samples were collected immediately after birth from 200 babies. After quality control and other exclusions, 111 samples were hybridized to Agilent 4 × 44k human oligonucleotide microarrays to measure gene expression. Of the 111 subjects, 29 also had MN data collected. The MNs were scored using the procedure described by Decordier et al. 7 Further, demographics such as gender, were collected for all subjects. Data were downloaded from Gene Expression Omnibus (GSE31836). Sample processing, image analysis, normalization, background correction, and filtering are described in the study by Hochstenbach et al. 8 For this analysis, the data were further filtered to only include genes that had no missing values, leaving 8497 genes for statistical analysis.
Statistical Methods
There are many available methods that can model count data. However, these methods require independence of explanatory variables (p) and that the number of samples (n) does not exceed the number of explanatory variables. The incremental forward stagewise regression method for linear regression and the GMIFS for a logistic regression model have been previously described.
9
The GMIFS method for modeling ordinal response data has also been described.
10
To assist in our extension to the Poisson regression setting, we first review Poisson regression. We subsequently describe our GMIFS method for fitting Poisson regression models when
Poisson Regression
Poisson regression is commonly used to model count data. Let
Then, the conditional probability is given by
Mathematically, it is easier to maximize the log-likelihood, which is given by
Thus, we are looking for the value of λ that maximizes the log-likelihood above. Further, an offset is used if the response variable can be considered a rate. For example, MN frequency is scored from a larger number of total cells. Therefore, if the total number of cells examined varies by subject, an offset is appropriate. In this case, the expected value is
Again, mathematically, it is easier to maximize the log-likelihood, which is given by
Once again, we are looking for the λ value that maximizes the log-likelihood. These log-likelihoods are used to model predictor variables. In Poisson regression, the model assumes that the expected value can be modeled by a linear combination of predictors. In this case, the natural log of
GMIFS Poisson Model
The GMIFS method was previously described for the logistic regression scenario by Hastie et al.
9
but can be adapted to a Poisson regression model. For the proposed method, we consider three types of parameters that θ from the section “Poisson regression” can be separated into along with an offset (
The algorithm proceeds in an iterative fashion and updates one of the penalized covariates by a small incremental amount at each step. To determine which penalized covariate is to be updated next, the largest negative gradient is used. Thus, we need to calculate the first derivative of the log-likelihood corresponding to each penalized predictor. The log-likelihood written in terms of α, β, and γ is
Once we know which covariate to update, we need to determine in what direction to update the covariate. To know the direction of the update, the second derivative would need to be calculated, which is a cumbersome process. Hastie et al.
9
showed that to avoid having to calculate the second derivative, an expanded covariate space can be used. For example, let β1,…, β
Initialize the components of Initialize the intercept α and the unpenalized coefficients γ
Considering α and γ fixed, find the predictor Update the corresponding coefficient Update α and the unpenalized coefficients, γ
Repeat steps 3–5 until the difference between successive log-likelihoods is less than a prespecified tolerance, τ.
The defaults for the GMIFS algorithm are ε = 0.001 and τ = 0.00001.
Comparative Method: Penalized Linear Regression
A penalized linear regression model can be fit by adding a penalty term to the sums of squares. Specifically, the glmpath algorithm uses a linear combination of the
The glmpath algorithm uses a default binomial distribution with a logit link and λ2 = 0.00001. The algorithm also allows for a Poisson distribution with a log link and Gaussian distribution with an identity link. The algorithm then computes the regularization path for generalized linear models with L1 penalty.
Simulations
Simulations are a useful technique to test how well a new methodology performs. In this case, we wished to quantify how accurately the GMIFS method estimated true nonzero coefficients and predicted count data. Furthermore, we wished to determine how the GMIFS method compared relative to the glmpath method in predicting the count outcome, and simulations provide a good platform to accomplish this comparison. Several general steps must be considered in the simulation process: how to simulate the response, how to simulate the predictors associated with the response, and how to simulate the predictors not associated with the response. Furthermore, we wished to examine how the methods perform under ideal situations and nonideal situations, such as when distributional assumptions are met and are not met, respectively. Note that all simulations were performed using the R programming environment (version 3.1.1). 13
First, we considered the situation where the response is Poisson distributed and the user fits a Poisson regression model. Then, we generated the response to follow a Poisson distribution where an offset was either used or not used. The uniform distribution was used to generate the predictors. The steps involved in simulating the data under these conditions were as follows:
Randomly generate
Choose
Assign the
Generate the λ values for the Poisson distribution using the following formula:
Randomly generate
Fit a Poisson GMIFS model and fit a glmpath model.
Repeat steps 1–6
This simulation method was adjusted in several places. In this case, we chose
The number of true predictors that have a nonzero coefficient;
the number of false predictors that have a nonzero coefficient;
accuracy of count predictions from the model (sum of squared residuals) when applied to an independent test set.
The methods were compared with and without the use of an offset during the simulation process. Furthermore, the glmpath method allows for the use of Gaussian and Poisson distributions. Thus, those options were also used to see what effects user error had on the results. Thus, a total of three models were compared when the true distribution was Poisson:
Poisson GMIFS model;
glmpath using “poisson” family option and λ2 = 0 which fits a LASSO model; and
glmpath using “gaussian” family option and λ2 = 0 which fits a LASSO model.
Results
Simulations
Simulations were performed as described in “Simulations” of the Methods section, and Figures 1–3 show the results of the simulations. Figure 1 shows the distribution of the number of predictors correctly identified as nonzero over 100 simulations and the types of models used. The data were generated using both

Number of predictors correctly identified as nonzero. This figure shows the distribution of the number of predictors correctly identified as nonzero over 100 simulations. There were five predictors that were set as nonzero. Boxplots are separated by the type of distribution used to generate the data and the number of observations.
Figure 2 shows the distribution of the number of predictors incorrectly identified as nonzero over 100 simulations and the types of models used. The data were generated using both

Number of predictors incorrectly identified as nonzero. This figure shows the distribution of the number of predictors incorrectly identified as nonzero over 100 simulations. There were 95 predictors for which their coefficients were set to zero. Boxplots are separated by the type of distribution used to generate the data and the number of observations.
Figure 3 shows the distribution of the sum of residuals squared as a measure of the model prediction accuracy. The data were generated using both

Accuracy of count predictions. This figure shows the distribution of the sum of residuals squared over 100 simulations using a learning data set and an independent test data set. Boxplots are separated by the type of model fit to the data and the number of observations. The results for glmpath with Gaussian family using an offset are not displayed because both values are above 50000.
Gene Expression Analysis
Both GMIFS and glmpath models were applied to the cord blood gene expression data set described under “Data” of Methods section. For glmpath, the Poisson family option was used and the lambda2 option was set to zero. For GMIFS, the default options were chosen. The response in the model was MN counts, and the predictors were the gene expression intensities. Gender was included in the model as part of the unpenalized subset. Based on Figure 4, a Poisson distribution was assumed for both models because the data appear skewed. The final model parameters were chosen using the minimum Akaike information criterion. The GMIFS model identified 17 nonzero gene expression coefficients as associated with MN count and the glmpath with Poisson family identified 23. Out of the genes that were identified, 10 were common to both models. Figures 5 (sum of squared residuals = 101.7) and 6 (sum of squared residuals = 1.8) show that both models seem to predict MNs relatively well. Table 1 shows the genes that both models identified as being associated with MN count and the types of cancer with which they are linked. Nine out of the 10 genes in common between both models are linked to some type of cancer.

Histogram of MN counts.

Plot of actual MN counts versus predicted MN counts using GMIFS.

Plot of actual MN counts versus predicted MN counts using glmpath.
Genes identified as associated with MN count by both GMIFS and glmpath.
Discussion
We have described the GMIFS method for modeling a count response when we want to (1) coerce some variables into the model and (2) perform automatic variable selection and model estimation by penalizing predictors. High-throughput data contain more predictors than there are samples, so traditional methods are not appropriate in this setting. The GMIFS method was compared to glmpath, a popular penalization algorithm. Simulations showed that both methods performed similarly when identifying predictors known to be nonzero. GMIFS appeared to slightly outperform glmpath in the sense that GMIFS included fewer predictors that are truly unimportant in the model. Similarly, when applied to an independent data set, GMIFS appeared to have higher predictive accuracy. Thus, it appears that GMIFS is more generalizable than glmpath to independent data sets.
Finally, both methods were applied to a cord blood gene expression data set. Gene expression profiles were used to predict MN frequency. Both models identified a similar number of genes as related to MN frequency. Further, 10 of those genes were common to both models. Nine out of the 10 genes have been shown to be associated with different types of cancers. Because MN count is a measure of DNA damage, genes associated with MN frequency would be expected to be linked to cancer.
Both models appear to identify genes linked to cancer. As in the simulations, glmpath identified more genes as nonzero compared to GMIFS. In the simulations, this was because glmpath was including more predictors incorrectly. However, there is no way to know whether this is also the case in the cord blood data set, given that these data are observational and no further confirmatory studies can be performed on the samples.
Author Contributions
Conceived and designed the experiments: KJA. Analyzed the data: MSM. Wrote the first draft of the manuscript: MSM. Contributed to the writing of the manuscript: KJA. Agree with manuscript results and conclusions: MSM, KJA. Jointly developed the structure and arguments for the paper: MSM, KJA. Made critical revisions and approved final version: KJA. Both authors reviewed and approved of the final manuscript.
