Abstract
Introduction
In recent years, significant advances in both the genotyping and computing fields have enabled us to perform large-scale genome-wide association studies (GWAS).1,2 GWAS have demonstrated enormous potential in identifying genetic variants for common complex diseases.
In GWAS, regression analysis, including both logistic regression for binary phenotypes and linear regression for quantitative phenotypes, has been one of the most powerful methods to detect associations between genetic variants and target phenotypes. It has successfully identified novel genetic variants, such as single nucleotide polymorphism (SNP), for many complex diseases, including type 2 diabetes,3–5 bipolar disorder,6,7 and various cancers.8–10
However, these applications are mainly limited to the investigation of association between a single variant and target phenotype, not that between multiple variants and a target phenotype. Note that the association tests are usually performed by fitting a linear or logistic regression model. Although the regression analysis requires inversion and multiplication of matrices with high computational complexity, the computational burden is not so tremendous for the single variant analysis. However, in the case of gene-gene interaction (GGI) analysis in an exhaustive manner, the tremendous computational burden renders regression analysis infeasible, because the number of possible tests is extremely large, say
In order to overcome this computational burden and identify GGIs within feasible computational time, many approaches have been suggested. They can be classified into two main approaches. One is to reduce the number of possible tests using a variable selection method, and the other is to develop an accelerated algorithm. Variable selection is most commonly used. For example, sure independence screening (SIS) is the most popular variable selection strategy.
11
Ueki and Tamiya proposed an algorithm to identify GGI with SIS variable selection strategy.
12
By selecting the variables in the most parsimonious way, it was shown to be possible to identify more GGIs than other competing methods. However, the variable selection approach still has a chance to miss the true GGI if the variable screened out has a true interaction with the other variable. By motivating from this issue, many researchers have focused on the identification of GGI in an exhaustive manner with a high speed. For example, Wan et al proposed BOolean Operation-based Screening and Testing (BOOST) method
13
in order to make the identification of two-way GGI possible by using the Boolean operation method. However, since this method is based on the fast calculation via Boolean operation, GGIs that cannot be represented via the result of their XOR bitwise manipulation are difficult for BOOST to detect. Alternatively, Kwon et al proposed cuGWAM
14
in order to make the identification of
Although BOOST and cuGWAM are very powerful methods for GGI analysis, they are not as flexible as regression analysis in dealing with individual covariates; however, regression-based GGI analysis has not been successfully performed because of the computational burden described earlier. Thus, an acceleration of regression-based GGI analysis is required in order to make an exhaustive investigation computationally feasible. In this work, we thus develop a new regression-based toolkit, CARAT-GxG, which dramatically accelerates the performance of regression analysis of GGI. CARAT-GxG can find an association between a phenotype and single SNP or between several phenotypes and two SNPs with interaction, while adjusting for covariates. CARAT-GxG uses logistic regression for binary phenotype and linear regression for quantitative phenotype. By substantially accelerating the speed of regression-based GGI analysis using GPU, we achieved an exhaustive investigation of GGI from GWAS dataset within a couple of weeks, which was not possible in the traditional analyses. Furthermore, CARAT-GxG can be accelerated by CUDA (compute unified device architecture) using a high-throughput GPU and can be applied to GPU computing systems via TORQUE Resource Manager (http://www.adaptivecomputing.com/products/open-source/torque), the GPU computing architecture.
In this paper, we describe in detail how regression analysis process can be implemented using GPUs and how much acceleration can be achieved by GPU implementations compared to traditional central processing unit (CPU) implementations. In addition, we describe the factors that affect the performance and accuracy of GPU programs, and evaluate how much these factors affect the performance of CARAT-GxG. Next, we demonstrate how the several adjustable parameters related to regression analysis and GPU-related computational issues affect the accuracy of the results and the execution time. Finally, we introduce a way to apply a GPU computing system with CARAT-GxG and summarize the improvements achieved by using the GPU computing system. CARAT-GxG is freely available from the software website (http://bibs.snu.ac.kr/software/caratgxg).
Methods and Implementation
CARAT-GxG was implemented in the C/C++ language within a CUDA environment. For performance comparisons, R package and PLINK were used.
Implementation of analysis method
CARAT-GxG is capable of performing all possible SNP combinations in a regression model, including all possible pair-wise SNP–SNP interactions (eg, a dataset with 10,000 SNPs has 49,995,000 SNP–SNP pair-wise interaction combinations in addition to 10,000 single SNP combinations). The SNP variables, including main and interaction effects, can be treated either as nominal or ordinal variables. Depending on the phenotype, CARAT-GxG is implemented by using either logistic or linear regression.
For the continuous phenotypes, CARAT-GxG builds a linear regression model and estimates the parameters using the ordinary least squares (OLS) method for each SNP–SNP combination. This OLS estimation does not require any iterative processes. The significance of model parameters is evaluated via
GPU implementation
The GPU is a circuit specialized for graphical processing, which usually requires large-scale parallelism and floating-point calculations. Thus, GPU has successfully been used to implement several key methods in bioinformatics, such as GGI analysis, 14 sequence alignment,16,17 and database searching. 18 These GPU applications have been developed using CUDA by Nvidia, Stream SDK by ATI, or in recent years, OpenCL. In order to implement our regression-based GGI application, we chose the CUDA architecture as our development environment.
Generally, a different implementation strategy is required when developing an application that uses the GPU rather than the CPU because of differences in the underlying architecture such as in the memory structure or processing pipeline. These differences make it impossible to port many traditional algorithms based on the CPU to the GPU architecture. Thus, we developed a new implementation strategy to perform regression analysis using the GPU. A key feature of our CARAT-GxG implementation is the ability to execute multiple combinations concurrently. Our implementation consists of the following steps, as shown in Figure 1:
As a preliminary step, load the dataset into each GPU memory. Enumerate all possible two-way combinations. Assign the combinations into the GPU memory. Launch a GPU kernel with a given number of combinations from the CPU. Distribute the combinations to threads that are included in each block. The number of threads is automatically determined to maximize performance. Fit a statistical model in each thread, with a given combination through the GPU-optimized algorithm. Calculate the Repeats step 3–7 until all combinations are processed.

Execution flow of CARAT-GxG. The blue arrow indicates the execution flow between CPU and GPU. The upper and lower sides indicate the task of CPU and GPU, respectively. Data transfer between CPU and GPU is illustrated by an arrow, with description. Both tasks at the same time point can be executed concurrently.
In order to accomplish optimal performance using the GPU, many aspects must be considered. Since the main computational burden of regression analysis occurs during matrix calculation, an optimized access of memory, which highly varies by the model of graphics card, is essential to minimize race conditions. CARAT-GxG automatically selects the most appropriate parameter of GPU execution. In order to determine this parameter, a very naïve but fast approach is applied; it is achieved via a sequential test of equally spaced candidates of optimal parameters. As shown in Figure 2C, a concave trend of execution time along with the parameter value justifies this approach.

Results of CARAT-GxG performance assessment. (A) The dotted line indicates the theoretical acceleration folds by adding a graphics card. The solid line indicates measured acceleration folds in two against one (green) and three against one (blue) graphics cards. (B) Execution time between CARAT-GXG and CPU implementations in a single SNP test. (C) Execution time of CARAT-GXG according to the number of threads and blocks with the dataset including 1,000 samples with 500 SNPS.
Methods for performance comparison
Typically, R package and PLINK are used in order to perform regression analysis of GWAS data. Thus, a performance comparison between the former tools and CARAT-GxG is essential. In addition, it is important to note the differences arising from execution on the GPU versus the CPU. Some of these differences are as follows:
Calculation architecture and conditions: CPU versus GPU, one way and two way. Acceleration by the number of GPUs: one, two, and three GPUs. Difference of accuracy between GPU and CPU implementations: this arises because the CPU and GPU treat floating-point numbers differently. Performance during repeated iterations of the NR method: it is essential to see how the number of iteration affects the speed and result. Acceleration by the application of CARAT-GxG to the GPU system.
While the ultimate purpose of regression analysis on GWAS data is to find statistically significant associations between variants and target phenotype, the results should also have biological significance. Thus, it is important to evaluate their biological significance as well. For this reason, an additional parameter to evaluate the biological significance of identified SNP–SNP combinations is required. It can be determined in many ways, such as through a literature search or biological data mining.
In order to assess the analysis result of CARAT-GxG, we analyzed a real GWAS dataset from an age-related macular degeneration (AMD) study, which has several validated GGI. 19 For AMD dataset, several validated association results, including interactions, have been already reported in studies.20,21 The AMD dataset contains a binary phenotype indicating the status of AMD. The genotypes of 146 samples with 105,980 SNPs were available, which passed quality control process.
GPU computing system setup
Deploying GPUs to computing system infrastructures is unpopular because the method of implementing general-purpose computing on graphics processing units (GPGPU) differs depending on the vendors of graphic cards. However, CUDA has now become a de facto standard for GPGPU. Several open-source-based parallel computing infrastructures, such as TORQUE or Ganglia Monitoring System, now support the integration of CUDA technology. For these reasons, we developed our CARAT-GxG to perform grid computing via TORQUE, an open-source-based infrastructure enabling the control over batch jobs and distributed computing resources.
In order to evaluate the effectiveness of the GPU computing system with CARAT-GxG, an appropriate setup for a given GPU computing system is required to run CARAT-GxG. All setup and comparisons were performed on both the stand-alone GPU system and the GPU cluster. The stand-alone GPU system contains an Intel Core i7–950 CPU and three NVIDIA GTX480 graphic cards, while the GPU cluster consists of four nodes, and each node includes two physical CPU cores with 24 threads and eight NVIDIA Tesla M2070 GPUs.
According to the system setup of TORQUE, CARAT-GxG automatically generates sequences of tasks that are independently queued by TORQUE and executed. Since the queued tasks are automatically distributed by TORQUE, GPU utilization can be maximized.
Results and Discussion
All performance comparisons between CARAT-GxG and CPU implementations were conducted in a stand-alone GPU system with an Intel Core i7–950 CPU and three NVIDIA GTX480 graphic cards. Owing to enormous time consumption of the CPU version of the R code, the whole execution time of the CPU had to be estimated from the execution time of lower (1,000) combinations in each dataset. In order to perform the comparisons, we simulated a simple dataset consisting of varying numbers of SNPs from 100 to 5,000 and samples from 100 to 1,000 with randomly generated binary phenotype. The genotypes in the dataset were also randomly generated with diverse minor allele frequencies (MAFs) using the uniform distribution between 0.05 and 0.5.
It is important to ensure maximum utilization of the GPU because the performance of GPU is highly dependent on the parameters required by GPGPU. In this context, we investigated the relationship between the analysis efficiency of CARAT-GxG and two parameters given upon executing GPU: the number of threads and the number of blocks. As shown in Figure 2C, it is clear that there is a concave trend as the number of blocks varies. In contrast, it is relatively easy to determine the number of threads, because the number of optimal threads is usually identical to the warp size of the GPU. However, it is difficult to determine the optimal number of blocks to maximize performance because of its dependence on the underlying GPU. Hence, we developed CARAT-GxG to attempt to automatically determine the appropriate number of blocks and threads.
The comparisons between CARAT-GxG and traditional CPU implementations, as shown in Table 1 for SNP–SNP interaction analysis and Figure 2B for single SNP analysis, found that CARAT-GxG performs up to almost 700 times faster. More specifically, our comparison showed that the speedup stabilized as the number of SNPs or individuals increased. The low improvement when evaluating 100 SNPs in Table 1 is caused by the overhead during the initialization and additional processes, such as data transfer between the CPU and GPU or the storage of intermediate results.
Execution time of CARAT-GxG and CPU implementation in two-way testing (UNIT: SECOND).
From the comparison of acceleration by adding a graphics card to the execution, CARAT-GxG showed steady performance improvement, as shown in Figure 2A. Note that as the size of dataset increases, the degree of efficiency of parallelization becomes crucial. This efficiency was in the range of 95–99% compared to the ideal performance as the size of dataset increased. In addition, CARAT-GxG performed strongly even when using more complex models with additional variables such as interaction and covariates. As shown in Table 1 and Figure 2B, CARAT-GxG outperformed the CPU implementations when calculating the two-way model including interaction.
When comparing the reliability represented by
The other comparison was made from the difference of rank, sorted by

The red, blue, and green lines indicate the number of combinations that change in rank, vanish, and do not change in rank as the number of iterations increases, respectively.
We further evaluated the performance gain of CARAT-GxG on our GPU cluster. In the evaluation, the total analysis time was almost inversely proportional to the number of nodes requested for use by CARAT-GxG, indicating that our approach is highly suitable to implement regression analysis using GPUs with a GPU computing system. In brief, CARAT-GxG on our GPU cluster was capable of performing an exhaustive two-way linear regression analysis with the GGI of 500 K SNP chip in 12 days.
Finally, we tested CARAT-GxG with a real GWAS dataset from an AMD study.
19
We found that the top two combinations have already been identified in other studies,20,21 and rs380390 is found in the top three combinations except the lowest
A list of top five significant two-way combinations from AMD data.
In the analysis of biological data using a bioinformatics approach, the size of data and required computing power are usually very large. Thus, the performance of a given analysis method is a critical factor in providing results for further analysis. However, as the method becomes more complex, the time required to perform analysis increases rapidly.
Motivated by these issues, we posit that GPU programing has the possibility to dramatically increase the efficiency of analysis. Its parallelism is more suited to many bioinformatics methods, which require massive amounts of independent calculation. The benefit of this was evident in our almost 700fold execution speed performance.
In addition, CARAT-GxG showed stable performance increment by adding a new graphics card, through our GPU-specified optimization and performance tuning processes. As shown in Table 1, researchers who use the CARAT-GxG can easily estimate and control the analysis time. We expect that this predictability can provide an advantage to researchers.
From the aspect of GPU programing, we developed CARAT-GxG to gain more performance and provide more usability by parameter optimization; adjusting the number of blocks and threads is the key to optimize performance. For example, if there are 480 cores that can be utilized in GPU, running a single kernel with more than 480 blocks would take much more time, because some blocks waiting to be executed would remain. Simulation results suggest that we should take these attributes of the GPU into account. We confirmed it before the comparison, by executing CARAT-GxG with various numbers of blocks and threads. It suggested that the execution with appropriate numbers of blocks and threads can boost the performance almost twofold.
Finally, we achieved additional acceleration via an application of a GPU system using TORQUE and showed that our task-based approach analyzed the given dataset in time almost inversely proportional to the number of requested nodes by CARAT-GxG. However, there is a room for achieving more acceleration, because we still do not consider the difference between requested nodes, which can cause speed deceleration.
Cconclusion
Recently, a number of huge sequencing projects have delivered hundreds of completely mapped sequences, fueled by rapid advancements in sequencing technology. Consequently, the number of available variants has increased significantly. Inevitably, the development of a more computationally efficient method is required in order to investigate the GGI in large datasets. In this aspect, an extensive application of grid computing systems can considerably accelerate the speed of analysis. However, traditional CPU-based grid computing systems are essentially not suitable to perform a huge number of independent tasks. However, the application of GPU can overcome this drawback, with the high-throughput of parallel tasks and computational efficiency. We took advantage of a GPU-based computing system and developed CARAT-GxG in order to address such parallelism and efficiency issues. As a result, we successfully showed that a GPU-based computing system could effectively handle the GGI analysis of large-scale GWAS data in the regression analysis framework in an exhaustive manner. For our best knowledge, it is the first attempt to provide a toolkit that enables an exhaustive investigation of regression-based GGI in GWAS. In contrast to the variable selection-based methods, CARAT-GxG does not screen out any variant from the dataset and, thus, it has a less chance of missing the true interactions.
In addition, we also showed that there are many aspects that GPU analysis can account for, such as the number of blocks. Since regression analysis using GPU is strongly dependent on the frequent access of global memory, the optimal number of blocks is necessarily varied. By optimal allocation adjustment of the number of blocks, CARAT-GxG is dynamically tuned and achieves maximal efficiency. This kind of optimization certainly should be included in many heterogeneous systems, especially in future GPU-based computing systems.
In conclusion, we successfully showed that an exhaustive two-way GGI analysis using regression analysis can be achieved within one-and-a-half weeks in a small GPU computing system. From the comparison of traditional toolkit on a standalone GPU system, we showed that our toolset provides up to 700-folds of acceleration. This fast acceleration enables an investigation of GGI in GWAS dataset in an exhaustive manner. We expect our CARAT-GxG to provide good performance and be practically applicable to the GGI analysis of whole-genome scale datasets. Since the CARAT-GxG supports both continuous and binary phenotypes, it could be applied to many GWAS dataset with various types of phenotypes. We also expect that CARAT-GxG can be easily adapted to future large-scale GPU-based computing systems because of its high scalability.
Author Contributions
Contributed to the writing of the manuscript: SL, MSK. Jointly developed the structure and arguments for the paper: SL, MSK, TP. Contributed to the writing of the manuscript: SL, TP. Agree with manuscript results and conclusions: MSK. All authors reviewed and approved of the final manuscript.
