Abstract
1 Introduction
The fast multipole method (FMM) was introduced by Greengard and Rokhlin (1987) to efficiently evaluate pairwise Coulombic or gravitational interactions in many-body systems, which arise in fields as diverse as biomolecular simulation (Dror et al., 2012; Hansson et al., 2002), astronomy (Arnold et al., 2013; Potter et al., 2017), and plasma physics (Dawson, 1983). Moreover, the FMM can improve iterative solvers for integral equations by speeding up the underlying matrix-vector products (Engheta et al., 1992; Gumerov and Duraiswami, 2006).
The originally proposed FMM uses a spherical harmonics representation of the inverse distance
In atomistic molecular dynamics (MD) simulations, Newton’s equations of motion are solved for a system of N interacting atoms.
The electrostatic contribution to the inter-atomic forces is governed by Coulomb’s law,

\mathbf{F}_i = \frac{q_i}{4\pi\varepsilon_0} \sum_{j \neq i} q_j \frac{\mathbf{r}_i - \mathbf{r}_j}{|\mathbf{r}_i - \mathbf{r}_j|^3},

where q_i and \mathbf{r}_i denote the partial charge and position of atom i.
To this aim, several FMM implementations have been developed. A standard FMM was included as an electrostatic solver in the NAMD package (Board et al., 1992; Nelson et al., 1996). Ding et al. (1992a) proposed the Cell Multipole Method (CMM) to simulate polymer systems of up to 1.2 million atoms. In further work, they combined the CMM with the Ewald method, showing a considerable speedup with respect to a pure Ewald treatment (Ding et al., 1992b). Niedermeier and Tavan (1994) introduced a structure-adapted multipole method, which Eichinger et al. (1997) later combined with a multiple-time-step algorithm. Andoh et al. (2013) developed MODYLAS, an FMM adaptation for very large MD systems, and benchmarked it on 65,536 nodes of the K computer; very recently, it was extended to support rectangular boxes (Andoh et al., 2020). Yoshii et al. (2020) developed an FMM for MD systems with two-dimensional periodic boundary conditions. Shamshirgar et al. (2019) implemented a regularization method for improved FMM energy conservation, and Gnedin (2019) combined fast Fourier transforms (FFTs) with the FMM for improved performance.
Considering efficient parallelization approaches, Gumerov and Duraiswami (2008) pioneered GPU implementations of the spherical harmonics FMM with rotation operators; depending on the required accuracy, they achieved speedups of 30–70 with respect to a single CPU. Different GPU parallelization schemes for the “black-box” FMM (Fong and Darve, 2009) were implemented by Takahashi et al. (2012). Yokota et al. (2009) parallelized ExaFmm on a GPU cluster with 32 GPUs, achieving parallel efficiencies of 44% and 66% for
However, the early FMM adoptions in MD simulation codes were mostly superseded by particle mesh Ewald (PME) (Essmann et al., 1995) due to its higher single-node performance; as a result, PME currently dominates the field. PME is based on the FFT, which inherently provides the PBC solution. Nevertheless, PME suffers from a scaling bottleneck when parallelized over many nodes, as the underlying FFTs require all-to-all communication (Board et al., 1999; Kutzner et al., 2007, 2014). In addition, large systems with nonuniform particle distributions become memory intensive, since PME evaluates the forces on a uniform mesh across the whole computational domain.
In the era of ever-increasing parallelism and exascale computers, it is time to revisit the FMM, which does not suffer from the above-mentioned limitations. To this end, we implemented and benchmarked a fully CUDA-parallel single-node FMM. Our implementation is tailored to MD simulations, i.e. it targets millisecond-order runtimes for a single MD step by carefully GPU-parallelizing all FMM stages and by optimizing their flow to hide possible latencies. It was also meticulously integrated into the GROMACS package to avoid additional FMM-independent performance bottlenecks. Here, we present three different parallelization approaches. The implementation is based on the ScaFaCos FMM (Arnold et al., 2013), which utilizes spherical harmonics to describe the
Here, we focus on the CUDA parallelization of the Multipole-to-Local (M2L) operator, which limits the overall FMM far field performance the most. An overview of the parallelized FMM, including all stages and complete runtimes, can be found in Kohnke et al. (2020b).
2 The fast multipole method
We consider a system of N particles with charges q_i at positions \mathbf{r}_i. The Coulomb potential at particle i reads

\Phi(\mathbf{r}_i) = \sum_{j \neq i} \frac{q_j}{|\mathbf{r}_i - \mathbf{r}_j|},

where the sum runs over all other particles and, under PBC, over their periodic images.
2.1 Mathematical foundations
Expansion of the inverse distance between arbitrary particles
where
are spherical harmonics and
where
The normalized associated Legendre polynomials form an orthonormal set of basis functions on the surface of a sphere. The
and chargeless local moments
respectively. The moments weighted with corresponding charges
and charged local moments
This allows evaluating the potential at arbitrary particles at
This calculation is referred to as far field. To achieve convergence in Eq. (3),
has to be fulfilled for all distinct index pairs
of particles
where
where
where
2.2 Algorithm
Applying the operators defined in the previous section requires truncation of Eq. (3) to a finite multipole order
and

Indexing of triangular shaped matrices. The used indexing scheme is based on standard matrix index notation: The first subscript is a row number, the second one is a column number, which can be negative. In case of

2D example of FMM tree depth and resulting box subdivision for depths

The six different stages of the FMM with an exemplary execution time distribution at the center. The near field part (P2P, top right corner) can be executed concurrently with the far field (stages 1–5) in a parallel implementation. Green squares indicate the representation by multipoles, light brown squares a representation by local moments, blue squares indicate direct summation.
2.3 The Multipole-to-Local (M2L) transformation
We will now explain the M2L transformation and its execution hierarchy. Let
Let
Figure 5 shows one

2D interaction set

One M2L transformation. The matrix-vector like multiplication requires a part of the
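In a simplified, hedged form, the matrix-vector-like structure of one M2L transformation can be sketched in host-side C++: each local moment accumulates all multipole moments weighted by operator entries. The index handling and operator definition here are illustrative assumptions, not the paper's exact formulation.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Schematic M2L: mu_{l,m} = sum_{j,k} B_{l+j, m+k} * omega_{j,k}.
// Moments are stored for -l <= m <= l up to order p, the operator
// up to order 2p; index m+l maps m in [-l, l] to [0, 2l].
using Moments = std::vector<std::vector<std::complex<double>>>;

Moments makeMoments(int order, std::complex<double> value) {
    Moments m(order + 1);
    for (int l = 0; l <= order; ++l)
        m[l].assign(2 * l + 1, value);
    return m;
}

Moments m2l(const Moments& B, const Moments& omega, int p) {
    Moments mu = makeMoments(p, {0.0, 0.0});
    for (int l = 0; l <= p; ++l)
        for (int m = -l; m <= l; ++m)
            for (int j = 0; j <= p; ++j)
                for (int k = -j; k <= j; ++k)
                    // |m+k| <= l+j always holds, so the index is valid
                    mu[l][m + l] += B[l + j][(m + k) + (l + j)] * omega[j][k + j];
    return mu;
}
```

With all operator and multipole entries set to one, every local moment sums to (p+1)^2 terms, which makes the O(p^4) cost of one transformation directly visible.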
3 Implementation
We focus our GPU parallelization efforts on the M2L operator, as it is the most time-consuming FMM far field operator (see Figure 3 above and Figure 12 in Kohnke et al., 2020b). In PBC, it requires 189 transformations per box, whereas both M2M and L2L, which translate the moments between different tree levels, require only a single transformation per octree box (except for the root box). Since these transformations are of the same complexity, M2L involves
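The count of 189 transformations per box follows from the octree geometry: the children of a box's parent and of its 26 neighboring parents form a 6×6×6 block of child boxes, from which the box's own 3×3×3 near neighborhood is excluded. A minimal host-side check (illustrative, not part of the implementation):

```cpp
#include <cassert>
#include <cstdlib>

// Count the M2L partners of one interior box under PBC: all children of
// the parent's 27 neighbor boxes (a 6x6x6 block of child boxes, ranged
// here for a child in the lower corner of its parent) minus the box's
// own 27 near neighbors.
int countM2LPartners() {
    int count = 0;
    for (int x = -2; x <= 3; ++x)
        for (int y = -2; y <= 3; ++y)
            for (int z = -2; z <= 3; ++z)
                if (std::abs(x) > 1 || std::abs(y) > 1 || std::abs(z) > 1)
                    ++count;
    return count; // 6^3 - 3^3 = 189
}
```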
3.1 CUDA implementation considerations
We will now briefly outline the CUDA programming model; see Nickolls et al. (2008) for details. A typical GPU consists of a few thousand cores that are grouped into larger units called multiprocessors. CUDA threads are organized in thread blocks, which in turn form a grid.
We will use the following abbreviations:
3.2 Sequential FMM and data structures
Our CUDA implementation is based on a C++11 version of the sequential ScaFaCos FMM (Arnold et al., 2013), which provides class templates that accept custom memory allocators. With CUDA Unified Memory (Knap and Czarnul, 2019), reusing the original data structures became feasible by supplying suitable C++ memory allocators.
To allow for an efficient manipulation of triangular shaped data (see Figures 1 and 5), we have implemented a dedicated
Let
where
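The triangular moment storage can be addressed with a simple linear index. A common scheme, shown below as an assumption for illustration (the implementation's exact layout may differ), stores moments with 0 ≤ m ≤ l ≤ p and recovers negative m via conjugate symmetry:

```cpp
#include <cassert>

// Linear index into a triangular array of moments omega_{l,m} with
// 0 <= m <= l <= p: row l starts at offset l*(l+1)/2.
int triIndex(int l, int m) { return l * (l + 1) / 2 + m; }

// Number of stored moments for multipole order p.
int numMoments(int p) { return (p + 1) * (p + 2) / 2; }
```

For example, order p = 8 requires 45 complex moments per box in this layout.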
Listing 2 shows the sequential form of the M2L transformation. The first four for-loops (lines 3–9) traverse the octree as shown in Listing 1.
where
Loops for traversing an octree in 3D space (pseudocode).
The loops that start the M2L operators in the octree traverse the whole tree, compute the M2L interaction set and launch one M2L translation for each interaction in the computed set (pseudocode).

2D representation of the operator set
Basic implementation of the M2L operator (pseudocode).
3.3 Three CUDA parallelization approaches
The previous section described the basic sequential FMM. We will now present three different parallelization approaches. Approach (1) is conceptually straightforward, yet it achieves decent speedups compared to a sequential CPU implementation with only minor parallelization work; it directly maps for-loops to CUDA threads, leaving the sequential program structure nearly unmodified. Approach (2) performs well for high accuracy demands (high multipole orders
3.3.1 Naïve parallelization approach (1)
The complete M2L operation in 3D requires 11 loops as shown in Listing 2. Listing 4 shows the comparison of the FMM loop structure and its naïve CUDA parallelization counterpart. Since CUDA provides a 3-component vector
of the complete dot product
This reduces the number of atomic writes by a factor of
Direct mapping of FMM octree and M2L loops (top part, lines 1–20) to CUDA threads (bottom part, lines 22–43) (pseudocode).
The naïve strategy allows for rapid FMM parallelization. Replacing the existing serial FMM loops with the corresponding CUDA index calculations leads to speedups that make the FMM applicable to moderate problem sizes, and no additional data structures or code modifications are required. However, the achieved bandwidth and parallelization efficiency are still far from optimal on the tested hardware.
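The integer overhead of such a direct loop-to-thread mapping can be illustrated on the host: every thread must decode its flat id into multi-dimensional indices via repeated division and modulo. The names and the triangular layout below are illustrative assumptions, not the kernel's exact code.

```cpp
#include <cassert>

// Decode a flat thread id into (box, l, m) for triangular moments
// truncated at order p, as a one-thread-per-moment mapping would.
// The div/mod chain is the integer overhead discussed in the text.
struct MomentIndex { int box, l, m; };

MomentIndex decodeThreadId(long tid, int p) {
    const int momentsPerBox = (p + 1) * (p + 2) / 2;
    MomentIndex idx;
    idx.box = static_cast<int>(tid / momentsPerBox);
    int t = static_cast<int>(tid % momentsPerBox);
    int l = 0; // invert t = l*(l+1)/2 + m by scanning rows
    while ((l + 1) * (l + 2) / 2 <= t)
        ++l;
    idx.l = l;
    idx.m = t - l * (l + 1) / 2;
    return idx;
}
```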
3.3.2 CUDA dynamic parallelism approach (2)
A substantial performance issue of the naïve approach is the integer index arithmetic, which introduces a significant overhead even for large
The determination of
Parent kernel in the dynamic approach. It determines valid

Dynamic M2L scheme. The CPU computes a
Listing 6 shows the child kernel computations, which can be divided into two parts. In the first part,
To decrease the number of global memory accesses,
In our implementation, the direct neighbor operator is given the size
In the second part, the
Child kernel in the dynamic approach (pseudocode).
Reduction function. It computes new target indices

Thread splitting scheme to minimize
3.3.3 Presorted list-based approach with symmetric operators (3)
In this approach, the FMM interaction pattern is precomputed for higher efficiency. Additionally, operator symmetries are exploited to reduce both the number of complex multiplications as well as global memory access.
3.3.4 Octree interactions precomputation
The pattern of interactions between the octree boxes is static, hence it can be precomputed and stored. This step does not need to be performance-optimal, as it is done only once at the start of a simulation that typically spans millions of time steps. In a PBC octree configuration, for each
describes all valid M2L transformations of the
The precomputed interaction lists
Figure 9 illustrates that for each position of
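The one-time interaction-list precomputation can be sketched in host-side C++: on a periodic level of 2^d boxes per dimension, the list of each box contains all children of its parent's 27 neighbors that are not among its own 27 near neighbors, with offset ranges depending on the box's position within its parent. This is an illustrative sketch, not the paper's data layout.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Build the M2L interaction list of every box on one periodic tree level.
std::vector<std::vector<int>> buildM2LLists(int depth) {
    const int n = 1 << depth; // boxes per dimension
    auto wrap = [n](int a) { return ((a % n) + n) % n; }; // periodic wrap
    auto id = [&](int x, int y, int z) {
        return (wrap(x) * n + wrap(y)) * n + wrap(z);
    };
    std::vector<std::vector<int>> lists(n * n * n);
    for (int x = 0; x < n; ++x)
        for (int y = 0; y < n; ++y)
            for (int z = 0; z < n; ++z) {
                const int cx = x & 1, cy = y & 1, cz = z & 1; // child parity
                std::vector<int>& list = lists[id(x, y, z)];
                // children of the parent's neighborhood span offsets
                // [-2-c, 3-c] per axis for child parity c
                for (int dx = -2 - cx; dx <= 3 - cx; ++dx)
                    for (int dy = -2 - cy; dy <= 3 - cy; ++dy)
                        for (int dz = -2 - cz; dz <= 3 - cz; ++dz)
                            if (std::abs(dx) > 1 || std::abs(dy) > 1 ||
                                std::abs(dz) > 1)
                                list.push_back(id(x + dx, y + dy, z + dz));
            }
    return lists;
}
```

Under PBC every box is interior, so each list holds exactly 189 partner boxes.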

Different operator groups. The groups are represented by arrows of distinct color, depend on the position of

Parallelization of M2L operations for the operator groups shown in Figure 9. Each single operator is processed in parallel for all boxes on a level by starting one CUDA block for each
A further improvement of the kernel is gained by rearranging the moments in memory. Threads
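The idea of the rearrangement can be sketched as a structure-of-arrays transposition: storing moment k of all boxes contiguously lets consecutive threads, which process consecutive boxes, read consecutive addresses, i.e. coalesced. The concrete layout below is an assumption for illustration.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Transpose per-box moment arrays into a moment-major ("SoA") layout:
// entry (box b, moment k) moves to position k * numBoxes + b.
std::vector<std::complex<float>> toSoA(
    const std::vector<std::vector<std::complex<float>>>& perBox) {
    const size_t numBoxes = perBox.size();
    const size_t numMoments = perBox[0].size();
    std::vector<std::complex<float>> soa(numBoxes * numMoments);
    for (size_t b = 0; b < numBoxes; ++b)
        for (size_t k = 0; k < numMoments; ++k)
            soa[k * numBoxes + b] = perBox[b][k];
    return soa;
}
```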
3.3.5 Operator symmetry
The symmetry of associated Legendre polynomials
emerges directly from their definition (Eqs. 5–6). It allows reducing the size of the operator set
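As a spot check of the associated Legendre symmetry, P_l^{-m}(x) = (-1)^m (l-m)!/(l+m)! P_l^m(x), the case l = 1, m = 1 can be verified numerically with hand-coded polynomials (Condon-Shortley convention; this convention choice is an assumption for illustration):

```cpp
#include <cassert>
#include <cmath>

// P_1^1(x) = -sqrt(1-x^2) and P_1^{-1}(x) = sqrt(1-x^2)/2.
double P_1_plus1(double x) { return -std::sqrt(1.0 - x * x); }
double P_1_minus1(double x) { return 0.5 * std::sqrt(1.0 - x * x); }

// Check P_1^{-1}(x) == (-1)^1 * (0!/2!) * P_1^1(x).
bool symmetryHolds(double x) {
    const double factor = -1.0 * (1.0 / 2.0);
    return std::abs(P_1_minus1(x) - factor * P_1_plus1(x)) < 1e-12;
}
```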

Reduction of the M2L operator set. The complete operator set
In 3D, the complete operator set spans a cube with the operators originating from its center to all
where
yields three operator symmetry groups containing orthogonal operators that differ only by their sign. Figure 12 shows the symmetry groups in

Grouping of the M2L operators according to their symmetry properties. The reduced operator set as shown in black in the upper left panel (a 2D version is shown in the right panel of Figure 11) is sorted into three groups (red, blue, green) depending on whether the operator is aligned with an axis (red), or within one of the
Depending on the relative position of
where
To make efficient usage of the operator symmetry, the lists
The corresponding lists
where
with
The

M2L operator symmetry exploitation. Left: Computation of four M2L operations with four orthogonal operators
The constant size of the lists, see Eq. (28), allows implementing the M2L kernel for the symmetry groups
Configuration and launches of the symmetric M2L kernel (pseudocode).
The symmetrical M2L kernel (pseudocode).
4 Benchmarks and discussion
We will now benchmark the performance and analyze the scaling behavior of the three different parallelization approaches described above.
4.1 General FMM scaling behavior
Figure 14 sketches the FMM scaling behavior with respect to the number of particles
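The qualitative behavior can be captured in a toy cost model: near-field work falls as N²/8^d while far-field work grows as 8^d, so the optimal tree depth balances the two and shifts upward with N. The constants a and b below are illustrative, not measured.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Return the depth minimizing a two-term FMM cost model:
// cost(d) = a * N^2 / 8^d  (near field)  +  b * 8^d  (far field).
int optimalDepth(double numParticles, double a, double b, int maxDepth) {
    int bestDepth = 0;
    double bestCost = std::numeric_limits<double>::max();
    for (int d = 0; d <= maxDepth; ++d) {
        const double boxes = std::pow(8.0, d);
        const double cost =
            a * numParticles * numParticles / boxes + b * boxes;
        if (cost < bestCost) { bestCost = cost; bestDepth = d; }
    }
    return bestDepth;
}
```

With equal weights, a thousandfold increase in particle number raises the optimal depth from 3 to 7, mirroring the depth-dependent minima sketched in Figure 14.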

Qualitative sketch of the FMM scaling behavior. The optimal linear scaling (black dashes) with particle number
4.2 Benchmarking procedure
All performance tests were executed on a workstation with an Intel Xeon CPU E5-1620@3.60 GHz with 16 GB physical memory and a Pascal NVIDIA GeForce GTX 1080 Ti with 3584 CUDA Cores. This GPU has a theoretical single precision peak performance of 11.6 TFLOPS and maximal bandwidth of 484 GB/s. The device code was compiled with NVCC 9.1. All kernel timings were measured with the help of
In our performance comparisons we focus on
4.3 Microbenchmarking
To evaluate the different parallelization approaches in context of the underlying hardware, we estimated the GPU performance bounds for the M2L transformation operation. To this aim, we implemented two benchmarking microkernels, which execute exactly the number of arithmetic operations and memory accesses as the M2L operation does. However, additional possible performance bottlenecks (Nickolls et al., 2008) like
Figure 15 shows the absolute runtimes of the microkernels. To get the maximal theoretical throughput of the M2L kernel, we assumed the execution of three global memory accesses, eight bytes each, to perform one complex multiplication and one global addition, i.e.
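The memory-bound character of this assumption can be made explicit with a back-of-the-envelope arithmetic-intensity estimate. Counting the textbook 6 flops for a complex multiplication and 2 for a complex addition (these flop counts are an assumption for illustration) against three 8-byte global accesses:

```cpp
#include <cassert>

// Arithmetic intensity of the assumed M2L inner step: 8 flops per
// 24 bytes of global traffic, i.e. about 0.33 flops/byte -- far below
// the GTX 1080 Ti balance point of 11.6 TFLOPS / 484 GB/s ~ 24 flops
// per byte, hence memory bound.
double m2lArithmeticIntensity() {
    const double flops = 6.0 + 2.0; // complex mul + complex add
    const double bytes = 3.0 * 8.0; // three 8-byte global accesses
    return flops / bytes;
}
```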

Runtime of the memory-bound microkernel (orange) and of the compute-bound microkernel (green) for two tree depths
4.4 Performance comparison
We will now discuss the efficiency of the three proposed parallelization approaches.
4.4.1 Naïve kernel
Figure 16 shows the absolute executions times of the different kernels for

Runtime comparison of the three different M2L implementations. For each multipole order,

Performance analysis of the naïve M2L kernel. (a) Instructions distribution. (b) Memory and compute utilization.

M2L kernel performance comparison by two different metrics. Top: FLOPS achieved by different kernels. Bottom: Ratio of global and shared memory accesses and floating point operations for the symmetric kernel with respect to the non-symmetric approach.
The maximum number of achieved FLOPS (
Additionally, the achieved occupancy per Streaming Multiprocessor (SM) is 46% of the maximum of 50% possible with this kernel configuration, a limit caused by the 64 registers used per thread in the kernel.
4.4.2 Dynamic kernel
This kernel utilizes Dynamic Parallelism to minimize the index overhead computation introduced by the naïve kernel. The sizes of the child kernels
Figure 19a shows the relative costs emerging from launching the child kernels, which become irrelevant only for large

Performance analysis of the dynamic M2L kernel. (a) Kernel launching latencies. (b) Instructions distribution.
As

Performance analysis of the dynamic M2L kernel. (a) Shared memory throughput. (b) Shared memory and compute utilization.
The overall performance of this kernel is considerably higher than that of the naïve kernel for
4.4.3 Symmetric kernel
The symmetric M2L kernel is composed of four subkernels started asynchronously for each symmetry group

Speedups due to symmetry properties for different symmetry groups

Performance analysis of the symmetric M2L kernel. (a) Hardware utilization. (b) Instructions distribution.
The register usage of the symmetric subkernels varies between 48% and 71% (not shown). Given the kernel configuration, the theoretically possible average SM occupancy for all subkernels is 57%; the kernels achieve 50% and are mainly limited by register usage. Further occupancy optimization is unlikely to increase performance markedly, as kernels with large register usage do not require optimal occupancy (Volkov, 2010).
As Figure 18 (top) shows, the FLOP rate of the symmetric kernel is much higher in the range

Absolute performance ratio of the symmetric M2L kernel with respect to the compute-bound reference microkernel with
5 Conclusions
Here, we have presented three different CUDA parallelization approaches for the Multipole-to-Local operator, which limits the overall FMM performance. The first approach preserves the sequential loop structure and does not require any special data structures; it makes use of CUDA Unified Memory to achieve decent speedups compared to a sequential CPU implementation. It is useful, e.g., for rapid prototyping or for simulation systems with small to moderate numbers of particles. However, it comes with a large computational overhead due to additional integer operations.
The second approach, which exploits CUDA Dynamic Parallelism, avoids this drawback and achieves very good performance at high accuracy demands, i.e. for large multipole orders. Its main drawback is a lack of performance at low multipole orders, for which the first scheme performs better.
The third approach uses abstractions of the underlying octree and interaction patterns to allow for enhanced, efficient GPU utilization and it exploits the symmetries of the Multipole-to-Local operator. As a result, it scales perfectly with growing multipole order
Our FMM implementation has been optimized for biomolecular simulations and has been incorporated into GROMACS as an alternative to the established PME Coulomb solver (Kohnke et al., 2020a). We anticipate that, thanks to the inherently parallel structure of the FMM, future multi-node multi-GPU implementations will eventually overcome the PME scaling bottlenecks (Board et al., 1999; Kutzner et al., 2007, 2014).
