Abstract
1 Introduction
The fast multipole method (FMM) was introduced by Greengard and Rokhlin (1987) to efficiently evaluate pairwise Coulombic or gravitational interactions in many-body systems, which arise in fields as diverse as biomolecular simulation (Dror et al., 2012; Hansson et al., 2002), astronomy (Arnold et al., 2013; Potter et al., 2017), and plasma physics (Dawson, 1983). Moreover, the FMM can improve iterative solvers for integral equations by speeding up the underlying matrix-vector products (Engheta et al., 1992; Gumerov and Duraiswami, 2006).
The originally proposed FMM uses a spherical harmonics representation of the inverse distance
In atomistic molecular dynamics (MD) simulations, Newton’s equations of motion are solved for a system of N interacting atoms.
The electrostatic contribution to the inter-atomic forces is governed by Coulomb’s law,

\mathbf{F}_i = \frac{q_i}{4\pi\varepsilon_0} \sum_{j \neq i} q_j \frac{\mathbf{r}_i - \mathbf{r}_j}{|\mathbf{r}_i - \mathbf{r}_j|^3},

where q_i and \mathbf{r}_i denote the partial charge and position of atom i.
To this aim, several FMM implementations have been developed. A standard FMM was included as an electrostatic solver in the NAMD package (Board et al., 1992; Nelson et al., 1996). Ding et al. (1992a) proposed the Cell Multipole Method (CMM) to simulate polymer systems of up to 1.2 million atoms. In further work, they combined the CMM with the Ewald method, showing a considerable speedup with respect to a pure Ewald treatment (Ding et al., 1992b). Niedermeier and Tavan (1994) introduced a structure-adapted multipole method, which Eichinger et al. (1997) later combined with a multiple-time-step algorithm. Andoh et al. (2013) developed MODYLAS, an FMM adaptation for very large MD systems, and benchmarked it on 65,536 nodes of the K computer; very recently, it was extended to support rectangular boxes (Andoh et al., 2020). Yoshii et al. (2020) developed an FMM for MD systems with two-dimensional periodic boundary conditions. Shamshirgar et al. (2019) implemented a regularization method for improved FMM energy conservation, and Gnedin (2019) combined fast Fourier transforms (FFTs) with the FMM for improved performance.
Considering efficient parallelization approaches, Gumerov and Duraiswami (2008) pioneered GPU implementations of the spherical harmonics FMM with rotation operators; depending on the required accuracy, they achieved speedups of 30–70 with respect to a single CPU. Different GPU parallelization schemes for the “black-box” FMM (Fong and Darve, 2009) were implemented by Takahashi et al. (2012). Yokota et al. (2009) parallelized ExaFmm on a GPU cluster with 32 GPUs, achieving parallel efficiencies of 44% and 66% for
However, the early FMM adoptions in MD simulation codes were mostly superseded by particle mesh Ewald (PME) (Essmann et al., 1995) due to its higher single-node performance; as a result, PME currently dominates the field. PME is based on the FFT, which inherently provides the PBC solution. Nevertheless, PME suffers from a scaling bottleneck when parallelized over many nodes, as the underlying FFTs require all-to-all communication (Board et al., 1999; Kutzner et al., 2007, 2014). In addition, large systems with nonuniform particle distributions become memory intensive, since PME evaluates the forces on a uniform mesh across the whole computational domain.
In the era of ever-increasing parallelism and exascale computers, it is time to revisit the FMM, which does not suffer from the above-mentioned limitations. To this end, we implemented and benchmarked a fully CUDA-parallel single-node FMM. Our implementation is tailored to MD simulations, i.e. it targets millisecond-order runtimes for a single MD step by carefully GPU-parallelizing all FMM stages and by optimizing their flow to hide possible latencies. It was also meticulously integrated into the GROMACS package to avoid additional FMM-independent performance bottlenecks. Here, we present three different parallelization approaches. The implementation is based on the ScaFaCos FMM (Arnold et al., 2013), which utilizes spherical harmonics to describe the
Here, we focus on the CUDA parallelization of the Multipole-to-Local (M2L) operator, which limits the overall FMM far field performance the most. An overview of the parallelized FMM, including all stages and complete runtimes, can be found in Kohnke et al. (2020b).
2 The fast multipole method
We consider a system of N particles with charges q_i at positions \mathbf{r}_i. The Coulomb potential at particle i reads

\Phi(\mathbf{r}_i) = \sum_{j \neq i} \frac{q_j}{|\mathbf{r}_i - \mathbf{r}_j|},

where the sum runs over all other particles and, under PBC, over their periodic images.
2.1 Mathematical foundations
Expansion of the inverse distance between arbitrary particles
where
are spherical harmonics and
where
The normalized associated Legendre polynomials form an orthonormal set of basis functions on the surface of a sphere. The
and chargeless local moments
respectively. The moments weighted with corresponding charges
and charged local moments
This allows evaluating the potential at arbitrary particles at
This calculation is referred to as far field. To achieve convergence in Eq. (3),
has to be fulfilled for all distinct index pairs
of particles
where
where
where
2.2 Algorithm
Applying the operators defined in the previous section requires truncation of Eq. (3) to a finite multipole order
and

Indexing of triangular shaped matrices. The used indexing scheme is based on standard matrix index notation: The first subscript is a row number, the second one is a column number, which can be negative. In case of

2D example of FMM tree depth and resulting box subdivision for depths

The six different stages of the FMM with an exemplary execution time distribution at the center. The near field part (P2P, top right corner) can be executed concurrently with the far field (stages 1–5) in a parallel implementation. Green squares indicate the representation by multipoles, light brown squares a representation by local moments, blue squares indicate direct summation.
2.3 The Multipole-to-Local (M2L) transformation
We will now explain the M2L transformation and its execution hierarchy. Let
Let
Figure 5 shows one

2D interaction set

One M2L transformation. The matrix-vector like multiplication requires a part of the
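In a simplified, hedged form, the matrix-vector-like structure of one M2L transformation can be sketched in host-side C++: each local moment accumulates all multipole moments weighted by operator entries. The index handling and operator definition here are illustrative assumptions, not the paper's exact formulation.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Schematic M2L: mu_{l,m} = sum_{j,k} B_{l+j, m+k} * omega_{j,k}.
// Moments are stored for -l <= m <= l up to order p, the operator
// up to order 2p; index m+l maps m in [-l, l] to [0, 2l].
using Moments = std::vector<std::vector<std::complex<double>>>;

Moments makeMoments(int order, std::complex<double> value) {
    Moments m(order + 1);
    for (int l = 0; l <= order; ++l)
        m[l].assign(2 * l + 1, value);
    return m;
}

Moments m2l(const Moments& B, const Moments& omega, int p) {
    Moments mu = makeMoments(p, {0.0, 0.0});
    for (int l = 0; l <= p; ++l)
        for (int m = -l; m <= l; ++m)
            for (int j = 0; j <= p; ++j)
                for (int k = -j; k <= j; ++k)
                    // |m+k| <= l+j always holds, so the index is valid
                    mu[l][m + l] += B[l + j][(m + k) + (l + j)] * omega[j][k + j];
    return mu;
}
```

With all operator and multipole entries set to one, every local moment sums to (p+1)^2 terms, which makes the O(p^4) cost of one transformation directly visible.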
3 Implementation
We focus our GPU parallelization efforts on the M2L operator, as it is the most time-consuming FMM far field operator (see Figure 3 above and Figure 12 in Kohnke et al., 2020b). In PBC, it requires 189 transformations per box, whereas both M2M and L2L, which translate the moments between different tree levels, require only a single transformation per octree box (except for the root box). Since these transformations are of the same complexity, M2L involves
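The count of 189 transformations per box follows from the octree geometry: the children of a box's parent and of its 26 neighboring parents form a 6×6×6 block of child boxes, from which the box's own 3×3×3 near neighborhood is excluded. A minimal host-side check (illustrative, not part of the implementation):

```cpp
#include <cassert>
#include <cstdlib>

// Count the M2L partners of one interior box under PBC: all children of
// the parent's 27 neighbor boxes (a 6x6x6 block of child boxes, ranged
// here for a child in the lower corner of its parent) minus the box's
// own 27 near neighbors.
int countM2LPartners() {
    int count = 0;
    for (int x = -2; x <= 3; ++x)
        for (int y = -2; y <= 3; ++y)
            for (int z = -2; z <= 3; ++z)
                if (std::abs(x) > 1 || std::abs(y) > 1 || std::abs(z) > 1)
                    ++count;
    return count; // 6^3 - 3^3 = 189
}
```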
3.1 CUDA implementation considerations
We will now briefly outline the CUDA programming model; see Nickolls et al. (2008) for details. A typical GPU consists of a few thousand cores that are grouped into larger units called multiprocessors. CUDA threads are organized in thread blocks, which in turn form a grid.
We will use the following abbreviations:
3.2 Sequential FMM and data structures
Our CUDA implementation is based on a C++11 version of the sequential ScaFaCos FMM (Arnold et al., 2013), which provides class templates that accept custom memory allocators. With CUDA Unified Memory (Knap and Czarnul, 2019), reusing the original data structures became feasible by supplying suitable C++ memory allocators.
To allow for an efficient manipulation of triangular shaped data (see Figures 1 and 5), we have implemented a dedicated
Let
where
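The triangular moment storage can be addressed with a simple linear index. A common scheme, shown below as an assumption for illustration (the implementation's exact layout may differ), stores moments with 0 ≤ m ≤ l ≤ p and recovers negative m via conjugate symmetry:

```cpp
#include <cassert>

// Linear index into a triangular array of moments omega_{l,m} with
// 0 <= m <= l <= p: row l starts at offset l*(l+1)/2.
int triIndex(int l, int m) { return l * (l + 1) / 2 + m; }

// Number of stored moments for multipole order p.
int numMoments(int p) { return (p + 1) * (p + 2) / 2; }
```

For example, order p = 8 requires 45 complex moments per box in this layout.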
Listing 2 shows the sequential form of the M2L transformation. The first four for-loops (lines 3–9) traverse the octree as shown in Listing 1.
where
Loops for traversing an octree in 3D space (pseudocode).
The loops that start the M2L operators in the octree traverse the whole tree, compute the M2L interaction set and launch one M2L translation for each interaction in the computed set (pseudocode).

2D representation of the operator set
Basic implementation of the M2L operator (pseudocode).
3.3 Three CUDA parallelization approaches
The previous section described the basic sequential FMM. We will now present three different parallelization approaches. Approach (1) is conceptually straightforward, yet it achieves decent speedups compared to a sequential CPU implementation with only minor parallelization work; it directly maps for-loops to CUDA threads, leaving the sequential program structure nearly unmodified. Approach (2) performs well for high accuracy demands (high multipole orders
3.3.1 Naïve parallelization approach (1)
The complete M2L operation in 3D requires 11 loops as shown in Listing 2. Listing 4 shows the comparison of the FMM loop structure and its naïve CUDA parallelization counterpart. Since CUDA provides a 3-component vector
of the complete dot product
This reduces the number of atomic writes by a factor of
Direct mapping of FMM octree and M2L loops (top part, lines 1–20) to CUDA threads (bottom part, lines 22–43) (pseudocode).
The naïve strategy allows for rapid FMM parallelization. Replacing the existing serial FMM loops with the corresponding CUDA index calculations leads to speedups that make the FMM applicable to moderate problem sizes, and no additional data structures or code modifications are required. However, the achieved bandwidth and parallelization efficiency are still far from optimal on the tested hardware.
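The integer overhead of such a direct loop-to-thread mapping can be illustrated on the host: every thread must decode its flat id into multi-dimensional indices via repeated division and modulo. The names and the triangular layout below are illustrative assumptions, not the kernel's exact code.

```cpp
#include <cassert>

// Decode a flat thread id into (box, l, m) for triangular moments
// truncated at order p, as a one-thread-per-moment mapping would.
// The div/mod chain is the integer overhead discussed in the text.
struct MomentIndex { int box, l, m; };

MomentIndex decodeThreadId(long tid, int p) {
    const int momentsPerBox = (p + 1) * (p + 2) / 2;
    MomentIndex idx;
    idx.box = static_cast<int>(tid / momentsPerBox);
    int t = static_cast<int>(tid % momentsPerBox);
    int l = 0; // invert t = l*(l+1)/2 + m by scanning rows
    while ((l + 1) * (l + 2) / 2 <= t)
        ++l;
    idx.l = l;
    idx.m = t - l * (l + 1) / 2;
    return idx;
}
```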
3.3.2 CUDA dynamic parallelism approach (2)
A substantial performance issue of the naïve approach is the integer index arithmetic, which introduces a significant overhead even for large
The determination of
Parent kernel in the dynamic approach. It determines valid

Dynamic M2L scheme. The CPU computes a
Listing 6 shows the child kernel computations, which can be divided into two parts. In the first part,
To decrease the number of global memory accesses,
In our implementation, the direct neighbor operator is given the size
In the second part, the
Child kernel in the dynamic approach (pseudocode).
Reduction function. It computes new target indices

Thread splitting scheme to minimize
3.3.3 Presorted list-based approach with symmetric operators (3)
In this approach, the FMM interaction pattern is precomputed for higher efficiency. Additionally, operator symmetries are exploited to reduce both the number of complex multiplications as well as global memory access.
3.3.4 Octree interactions precomputation
The pattern of interactions between the octree boxes is static, hence it can be precomputed and stored. This step does not need to be performance-optimal, as it is done only once at the start of a simulation that typically spans millions of time steps. In a PBC octree configuration, for each
describes all valid M2L transformations of the
The precomputed interaction lists
Figure 9 illustrates that for each position of
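The one-time interaction-list precomputation can be sketched in host-side C++: on a periodic level of 2^d boxes per dimension, the list of each box contains all children of its parent's 27 neighbors that are not among its own 27 near neighbors, with offset ranges depending on the box's position within its parent. This is an illustrative sketch, not the paper's data layout.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Build the M2L interaction list of every box on one periodic tree level.
std::vector<std::vector<int>> buildM2LLists(int depth) {
    const int n = 1 << depth; // boxes per dimension
    auto wrap = [n](int a) { return ((a % n) + n) % n; }; // periodic wrap
    auto id = [&](int x, int y, int z) {
        return (wrap(x) * n + wrap(y)) * n + wrap(z);
    };
    std::vector<std::vector<int>> lists(n * n * n);
    for (int x = 0; x < n; ++x)
        for (int y = 0; y < n; ++y)
            for (int z = 0; z < n; ++z) {
                const int cx = x & 1, cy = y & 1, cz = z & 1; // child parity
                std::vector<int>& list = lists[id(x, y, z)];
                // children of the parent's neighborhood span offsets
                // [-2-c, 3-c] per axis for child parity c
                for (int dx = -2 - cx; dx <= 3 - cx; ++dx)
                    for (int dy = -2 - cy; dy <= 3 - cy; ++dy)
                        for (int dz = -2 - cz; dz <= 3 - cz; ++dz)
                            if (std::abs(dx) > 1 || std::abs(dy) > 1 ||
                                std::abs(dz) > 1)
                                list.push_back(id(x + dx, y + dy, z + dz));
            }
    return lists;
}
```

Under PBC every box is interior, so each list holds exactly 189 partner boxes.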

Different operator groups. The groups are represented by arrows of distinct color, depend on the position of

Parallelization of M2L operations for the operator groups shown in Figure 9. Each single operator is processed in parallel for all boxes on a level by starting one CUDA block for each
A further improvement of the kernel is gained by rearranging the moments in memory. Threads
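The idea of the rearrangement can be sketched as a structure-of-arrays transposition: storing moment k of all boxes contiguously lets consecutive threads, which process consecutive boxes, read consecutive addresses, i.e. coalesced. The concrete layout below is an assumption for illustration.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Transpose per-box moment arrays into a moment-major ("SoA") layout:
// entry (box b, moment k) moves to position k * numBoxes + b.
std::vector<std::complex<float>> toSoA(
    const std::vector<std::vector<std::complex<float>>>& perBox) {
    const size_t numBoxes = perBox.size();
    const size_t numMoments = perBox[0].size();
    std::vector<std::complex<float>> soa(numBoxes * numMoments);
    for (size_t b = 0; b < numBoxes; ++b)
        for (size_t k = 0; k < numMoments; ++k)
            soa[k * numBoxes + b] = perBox[b][k];
    return soa;
}
```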
3.3.5 Operator symmetry
The symmetry of associated Legendre polynomials
emerges directly from their definition (Eqs. 5–6). It allows reducing the size of the operator set
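As a spot check of the associated Legendre symmetry, P_l^{-m}(x) = (-1)^m (l-m)!/(l+m)! P_l^m(x), the case l = 1, m = 1 can be verified numerically with hand-coded polynomials (Condon-Shortley convention; this convention choice is an assumption for illustration):

```cpp
#include <cassert>
#include <cmath>

// P_1^1(x) = -sqrt(1-x^2) and P_1^{-1}(x) = sqrt(1-x^2)/2.
double P_1_plus1(double x) { return -std::sqrt(1.0 - x * x); }
double P_1_minus1(double x) { return 0.5 * std::sqrt(1.0 - x * x); }

// Check P_1^{-1}(x) == (-1)^1 * (0!/2!) * P_1^1(x).
bool symmetryHolds(double x) {
    const double factor = -1.0 * (1.0 / 2.0);
    return std::abs(P_1_minus1(x) - factor * P_1_plus1(x)) < 1e-12;
}
```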

Reduction of the M2L operator set. The complete operator set
In 3D, the complete operator set spans a cube with the operators originating from its center to all
where
yields three operator symmetry groups containing orthogonal operators that differ only by their sign. Figure 12 shows the symmetry groups in

Grouping of the M2L operators according to their symmetry properties. The reduced operator set as shown in black in the upper left panel (a 2D version is shown in the right panel of Figure 11) is sorted into three groups (red, blue, green) depending on whether the operator is aligned with an axis (red), or within one of the
Depending on the relative position of
where
To make efficient usage of the operator symmetry, the lists
The corresponding lists
where
with
The

M2L operator symmetry exploitation. Left: Computation of four M2L operations with four orthogonal operators
The constant size of the lists, see Eq. (28), allows implementing the M2L kernel for the symmetry groups
Configuration and launches of the symmetric M2L kernel (pseudocode).
The symmetrical M2L kernel (pseudocode).
4 Benchmarks and discussion
We will now benchmark the performance and analyze the scaling behavior of the three different parallelization approaches described above.
4.1 General FMM scaling behavior
Figure 14 sketches the FMM scaling behavior with respect to the number of particles
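The qualitative behavior can be captured in a toy cost model: near-field work falls as N²/8^d while far-field work grows as 8^d, so the optimal tree depth balances the two and shifts upward with N. The constants a and b below are illustrative, not measured.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Return the depth minimizing a two-term FMM cost model:
// cost(d) = a * N^2 / 8^d  (near field)  +  b * 8^d  (far field).
int optimalDepth(double numParticles, double a, double b, int maxDepth) {
    int bestDepth = 0;
    double bestCost = std::numeric_limits<double>::max();
    for (int d = 0; d <= maxDepth; ++d) {
        const double boxes = std::pow(8.0, d);
        const double cost =
            a * numParticles * numParticles / boxes + b * boxes;
        if (cost < bestCost) { bestCost = cost; bestDepth = d; }
    }
    return bestDepth;
}
```

With equal weights, a thousandfold increase in particle number raises the optimal depth from 3 to 7, mirroring the depth-dependent minima sketched in Figure 14.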

Qualitative sketch of the FMM scaling behavior. The optimal linear scaling (black dashes) with particle number
4.2 Benchmarking procedure
All performance tests were executed on a workstation with an Intel Xeon CPU E5-1620@3.60 GHz with 16 GB physical memory and a Pascal NVIDIA GeForce GTX 1080 Ti with 3584 CUDA Cores. This GPU has a theoretical single precision peak performance of 11.6 TFLOPS and maximal bandwidth of 484 GB/s. The device code was compiled with NVCC 9.1. All kernel timings were measured with the help of
In our performance comparisons we focus on
4.3 Microbenchmarking
To evaluate the different parallelization approaches in context of the underlying hardware, we estimated the GPU performance bounds for the M2L transformation operation. To this aim, we implemented two benchmarking microkernels, which execute exactly the number of arithmetic operations and memory accesses as the M2L operation does. However, additional possible performance bottlenecks (Nickolls et al., 2008) like
Figure 15 shows the absolute runtimes of the microkernels. To get the maximal theoretical throughput of the M2L kernel, we assumed the execution of three global memory accesses, eight bytes each, to perform one complex multiplication and one global addition, i.e.
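The memory-bound character of this assumption can be made explicit with a back-of-the-envelope arithmetic-intensity estimate. Counting the textbook 6 flops for a complex multiplication and 2 for a complex addition (these flop counts are an assumption for illustration) against three 8-byte global accesses:

```cpp
#include <cassert>

// Arithmetic intensity of the assumed M2L inner step: 8 flops per
// 24 bytes of global traffic, i.e. about 0.33 flops/byte -- far below
// the GTX 1080 Ti balance point of 11.6 TFLOPS / 484 GB/s ~ 24 flops
// per byte, hence memory bound.
double m2lArithmeticIntensity() {
    const double flops = 6.0 + 2.0; // complex mul + complex add
    const double bytes = 3.0 * 8.0; // three 8-byte global accesses
    return flops / bytes;
}
```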

Runtime of the memory-bound microkernel (orange) and of the compute-bound microkernel (green) for two tree depths
4.4 Performance comparison
We will now discuss the efficiency of the three proposed parallelization approaches.
4.4.1 Naïve kernel
Figure 16 shows the absolute executions times of the different kernels for

Runtime comparison of the three different M2L implementations. For each multipole order,

Performance analysis of the naïve M2L kernel. (a) Instructions distribution. (b) Memory and compute utilization.

M2L kernel performance comparison by two different metrics. Top: FLOPS achieved by different kernels. Bottom: Ratio of global and shared memory accesses and floating point operations for the symmetric kernel with respect to the non-symmetric approach.
The maximum number of achieved FLOPS (
Additionally, the achieved occupancy per Streaming Multiprocessor (SM) is 46% of the maximum of 50% possible with this kernel configuration, a limit caused by the 64 registers used per thread in the kernel.
4.4.2 Dynamic kernel
This kernel utilizes Dynamic Parallelism to minimize the index overhead computation introduced by the naïve kernel. The sizes of the child kernels
Figure 19a shows the relative costs emerging from launching the child kernels, which become irrelevant only for large

Performance analysis of the dynamic M2L kernel. (a) Kernel launching latencies. (b) Instructions distribution.
As

Performance analysis of the dynamic M2L kernel. (a) Shared memory throughput. (b) Shared memory and compute utilization.
The overall performance of this kernel is considerably higher than that of the naïve kernel for
4.4.3 Symmetric kernel
The symmetric M2L kernel is composed of four subkernels started asynchronously for each symmetry group

Speedups due to symmetry properties for different symmetry groups

Performance analysis of the symmetric M2L kernel. (a) Hardware utilization. (b) Instructions distribution.
The register usage of the symmetric subkernels varies between 48% and 71% (not shown). Given the kernel configuration, the theoretically possible average SM occupancy for all subkernels is 57%; the kernels achieve 50% and are mainly limited by register usage. Further occupancy optimization is unlikely to increase performance markedly, as kernels with large register usage do not require optimal occupancy (Volkov, 2010).
As Figure 18 (top) shows, the FLOP rate of the symmetric kernel is much higher in the range

Absolute performance ratio of the symmetric M2L kernel with respect to the compute-bound reference microkernel with
5 Conclusions
Here, we have presented three different CUDA parallelization approaches for the Multipole-to-Local operator, which limits the overall FMM performance. The first approach preserves the sequential loop structure and does not require any special data structures; it makes use of CUDA Unified Memory to achieve decent speedups compared to a sequential CPU implementation. It is useful, e.g., for rapid prototyping or for simulation systems with small to moderate numbers of particles. However, it comes with a large computational overhead due to additional integer operations.
The second approach, which exploits CUDA Dynamic Parallelism, avoids this drawback and achieves very good performance at high accuracy demands, i.e. for large multipole orders. Its main drawback is a lack of performance at low multipole orders, for which the first scheme performs better.
The third approach uses abstractions of the underlying octree and interaction patterns to allow for enhanced, efficient GPU utilization and it exploits the symmetries of the Multipole-to-Local operator. As a result, it scales perfectly with growing multipole order
Our FMM implementation has been optimized for biomolecular simulations and has been incorporated into GROMACS as an alternative to the established PME Coulomb solver (Kohnke et al., 2020a). We anticipate that, thanks to the inherently parallel structure of the FMM, future multi-node multi-GPU implementations will eventually overcome the PME scaling bottlenecks (Board et al., 1999; Kutzner et al., 2007, 2014).
