1 Introduction
The realization of efficient solution procedures for partial differential equations (PDEs) using finite element methods on modern computer systems requires the combination of diverse skills across mathematics, programming languages and high-performance computing. Automated code generation is a promising approach to managing this complexity, and it has been increasingly adopted in software systems and libraries. Recent successful examples include FEniCS (Logg et al., 2012), Firedrake (Rathgeber et al., 2016) and FreeFem++ (Hecht, 2012). These software packages provide users with high-level interfaces for high productivity while relying on optimizations and transformations in the code generation pipeline to generate efficient low-level code. The challenge, as in all compilers, is to choose abstraction layers that enable the optimizations needed to achieve high performance on a broad set of programs and machines.
One particular challenge for generating high-performance code on modern hardware is vectorization. Modern CPUs increasingly rely on SIMD instructions to achieve higher throughput and better energy efficiency. Finite element computation requires the assembly of vectors and matrices which represent differential forms on discretized function spaces. This process consists of applying a local function, often called an element kernel, to the data associated with each entity of the mesh, and accumulating the results into global data structures.
Matrix-free methods avoid building large sparse matrices in applications of the finite element method and thus trade computation for storage. They have become popular for use on modern hardware due to their higher arithmetic intensity (defined as the number of floating-point operations per byte of data transfer). Vectorization is particularly important for computationally intensive high order methods, for which matrix-free methods are often applied. Previous work on improving vectorization of matrix-free operator application, or equivalently, residual evaluation, mostly focuses on exposing library interfaces to the users. Kronbichler and Kormann (2017) first perform a change of basis from nodal points to quadrature points, and provide overloaded SIMD types for users to write a quadrature-point-wise expression for residual evaluation. However, since the transformation is done manually, new operators require manual reimplementation. Knepley and Terrel (2013) also transpose to quadrature-point basis but target GPUs instead. Both works vectorize by grouping elements into batches, either to match the SIMD vector length in CPUs or the shared memory capacity on GPUs. In contrast, Müthing et al. (2017) apply an intra-kernel vectorization strategy and exploit the fact that, in 3D, evaluating both a scalar field and its three derivatives fills the four lanes of an AVX2 vector register (assuming the computation is in double precision). More recently, Kempf et al. (2018) target high order Discontinuous Galerkin (DG) methods on hexahedral meshes using automated code generation to search for vectorization strategies, while taking advantage of the specific memory layout of the data.
In this work, we present a generic and portable solution based on cross-element vectorization. Our vectorization strategy, implemented in Firedrake, is similar to that of Kronbichler and Kormann (2017) but is fully automated through code generation like that of Kempf et al. (2018). We extend the scope of code generation in Firedrake to incorporate the outer iteration over mesh entities and leverage Loopy (Klöckner, 2014), a loop code generator based loosely on the polyhedral model, to systematically apply a sequence of transformations which promote vectorization by grouping mesh entities into batches so that each SIMD lane operates on one entity independently. This automated code generation mechanism enables us to explore the effectiveness of our techniques on operators spanning a wide range of complexity and systematically evaluate our methodology. Compared with an intra-kernel vectorization strategy, this approach is conceptually well-defined, more portable, and produces more predictable performance. Our experimental evaluation demonstrates that the approach consistently achieves a high fraction of hardware peak performance while being fully transparent to end users.
The contributions of this work are as follows. We present the design of a code transformation pipeline that permits the generation of high-performance, vectorized code for a broad class of finite element models. We demonstrate the feasibility of the proposed pipeline by implementing it in the Firedrake finite element framework. We provide a thorough evaluation of our code generation strategy and demonstrate that it achieves a substantial fraction of theoretical peak performance across a broad range of test cases and popular C compilers.
The rest of this paper is arranged as follows. After reviewing the preliminaries of code generation for the finite element method in Section 2, we describe our implementation of cross-element vectorization in Firedrake in Section 3. In Section 4, we demonstrate the effectiveness of our approach with experimental results. Finally, we review our contributions and identify future research priorities in Section 5.
2 Preliminaries
The computation of multilinear forms using the basis functions spanning the discretized function spaces is called assembly. The general structure of a linear form is a sum of integrals over the cells and facets of the mesh: each mesh entity contributes a local tensor, computed by quadrature from the basis functions and coefficients restricted to that entity, and these local contributions are accumulated into the global residual vector.
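For the Helmholtz operator used as the running example in the listings below, the residual form and its entity-local evaluation by quadrature take the following shape (an illustrative instance of the general structure just described):

```latex
% Helmholtz residual form and its cell-local quadrature-based evaluation
L(v) = \int_\Omega \nabla u \cdot \nabla v + u\,v \;\mathrm{d}x,
\qquad
A_i = \sum_{q} w_q \left( \nabla\phi_i(x_q) \cdot \nabla u_h(x_q)
      + \phi_i(x_q)\, u_h(x_q) \right) \lvert \det J \rvert ,
```

where \(\phi_i\) are the basis functions of the test space, \(x_q\) and \(w_q\) the quadrature points and weights on the reference element, and \(J\) the Jacobian of the coordinate map to the current cell.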
2.1 Local assembly
Local assembly of linear forms is the evaluation of the integrals as defined by the weak form of the differential equation on each entity (cell or facet) of the mesh. In Firedrake, users define the problem in UFL, the Unified Form Language; the form compiler TSFC then translates the variational form into local assembly kernels.
Listing 1 shows the UFL syntax to assemble the linear form of the Helmholtz operator; Listing 2 shows the local assembly kernel that TSFC generates for it. The kernel takes three array arguments in this case: the output tensor holding the local contribution, the coordinates of the element vertices, and the coefficient data. The first part of the kernel (lines 8–20) computes the inverse and the determinant of the Jacobian for the coordinate transformation from the reference element to the current element. This is required to pull derivatives back to the reference element. The constant arrays hold the basis functions and their derivatives tabulated at the quadrature points of the reference element.

TSFC performs a sequence of optimization passes by rewriting the tensor operations following mathematical rules before generating the loop nests (see Homolya et al. (2017) for details). This is more powerful than the loop-level optimizations a C compiler can apply, since TSFC operates directly on the tensor algebra. After the optimization and scheduling stage, the translation of tensor algebra to C in TSFC is a straightforward rewrite of tensor operations to loop nests. This process results in certain artifacts that are shown in Listing 2: for example, since there is no specific representation for subtractions in TSFC, negations are emitted as multiplication by −1.

Listing 1. Assembling the linear form of the Helmholtz operator in UFL.
Listing 2. Local assembly kernel for the Helmholtz operator of Listing 1 in C generated by TSFC.
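As a sketch of the structure just described (this is not the actual TSFC output: the kernel name, the one-point quadrature rule and the simple P1 mass-like integrand are illustrative assumptions), a local assembly kernel first computes the Jacobian of the coordinate map and then accumulates quadrature contributions into the output tensor:

```c
#include <math.h>
#include <assert.h>

/* Hypothetical, heavily simplified local assembly kernel in the style
 * of TSFC output: the action of a P1 mass form on one triangle.
 *   A      : output, local contribution (3 dofs)
 *   coords : vertex coordinates of the element (3 vertices x 2)
 *   w      : coefficient dofs (3)
 * The real generated kernel additionally uses tabulated basis arrays
 * and a loop over multiple quadrature points. */
static void helmholtz_like_kernel(double A[3],
                                  double coords[3][2],
                                  const double w[3])
{
    /* Jacobian of the affine map from the reference triangle. */
    double J00 = coords[1][0] - coords[0][0];
    double J01 = coords[2][0] - coords[0][0];
    double J10 = coords[1][1] - coords[0][1];
    double J11 = coords[2][1] - coords[0][1];
    double detJ = fabs(J00 * J11 - J01 * J10);

    /* One-point quadrature at the barycentre: every P1 basis function
     * evaluates to 1/3 there, and the reference-cell weight is 1/2. */
    double u_at_qp = (w[0] + w[1] + w[2]) / 3.0;
    for (int i = 0; i < 3; i++)
        A[i] += 0.5 * detJ * u_at_qp * (1.0 / 3.0);
}
```

The generated kernel of Listing 2 follows the same pattern: Jacobian computation first, then a quadrature loop that accumulates into the output argument.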


2.2 Global assembly
During global assembly, the local contribution from each mesh entity, computed by the element kernel, is accumulated into the global data structure. In Firedrake, PyOP2 (Rathgeber et al., 2012) is responsible for representing and realizing the iteration over mesh entities, marshalling data in and out of the element kernels. The computation is organized as PyOP2 parallel loops, or parloops: a parloop applies the element kernel to every entity of an iteration set, with access descriptors specifying how each kernel argument is gathered from and scattered to the global data.
Listing 3 shows the C code generated by PyOP2 for the above example; the code is JIT-compiled when the result is needed in Firedrake. In the context of vectorization, this approach, with the inlined element kernel, forms the baseline in our experimental evaluation. We note the following key features of the global assembly kernel. The outer loop is over mesh entities. For each entity, the computation is divided into three parts: gathering the input data from global into local arrays, invoking the element kernel, and scattering the local result back to the global data. The gathering and scattering of data use indirect addressing via base pointers and indirection maps. Different mesh entities might share the same degrees of freedom: parallelization of the scattering loop on line 27 must be aware of the potential for data races. Global assembly interacts with local assembly via a function call (line 25). While the C compiler can inline this call, it creates an artificial boundary to loop optimization techniques that operate at the source code level. Additionally, even after inlining, outer-loop vectorization over mesh entities requires that the C compiler vectorize through data-dependent array accesses. This is the software engineering challenge that previously limited vectorization to a single local assembly kernel in Firedrake.

Listing 3. Global assembly code for action of the Helmholtz operator in C generated by PyOP2.
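The gather–compute–scatter structure described above can be sketched as follows (the names and the trivial stand-in element kernel are hypothetical; the real indirection maps and kernels come from PyOP2 and TSFC):

```c
#include <assert.h>

enum { NDOF = 3 };  /* local degrees of freedom per element */

/* Stand-in for the inlined element kernel of Listing 3. */
static void element_kernel(double out[NDOF], const double in[NDOF])
{
    for (int i = 0; i < NDOF; i++)
        out[i] += 2.0 * in[i];
}

/* map[e][i]: global dof index of local dof i on element e. */
static void global_assembly(int nelem, int (*map)[NDOF],
                            const double *global_in, double *global_out)
{
    for (int e = 0; e < nelem; e++) {
        double t_in[NDOF], t_out[NDOF];
        for (int i = 0; i < NDOF; i++) {        /* gather (indirect reads) */
            t_in[i] = global_in[map[e][i]];
            t_out[i] = 0.0;
        }
        element_kernel(t_out, t_in);            /* compute */
        for (int i = 0; i < NDOF; i++)          /* scatter: shared dofs can */
            global_out[map[e][i]] += t_out[i];  /* race if parallelized */
    }
}
```

The scatter loop illustrates the data-race hazard noted above: two elements sharing a degree of freedom both increment the same global entry.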
3 Vectorization
As one would expect, the loop nests and loop trip counts vary considerably for different integrals, meshes and function spaces that users might choose. This complexity is one of the challenges that our system specifically, and Firedrake more generally, must face in order to deliver predictable performance on modern CPUs, which have increasingly rich SIMD instruction sets.
In the prior approach to vectorization in our framework, the local assembly kernels generated by TSFC were further transformed to facilitate vectorization, as described in Luporini et al. (2015). The arrays are padded so that the trip counts of the innermost loops match multiples of the length of the SIMD units. However, padding becomes less effective for low polynomial degrees on wide SIMD units. For instance, AVX512 instructions act on 8 double-precision floats, but the loops for degree 1 polynomials on triangles only have trip counts of 3. Moreover, loop-invariant code motion is very effective in reducing the number of floating-point operations, but hoisted instructions are not easily vectorized as they are no longer in the innermost loops. This effect is more pronounced on tensor-product elements, where TSFC is able to apply sum factorization.
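The padding transformation can be sketched as follows (a hypothetical P1 dof–dof loop nest whose trip count of 3 is padded to 4; the padded rows and columns of the arrays are zero-filled so the extra lane contributes nothing):

```c
#include <assert.h>

enum { NPAD = 4 };  /* trip count 3 for P1 on triangles, padded to 4 */

/* A[i] += sum_j B[i][j] * w[j], over the padded extents: the loops
 * now match the SIMD width, at the cost of wasted work in one lane. */
static void padded_matvec(double B[NPAD][NPAD],
                          const double w[NPAD], double A[NPAD])
{
    for (int i = 0; i < NPAD; i++)
        for (int j = 0; j < NPAD; j++)   /* unit stride, full SIMD width */
            A[i] += B[i][j] * w[j];
}
```

On AVX512, the same trick would require padding 3 up to 8, more than doubling the work, which is why padding loses effectiveness on wide SIMD units.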
3.1 Cross-element vectorization and Loopy
Another strategy is to vectorize across several elements in the outer loop over the mesh entities, as proposed previously by Kronbichler and Kormann (2017). This approach computes the contributions from several mesh entities using SIMD instructions, where each SIMD lane handles one entity. This is always possible regardless of the complexity of the local element kernel because the computation on each entity is independent and identical. One potential downside is the increase in register and cache pressure as the working set is larger.
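A schematic of cross-element batching (illustrative names; real kernels carry full gather/compute/scatter bodies): the element loop is blocked by the SIMD width, scalar temporaries are expanded across the lanes, and a remainder loop handles leftover elements.

```c
#include <assert.h>

#define VL 4  /* batch size = SIMD width in doubles (e.g. AVX2) */

/* Before: for (e = 0; e < nelem; e++) { double t = in[e]; out[e] = t*t; }
 * After cross-element batching, each statement of the element kernel
 * becomes an inner loop over lanes, each lane handling one element. */
static void batched_square(int nelem, const double *in, double *out)
{
    int nfull = nelem / VL;
    for (int b = 0; b < nfull; b++) {
        double t[VL];                 /* vector-expanded temporary */
        for (int l = 0; l < VL; l++)
            t[l] = in[b * VL + l];
        for (int l = 0; l < VL; l++)  /* identical, independent work */
            out[b * VL + l] = t[l] * t[l];
    }
    for (int e = nfull * VL; e < nelem; e++)  /* remainder loop */
        out[e] = in[e] * in[e];
}
```

Because the per-element computations are independent and identical, the inner lane loops vectorize regardless of how irregular the element kernel itself is; the cost is the larger working set noted above.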
For a compiler, the difficulty in performing cross-element vectorization (or, more generally, outer-loop vectorization) is to automate a sequence of loop transformations and necessary data layout transformations robustly. This is further complicated by the indirect memory access in data gathering and scattering, and the need to unroll and interchange loops across the indirections, which requires significantly more semantic knowledge than that available to the C compiler.
Loopy (Klöckner, 2014) is a code generator for loop-based computations, embedded in Python, which targets both CPUs and GPUs. Loopy provides abstractions based on integer sets for loop-based computations and enables powerful transformations based on the polyhedral model (Verdoolaege, 2010). Loop-based computations in Loopy are represented as kernels: a kernel combines a loop domain, expressed as a set of integer points, with a list of statements that are executed for every point of the domain.
To integrate with Loopy, the code generation mechanisms in Firedrake were modified as illustrated in Figure 1. Instead of generating source code directly, TSFC and PyOP2 are modified to generate Loopy kernels. We have augmented the Loopy internal representation with the ability to support a generalized notion of kernel fusion through the nested composition of kernels, specifically through subprograms and inlining. This allows PyOP2 to inline the element kernel such that the global assembly Loopy kernel encapsulates the complete computation of global assembly. This holistic view of the overall computation enables robust loop transformations for vectorization across the boundary between global and local assembly. To facilitate SIMD instruction generation, we also introduced a new OpenMP target to Loopy which extends its existing C-language target to support OpenMP SIMD directives (OpenMP Architecture Review Board, 2018, §2.9.3).

Figure 1. Integration of Loopy in Firedrake for global assembly code generation.
Listing 4 shows an abridged version of the global assembly Loopy kernel for the Helmholtz operator, with the element kernel fused. We highlight the following key features of Loopy kernels. Loop indices, called inames, define the loop domain. Loop transformations operate on kernels by rewriting the loop domain and the statements making up the kernel. In addition, each iname carries a set of tags which control how the corresponding loop is realized in the generated code, for example unrolled or vectorized. Multi-dimensional arrays occur as kernel arguments or as temporary variables. Dependencies between statements specify their partial order; statement scheduling can also be controlled by assigning priorities to statements and inames.

Listing 4. Global assembly Loopy kernel of the Helmholtz operator.
For example, to achieve cross-element vectorization (by batching four elements into one SIMD vector in this example) we invoke the following sequence of Loopy transformations on the global assembly Loopy kernel, exploiting the domain knowledge of finite element assembly. First, we split the outer loop over mesh entities into an outer loop over batches and an inner loop over the four elements of a batch. We then privatize the temporary variables with respect to the new inner iname, expanding each scalar temporary into a small vector with one lane per element. Finally, we assign the tag that instructs Loopy to generate SIMD code for the inner loop.
We show the change to the Loopy kernel after these transformations in Listing 5. In particular, the vector-expansion gives each temporary an additional trailing dimension of extent four, indexed by the new inner iname.

Listing 5. Changes to global assembly Loopy kernel of the Helmholtz operator after cross-element vectorization.
Listing 6 shows the generated C code for the Helmholtz operator vectorized by grouping together four elements. Apart from the previously mentioned changes, we note the following details. Loopy provides a mechanism to declare arrays aligned to specified memory boundaries (64 bytes in this example). The remainder loop which handles the case where the number of elements is not divisible by four is omitted here for simplicity. After cross-element vectorization, all local assembly instructions (lines 28–42) are inside loops over the four SIMD lanes, each annotated with an OpenMP SIMD directive.

Listing 6. Global assembly code for action of the Helmholtz operator in C vectorized by batching four elements.
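The two ingredients noted above, aligned temporaries and SIMD-annotated lane loops, can be sketched in isolation (a hypothetical kernel; Loopy emits compiler-specific alignment attributes, while C11 alignas is used here for brevity):

```c
#include <stdalign.h>
#include <assert.h>

/* One batch of four lanes: a 64-byte aligned, vector-expanded temporary
 * plus an OpenMP SIMD loop. Without -fopenmp the pragma is ignored and
 * the code remains correct, merely unannotated. */
static void simd_batch4(const double *restrict in, double *restrict out)
{
    alignas(64) double t[4];   /* aligned temporary, as in Listing 6 */
    #pragma omp simd
    for (int i = 0; i < 4; i++) {
        t[i] = in[i] * in[i];
        out[i] = t[i] + in[i];
    }
}
```

The alignment lets the compiler use aligned vector loads and stores for the temporary, while the directive asserts that the lane iterations are independent.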
3.2 Vector extensions
A more direct way to inform the compiler to emit SIMD instructions without depending on OpenMP is to use vector extensions, as supported by GCC and Clang. Vector extensions provide fixed-length vector types on which arithmetic and assignment act elementwise, so that operations on a batch of four elements can be expressed directly as operations on vectors of four doubles.

Listing 7. Global assembly code for action of the Helmholtz operator in C vectorized by batching four elements (using vector extensions).
Compared to Listing 6, using vector extensions removes most of the innermost loops, and the only remaining OpenMP SIMD directives are due to the limitations of vector extensions as explained previously.
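A minimal sketch of the vector-extension style (the type name and function are illustrative, not taken from the generated code): a vector type of four doubles supports elementwise arithmetic, so the innermost lane loop disappears from the source.

```c
#include <assert.h>

/* GCC/Clang vector extension: four doubles in one 32-byte vector. */
typedef double vec4 __attribute__((vector_size(32)));

/* a*b + c computed across all four lanes in a single expression; the
 * compiler maps this onto SIMD registers (and FMA where available). */
static vec4 fma4(vec4 a, vec4 b, vec4 c)
{
    return a * b + c;
}
```

Individual lanes remain accessible by subscripting (e.g. `r[0]`), which is how the scatter phase can still address per-element results.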
4 Performance evaluation
We follow the performance evaluation methodology of Luporini et al. (2017) by measuring the assembly time of a range of operators of increasing complexity and polynomial degrees. Due to the large number of combinations of experimental parameters (operators, meshes, polynomial degrees, vectorization strategies, compilers, hyperthreading), we only report an illustrative portion of the results here, with the entire suite of experiments made available on the interactive online repository CodeOcean (Sun, 2019a).
4.1 Experimental setup
We performed experiments on a single node of two Intel systems, based on the Haswell and Skylake microarchitectures, as detailed in Table 1. Firedrake uses MPI for parallel execution, where each MPI process handles the assembly for a subset of the domain. Hybrid MPI-OpenMP parallelization is not supported, and we stress that OpenMP pragmas are only used for SIMD vectorization within a single MPI process. Because we observe that hyperthreading usually improves the performance by 5% to 10% for our applications, we set the number of MPI processes to the number of logical cores of the CPU to utilize all available computation resources. Experimental results with hyperthreading turned off are available on CodeOcean. Turbo Boost is switched off to mitigate reproducibility problems that might be caused by dynamic thermal throttling. The batch size, i.e. the number of elements grouped together for vectorization, is chosen to match the SIMD length. We use three C compilers: GCC 7.3, ICC 18.0 and Clang 5.0. The two vectorization strategies described in Section 3 are tested on all platforms, using the compiler flags listed in Table 1.
Table 1. Hardware specification for experiments.
For the benefit of reproducibility, we have archived the specific versions of Firedrake components used for the experimental evaluation on Zenodo (Sun, 2019b; Zenodo/Firedrake, 2019). An installation of Firedrake with components matching the ones used for evaluation in this paper can be obtained following the instruction at https://www.firedrakeproject.org/download.html, with the following command:
We measure the execution time of assembling the residual for five operators of increasing complexity: the mass matrix (“mass”), the Helmholtz operator (“helmholtz”), the vector Laplacian (“laplacian”), an elastic model (“elasticity”) and a hyperelastic model (“hyperelasticity”).
We performed experiments on both 2D and 3D domains, with two types of mesh used for each case: triangles (“tri”) and quadrilaterals (“quad”) in 2D, and tetrahedra (“tet”) and hexahedra (“hex”) in 3D.
Operator characteristics and speed-up summary, using GCC with vector extensions. AI: arithmetic intensity (FLOP/byte). D: trip count of loops over degrees of freedom. Q: trip count of loops over quadrature points. H: speed-up over baseline on Haswell, 16 processes, with vector extensions. S: speed-up over baseline on Skylake, 32 processes, with vector extensions.
We record the maximum execution time of the generated global assembly kernels over all MPI processes. This time does not include the time spent in synchronization and MPI data exchange for halo updates. Each experiment is run five times, and the average execution time is reported. Exclusive access to the compute nodes is ensured and threads are pinned to individual logical cores. Startup costs such as code generation time and compilation time are excluded. We use automatic vectorization by GCC without batching, compiled with the optimization flags of Table 1, as the baseline for comparison. Compared with our cross-element strategy, the baseline represents the out-of-the-box performance of compiler auto-vectorization for the local element kernel. We note that cross-element vectorization does not alter the algorithm of local assembly except for the vector expansion, as illustrated by Listing 2 and Listing 6. Consequently, the total number of floating-point operations remains the same. The performance benefit from cross-element vectorization is therefore composable with the operation-reduction optimizations performed by the form compiler on the local assembly kernels.
4.2 Experimental results and discussion
Figures 2 to 5 show the fraction of peak performance achieved by the different compilers and vectorization strategies for each operator, mesh and polynomial degree.

The fraction of peak FLOP/s (as listed in Table 1) achieved by different compilers for operators {

The fraction of peak FLOP/s (as listed in Table 1) achieved by different compilers for operators {

The fraction of peak FLOP/s (as listed in Table 1) achieved by different compilers for operators {

The fraction of peak FLOP/s (as listed in Table 1) achieved by different compilers for operators {

Figure 6. Roofline model of operators for baseline and cross-element vectorization using GCC on Haswell.

Figure 7. Roofline model of operators for baseline and cross-element vectorization using GCC on Skylake.
4.2.1 Compiler comparison and vector extensions
When vectorizing with OpenMP SIMD directives, ICC gives the best performance for almost all test cases, followed by Clang, while GCC is significantly less competitive. The performance disparity is more pronounced on Skylake than on Haswell. However, when using vector extensions, Clang and GCC improve significantly and are able to match the performance of ICC on both Haswell and Skylake, whereas ICC performs similarly with both OpenMP SIMD directives and vector extensions.
We use the Intel Software Development Emulator to count the number of instructions executed at runtime for code generated by different compilers. The data indicate that although floating-point operations are fully vectorized by all compilers, GCC and Clang generate more load and store instructions between vector registers and memory when using OpenMP SIMD directives for vectorization. One possible reason is that GCC and Clang allocate short arrays on the stack rather than in vector registers directly, causing more load on the memory subsystem.
In light of these results, we conclude that vectorization with vector extensions allows greater performance portability on different compilers and CPUs for our application. It is, therefore, our preferred strategy for implementing cross-element vectorization, and is the default option for the rest of our analysis.
4.2.2 Vectorization speed-up
Almost across the board, significant speed-up is achieved on the test cases under consideration. Slowdown occurs in two situations. At low polynomial degrees, the kernels tend to have low arithmetic intensity, so the increase in available floating-point throughput through cross-element vectorization cannot compensate for the increase in the size of the working set of data. On simple operators such as the mass matrix, the arithmetic intensity remains low even at higher polynomial degrees, so the same effect can persist.
4.2.3 Achieved fraction of peak performance
We observe that the fraction of peak performance varies smoothly with polynomial degree for cross-element vectorization in all test cases. This fulfils an important design requirement for Firedrake: small changes in problem setup by the users should not create unexpected performance degradation. This is also shown in Figures 6 and 7, where the results are more clustered on the roofline plots after cross-element vectorization. The baseline, by contrast, shows inconsistent performance, especially at low polynomial degrees.
On simplicial meshes (
We also observe that there exist a small number of test cases where the achieved peak performance is marginally higher than the LINPACK benchmark on Skylake, as shown in Figure 7. One possible reason for this observation is thermal throttling since our test cases typically run for a shorter period of time than LINPACK. We also note that these test cases correspond to high order
4.2.4 Tensor-product elements
We observe higher and more consistent speed-up for tensor-product elements (
The base elements of
5 Conclusion and future work
We have presented a portable, general-purpose solution for delivering stable vectorization performance on modern CPUs for matrix-free finite element assembly for a very broad class of finite element operators on a large range of elements and polynomial degrees. We described the implementation of cross-element vectorization in Firedrake which is transparent to the end users. Although the technique of cross-element vectorization is conceptually simple and has been applied in hand-written kernels before, our implementation based on code generation is automatic, robust and composable with other optimization passes.
We have presented extensive experimental results on two recent Xeon processors that are commonly used in HPC applications, and compared the vectorization performance of three popular C compilers. We showed that by generating appropriate vectorizable code, and using compiler-based vector extensions, we can obtain portably high performance across all three compilers.
The write-back to global data structure is not vectorized in our approach due to possible race conditions. The newly introduced Conflict Detection instructions in the Intel AVX512 instruction set could potentially mitigate this limitation (Zhang, 2016, section 2.3). This could be achieved by informing Loopy to use the relevant intrinsics when generating code for loops with specific tags.
We have focused on the matrix-free finite element method because it is compute-intensive and more likely to benefit from vectorization. However, our methods and implementation also support matrix assembly. Firedrake relies on PETSc (Balay et al., 2017) to handle distributed sparse matrices, and PETSc requires certain data layouts for the input array when updating the global matrices. When several elements are batched together for cross-element vectorization, we need to generate code to explicitly unpack/transpose the local assembly results into individual arrays before calling PETSc functions to update the global sparse matrices for each element. Future improvement could include eliminating this overhead, possibly by extending the PETSc API.
The newly introduced abstraction layer, together with the Loopy integration in the code generation and optimization pipeline, opens up multiple possibilities for future research in Firedrake. These include code generation with SIMD intrinsics, loop tiling, and GPU acceleration, all of which are already supported in Loopy.
Supplemental material
Supplemental Material (HPC945005) for “A study of vectorization for matrix-free finite element methods” by Tianjiao Sun, Lawrence Mitchell, Kaushik Kulkarni, Andreas Klöckner, David A Ham and Paul HJ Kelly in The International Journal of High Performance Computing Applications.
