Abstract

Hybrid computer systems combine compute units (CUs) of different kinds, such as CPUs, GPUs and FPGAs. Exploiting the computing power of these CUs simultaneously requires carefully decomposing applications into balanced parallel tasks, taking into account both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for hybrid GPU-CPU OpenMP applications that are mixed with GPU-oriented programming models (e.g. CUDA/HIP). It presents the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite, in which the implementation exploits both coarse-grain and fine-grain parallelism mapped to CUs of different kinds (GPUs and CPUs). The runtime support bridges OpenMP and HIP by introducing two abstractions: Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under two state-of-the-art OpenMP schedulers: static and dynamic task scheduling. We then extend the set of schedulers with two additional variants: memorizing-dynamic task scheduling and profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.25 GHz (64 cores with 2 threads per core, totalling 128 threads per node) and 2 × AMD Radeon Instinct MI50 GPUs with 32 GB, hybrid executions achieve speedups from 1.10× up to 3.5× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs.
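To make the idea of a profile-based static schedule concrete, the following is a minimal C++ sketch and not the runtime described in the paper. It assumes a hypothetical profile table `cost[z][u]` holding a previously measured execution time of zone `z` on compute unit `u`, and it uses a simple greedy heuristic (largest zone first, earliest predicted finish) that is our own illustrative choice. The names `Profile`, `Schedule` and `profile_based_static` are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical profile: cost[z][u] = measured time of zone z on compute unit u.
// Every row is assumed to hold one entry per compute unit.
using Profile = std::vector<std::vector<double>>;

struct Schedule {
    std::vector<int> owner;    // owner[z] = compute unit assigned to zone z
    std::vector<double> load;  // load[u]  = predicted busy time of unit u
};

// Illustrative greedy heuristic (an assumption, not the paper's algorithm):
// visit zones from most to least expensive and give each one to the unit
// whose predicted finish time after taking the zone is smallest.
Schedule profile_based_static(const Profile& cost, std::size_t num_units) {
    std::vector<std::size_t> order(cost.size());
    std::iota(order.begin(), order.end(), std::size_t{0});

    // Rank zones by their best-case cost across all units, largest first.
    auto best = [&](std::size_t z) {
        return *std::min_element(cost[z].begin(), cost[z].end());
    };
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return best(a) > best(b); });

    Schedule s{std::vector<int>(cost.size(), -1),
               std::vector<double>(num_units, 0.0)};
    for (std::size_t z : order) {
        std::size_t pick = 0;
        for (std::size_t u = 1; u < num_units; ++u) {
            if (s.load[u] + cost[z][u] < s.load[pick] + cost[z][pick]) {
                pick = u;
            }
        }
        s.owner[z] = static_cast<int>(pick);
        s.load[pick] += cost[z][pick];
    }
    return s;
}
```

As a usage illustration with made-up numbers: for four equal zones that each take 2 time units on unit 0 (say, a GPU) and 5 on unit 1 (a CPU), the sketch places three zones on unit 0 and one on unit 1, giving predicted loads of 6 and 5 instead of 8 for a GPU-only assignment. The assignment is computed once before execution, which is what distinguishes a static schedule from a dynamic one that hands out tasks at run time.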

Keywords

Heterogeneous programming, hybrid CPU-GPU, OpenMP, CUDA, HIP
