Century-long global climate simulations at high resolution generate large amounts of
data on parallel architectures. Currently, the Community Atmosphere Model (CAM),
the atmospheric component of the NCAR Community Climate System Model (CCSM), uses
sequential I/O, which creates a serious bottleneck for these simulations. In this
paper we describe the development of parallel I/O for CAM. This parallel I/O combines
a novel remapping of 3-D arrays with the parallel netCDF library as the I/O interface.
Because of the parallel decomposition, CAM history variables are stored on disk in a
different index order than in CPU-resident memory, so an index reshuffle is performed
on the fly. Our strategy is first to remap the 3-D arrays from their native
decomposition to a z-decomposition across the distributed architecture, and then to
write the data to disk from there. Because the z-decomposition is consistent with
the last array dimension, data transfers can occur at maximum block sizes and
therefore achieve maximum I/O bandwidth.
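As a minimal sketch of this remapping idea (not CAM's actual code), the exchange can be expressed as a single MPI_Alltoall that leaves each rank holding whole horizontal planes for a contiguous block of levels. The block-column layout, the even divisibility, and all names below are illustrative assumptions; in this C row-major sketch the slowest-varying level dimension plays the role of the last Fortran array dimension described above:

    /* Sketch: remap a 3-D field from a horizontal (column) decomposition to a
       z-decomposition, so that each rank ends up holding whole horizontal
       planes for a contiguous block of levels.  The block-column layout and
       even divisibility are illustrative assumptions, not CAM's scheme. */
    #include <mpi.h>
    #include <stdlib.h>

    /* field_in : nz * ncol_loc values, level-major (level varies slowest);
       field_out: nz_loc * ncol_glob values, i.e. full horizontal planes for
       this rank's nz_loc levels, contiguous in on-disk index order. */
    void remap_to_z(const double *field_in, double *field_out,
                    int nz, int ncol_glob, MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);
        int ncol_loc = ncol_glob / nprocs;  /* columns per rank (assumed even) */
        int nz_loc   = nz / nprocs;         /* levels per rank (assumed even)  */
        int blk      = nz_loc * ncol_loc;   /* values exchanged per rank pair  */

        /* With level-major input, the levels destined for rank d are already
           contiguous (field_in[d*blk .. (d+1)*blk-1]), so no send packing is
           needed and every message is one maximal block. */
        double *recv = malloc((size_t)nprocs * blk * sizeof *recv);
        MPI_Alltoall(field_in, blk, MPI_DOUBLE, recv, blk, MPI_DOUBLE, comm);

        /* Unpack: interleave each source rank's column strip into full planes. */
        for (int src = 0; src < nprocs; src++)
            for (int k = 0; k < nz_loc; k++)
                for (int c = 0; c < ncol_loc; c++)
                    field_out[(size_t)k * ncol_glob + src * ncol_loc + c] =
                        recv[(size_t)src * blk + k * ncol_loc + c];
        free(recv);
    }

The key property is that both the exchanged messages and the resulting slab are single contiguous blocks, which is what allows the subsequent write to proceed at full bandwidth.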
We also incorporate the parallel netCDF (PnetCDF) library recently developed at
Argonne/Northwestern as the collective I/O interface; this resolves a long-standing
issue, since the netCDF data format is used extensively in climate system models.
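The collective write itself then maps naturally onto PnetCDF. The calls below (ncmpi_create, ncmpi_put_vara_double_all, and so on) are the library's actual collective interface, but the file, variable, and dimension names are hypothetical and error checking is omitted for brevity:

    /* Sketch: each rank writes its contiguous z-slab (whole lat/lon planes for
       levels [k0, k0+nz_loc)) with one collective parallel-netCDF call.
       Names and schema are illustrative; error checking is omitted. */
    #include <mpi.h>
    #include <pnetcdf.h>

    void write_zslab(const double *slab, int nz, int ny, int nx,
                     int nz_loc, int k0, MPI_Comm comm)
    {
        int ncid, varid, dimids[3];
        ncmpi_create(comm, "history.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "lev", nz, &dimids[0]);
        ncmpi_def_dim(ncid, "lat", ny, &dimids[1]);
        ncmpi_def_dim(ncid, "lon", nx, &dimids[2]);
        ncmpi_def_var(ncid, "T", NC_DOUBLE, 3, dimids, &varid);
        ncmpi_enddef(ncid);

        /* Because the slab spans whole planes of the two fastest-varying
           dimensions, this request is one large contiguous block in the file. */
        MPI_Offset start[3] = { k0, 0, 0 };
        MPI_Offset count[3] = { nz_loc, ny, nx };
        ncmpi_put_vara_double_all(ncid, varid, start, count, slab);
        ncmpi_close(ncid);
    }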
We benchmark the performance of the new parallel I/O at several model resolutions on
five platforms (IBM SP3, SP4, SP5, Cray X1E, and BlueGene/L) using up to 1024 processors.
Several realistic model resolutions are examined, including EUL T85
(~1.4°), FV-B (2° × 2.5°), FV-C
(1° × 1.25°), and FV-D (0.5°
× 0.625°). For a standard single history output of a
CAM 3.1 FV-D resolution run (multiple 2-D and 3-D arrays totaling 4.1 GB),
our parallel I/O is 14 times faster than the existing I/O on the IBM SP3 and
9 times faster on the IBM SP5. The estimated time for a typical
century-long FV-D resolution simulation on the IBM SP5 shows that, for daily output
(roughly 36,500 history writes), the I/O time can be reduced from more than 8 days
of wall-clock time to less than 1 day.
This parallel I/O is also implemented on IBM BlueGene/L, where the existing
sequential I/O fails because of memory limitations, and the corresponding results
are presented.