Century-long global climate simulations at high resolution generate large amounts of
data on parallel architectures. Currently, the Community Atmosphere Model (CAM),
the atmospheric component of the NCAR Community Climate System Model (CCSM), uses
sequential I/O, which creates a serious bottleneck for these simulations. In this
paper we describe the development of parallel I/O for CAM. This parallel I/O combines
a novel remapping of 3-D arrays with the parallel netCDF library as the I/O interface.
Because of the parallel decomposition, CAM history variables are stored on disk in a
different index order than in CPU-resident memory, so an index reshuffle is performed
on the fly. Our strategy is first to remap the 3-D arrays from their native
decomposition to a z-decomposition across the distributed architecture, and then to
write the data to disk from there. Because the z-decomposition is consistent with
the last array dimension, data transfers can occur at maximum block sizes and
therefore achieve maximum I/O bandwidth.
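As a minimal sketch of this remapping idea (not CAM's actual code), the exchange can be expressed as a single MPI_Alltoall that leaves each rank holding whole horizontal planes for a contiguous block of levels. The block-column layout, the even divisibility, and all names below are illustrative assumptions; in this C row-major sketch the slowest-varying level dimension plays the role of the last Fortran array dimension described above:

    /* Sketch: remap a 3-D field from a horizontal (column) decomposition to a
       z-decomposition, so that each rank ends up holding whole horizontal
       planes for a contiguous block of levels.  The block-column layout and
       even divisibility are illustrative assumptions, not CAM's scheme. */
    #include <mpi.h>
    #include <stdlib.h>

    /* field_in : nz * ncol_loc values, level-major (level varies slowest);
       field_out: nz_loc * ncol_glob values, i.e. full horizontal planes for
       this rank's nz_loc levels, contiguous in on-disk index order. */
    void remap_to_z(const double *field_in, double *field_out,
                    int nz, int ncol_glob, MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);
        int ncol_loc = ncol_glob / nprocs;  /* columns per rank (assumed even) */
        int nz_loc   = nz / nprocs;         /* levels per rank (assumed even)  */
        int blk      = nz_loc * ncol_loc;   /* values exchanged per rank pair  */

        /* With level-major input, the levels destined for rank d are already
           contiguous (field_in[d*blk .. (d+1)*blk-1]), so no send packing is
           needed and every message is one maximal block. */
        double *recv = malloc((size_t)nprocs * blk * sizeof *recv);
        MPI_Alltoall(field_in, blk, MPI_DOUBLE, recv, blk, MPI_DOUBLE, comm);

        /* Unpack: interleave each source rank's column strip into full planes. */
        for (int src = 0; src < nprocs; src++)
            for (int k = 0; k < nz_loc; k++)
                for (int c = 0; c < ncol_loc; c++)
                    field_out[(size_t)k * ncol_glob + src * ncol_loc + c] =
                        recv[(size_t)src * blk + k * ncol_loc + c];
        free(recv);
    }

The key property is that both the exchanged messages and the resulting slab are single contiguous blocks, which is what allows the subsequent write to proceed at full bandwidth.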
We also incorporate the parallel netCDF (PnetCDF) library recently developed at
Argonne/Northwestern as the collective I/O interface; this resolves a long-standing
issue, since the netCDF data format is used extensively in climate system models.
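The collective write itself then maps naturally onto PnetCDF. The calls below (ncmpi_create, ncmpi_put_vara_double_all, and so on) are the library's actual collective interface, but the file, variable, and dimension names are hypothetical and error checking is omitted for brevity:

    /* Sketch: each rank writes its contiguous z-slab (whole lat/lon planes for
       levels [k0, k0+nz_loc)) with one collective parallel-netCDF call.
       Names and schema are illustrative; error checking is omitted. */
    #include <mpi.h>
    #include <pnetcdf.h>

    void write_zslab(const double *slab, int nz, int ny, int nx,
                     int nz_loc, int k0, MPI_Comm comm)
    {
        int ncid, varid, dimids[3];
        ncmpi_create(comm, "history.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "lev", nz, &dimids[0]);
        ncmpi_def_dim(ncid, "lat", ny, &dimids[1]);
        ncmpi_def_dim(ncid, "lon", nx, &dimids[2]);
        ncmpi_def_var(ncid, "T", NC_DOUBLE, 3, dimids, &varid);
        ncmpi_enddef(ncid);

        /* Because the slab spans whole planes of the two fastest-varying
           dimensions, this request is one large contiguous block in the file. */
        MPI_Offset start[3] = { k0, 0, 0 };
        MPI_Offset count[3] = { nz_loc, ny, nx };
        ncmpi_put_vara_double_all(ncid, varid, start, count, slab);
        ncmpi_close(ncid);
    }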
We benchmark the performance of the new parallel I/O at several model resolutions on
five platforms (IBM SP3, SP4, SP5, Cray X1E, and BlueGene/L) using up to 1024 processors.
Several realistic model resolutions are examined, including EUL T85
(~1.4°), FV-B (2° × 2.5°), FV-C
(1° × 1.25°), and FV-D (0.5°
× 0.625°). For a standard single history output of a
CAM 3.1 FV-D resolution run (multiple 2-D and 3-D arrays totaling 4.1 GB),
our parallel I/O is 14 times faster than the existing I/O on the IBM SP3 and
9 times faster on the IBM SP5. The estimated time for a typical
century-long FV-D resolution simulation on the IBM SP5 shows that, for daily output
(roughly 36,500 history writes), the I/O time can be reduced from more than 8 days
of wall-clock time to less than 1 day.
This parallel I/O is also implemented on IBM BlueGene/L, where the existing
sequential I/O fails because of memory limitations, and the corresponding results
are presented.