
Abstract

Lossy compressors are increasingly adopted in scientific research, tackling volumes of data from experiments or parallel numerical simulations and facilitating data storage and movement. In contrast with the notion of entropy in lossless compression, no theoretical or data-based quantification of lossy compressibility exists for scientific data. Users rely on trial and error to assess lossy compression performance. As a strong data-driven effort toward quantifying lossy compressibility of scientific datasets, we provide a statistical framework to predict compression ratios of lossy compressors. Our method is a two-step framework where (i) compressor-agnostic predictors are computed and (ii) statistical prediction models relying on these predictors are trained on observed compression ratios. Proposed predictors exploit spatial correlations and notions of entropy and lossyness via the quantized entropy. We study 8+ compressors on 6 scientific datasets and achieve a median percentage prediction error of less than 12%, which is substantially smaller than that of other methods, while achieving at least an 8.8× speedup for searching for a specific compression ratio and a 7.8× speedup for determining the best compressor out of a collection.
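The quantized entropy mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact definition, so this version rests on assumptions: values are quantized uniformly into bins of width twice an absolute error bound, and the Shannon entropy of the resulting bin indices is reported in bits per value. The function name `quantized_entropy` and the parameter `error_bound` are hypothetical and not taken from the article.

```python
import math
from collections import Counter
from typing import Iterable


def quantized_entropy(values: Iterable[float], error_bound: float) -> float:
    """Shannon entropy (bits per value) of uniformly quantized data.

    Assumption: bins have width 2 * error_bound, so every value lies
    within error_bound of the center of its bin. This is an illustrative
    definition, not necessarily the one used in the article.
    """
    if error_bound <= 0:
        raise ValueError("error_bound must be positive")
    bin_width = 2.0 * error_bound
    # Count how many values fall into each quantization bin.
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("values must be non-empty")
    # Entropy of the empirical distribution of bin indices.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    data = [0.0, 0.1, 1.0, 1.1]
    # A loose bound merges nearby values into shared bins; a tight bound
    # keeps them apart, which raises the entropy.
    print(quantized_entropy(data, error_bound=0.5))
    print(quantized_entropy(data, error_bound=0.05))
```

Under these assumptions, a lower quantized entropy at a given error bound suggests that fewer distinct symbols survive quantization, which is the kind of compressor-agnostic signal a prediction model in step (ii) could be trained on.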

Keywords

Scientific data, data reduction, lossy compression, high-performance applications, data storage and movements


