Sage Journals: Discover world-class research

Abstract

Due to the significant amount of DNA data generated by next-generation sequencing machines for genomes of lengths ranging from megabases to gigabases, there is an increasing need to compress such data so that they occupy less space and can be transmitted faster. Different implementations of Huffman encoding that incorporate the characteristics of DNA sequences are shown to compress DNA data better. These implementations center on selecting frequent repeats so as to force a skewed Huffman tree, and on constructing multiple Huffman trees when encoding. The implementations demonstrate improved compression ratios for five genomes with lengths ranging from 5 to 50 Mbp, compared with the standard Huffman tree algorithm. The research hence suggests an improvement on all DNA sequence compression algorithms that use conventional Huffman encoding. Accompanying software is publicly available (AL-Okaily, 2016).
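As a point of reference for the baseline the abstract compares against, the following is a minimal sketch of conventional Huffman encoding applied to a DNA string cut into fixed-length k-mers. It is not the accompanying software and does not implement the skewed-tree or multiple-tree variants; the function names (`huffman_code`, `split_kmers`, `encode`, `decode`, `bits_per_base`) and the choice of non-overlapping k-mers as symbols are assumptions made for illustration only.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(freqs):
    """Return {symbol: bitstring} for a standard Huffman code over freqs."""
    if len(freqs) == 1:
        # Degenerate case: a single symbol still needs a 1-bit codeword.
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # keeps heap comparisons away from the dict payloads
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codewords.
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


def split_kmers(seq, k):
    """Cut seq into non-overlapping k-mers; the last piece may be shorter."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def encode(seq, k):
    """Huffman-encode seq using k-mers as symbols; return (bits, code table)."""
    symbols = split_kmers(seq, k)
    code = huffman_code(Counter(symbols))
    return "".join(code[s] for s in symbols), code


def decode(bits, code):
    """Invert encode() by reading the prefix-free codewords left to right."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


def bits_per_base(seq, k):
    """Payload bits per base (the code table itself is not counted)."""
    bits, _ = encode(seq, k)
    return len(bits) / len(seq)
```

With a uniform base composition and k = 1 this sketch costs 2 bits per base, the same as a fixed two-bit encoding; the cost drops only when the symbol frequencies are skewed, for example when a few k-mers repeat often. A fair comparison would also have to count the size of the code table, which this sketch ignores.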


References

Ahlswede, Baumer, Cai, Aydinian, Blinovsky, Deppe, and Mashurian (2006). Identification entropy. In General Theory of Information Transfer and Combinatorics. Springer.

AL-Okaily (2016). Unbalanced Huffman tree. https://github.com/aalokaily/Unbalanced-Huffman-Tree.

Allison, Edgoose, and Dix T.I. (1998). Compression of strings with approximate repeats. In Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology (Canada), pages 8–16. AAAI Press.

Bakr N.S. and Sharawi A.A. (2013). DNA lossless compression algorithms: Review. American Journal of Bioinformatics Research, 3, 72–81.

Cao M.D., Dix T.I., Allison, and Mears (2007). A simple statistical algorithm for biological sequence compression. In Data Compression Conference (Utah), pages 43–52. IEEE.

Chen, Kwong, and Li (2000). A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (Japan), page 107. ACM.

Chen, Li, Ma, and Tromp (2002). DNAcompress: Fast and effective DNA sequence compression. Bioinformatics, 18(12), 1696–1698.

Cherniavsky and Ladner (2004). Grammar-based compression of DNA sequences. In Proceedings of the DIMACS Working Group on the Burrows-Wheeler Transform (New Jersey). Citeseer.

Gailly J.-L. and Adler (2003). The gzip home page. 27th July.

Glassey and Karp (1976). On the optimality of Huffman trees. SIAM Journal on Applied Mathematics, 31(2), 368–378.

Grumbach and Tahi (1994a). Compression of DNA sequences.

Grumbach and Tahi (1994b). A new challenge for compression algorithms: Genetic sequences. Information Processing and Management, 30(6), 875–886.

Huffman D.A. (1952). A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9), 1098–1101.

Loewenstern and Yianilos P.N. (1999). Significantly lower entropy estimates for natural DNA sequences. Journal of Computational Biology, 6(1), 125–142.

Longo and Galasso (1982). An application of informational divergence to Huffman codes. IEEE Transactions on Information Theory, 28(1), 36–43.

Ma, Tromp, and Li (2002). PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18(3), 440–445.

NCBI. (2001). Vibrio cholerae O1 biovar El Tor str. N16961, taxid = 243277.

NCBI. (2008). Mycobacterium abscessus ATCC 19977, taxid = 561007.

NCBI. (2011). Saccharomyces cerevisiae S288c, taxid = 559292.

NCBI. (2013). Neurospora crassa OR74A, taxid = 367110.

NCBI. (2015). Chr22.

Parker D.S. Jr (1980). Conditions for optimality of the Huffman algorithm. SIAM Journal on Computing, 9(3), 470–489.

Rajeswari P.R., Apparao, and Kumar R.K. (2010). Huffbit compress: Algorithm to compress DNA sequences using extended binary trees. Journal of Theoretical and Applied Information Technology, 13(2), 101–106.

Rivals, Delahaye J.-P., Dauchet, and Delgrange (1997). Fast discerning repeats in DNA sequences with a compression algorithm. Genome Informatics, 8, 215–226.

Saada and Zhang (2015). DNA sequences compression algorithms based on the two bits codation method. In Proceedings of the International Conference on Bioinformatics and Biomedicine (Washington DC), pages 1684–1686. IEEE.

Tembe, Lowey, and Suh (2010). G-sqz: Compact encoding of genomic sequence and quality data. Bioinformatics, 26(17), 2192–2194.

Toward a Better Compression for DNA Sequences Using Huffman Encoding
