Abstract
Introduction
The vision of miniaturized sequencing devices is turning into reality with the emergence of the MinION by Oxford Nanopore. 1 Such devices are promising for a variety of applications, ranging from wildlife studies and the clinical capture of sequenced genes to food inspection for identifying pathogens. However, such portable devices are commonly subject to constraints on processing capability, power budget, storage, and communication. Under these constraints, the traditional view of genome compression architecture, with a simple decoder and a complex encoder, needs to be changed, and novel techniques are urgently needed to meet these emerging challenges. Data compression methods with significantly lower computational complexity and memory requirements thus become crucial for the efficient management of genomic data on portable devices.
In both situations (with or without reference sequences), traditional genome compression is computationally expensive at the encoder. The complexity is dominated by matching (approximately) repeated patterns of nucleotides – namely adenine (A), cytosine (C), guanine (G), and thymine (T) – between or within the DNA sequences. These patterns are also accompanied by insertions, deletions, and substitutions of single nucleotides.
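The approximate-repeat matching that dominates encoder complexity is, at its core, an edit-distance computation over the four nucleotides. As a minimal illustration (not the codec used in this paper), the classic dynamic-programming recurrence counting insertions, deletions, and substitutions can be sketched as:

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance over nucleotide strings,
    counting insertions, deletions, and substitutions of single symbols."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

# Two approximately repeated fragments, differing by one substitution
# (G -> C) and one deletion of the trailing T:
print(edit_distance("ACGTACGT", "ACCTACG"))  # 2
```

Running this recurrence between or within long sequences is what makes traditional encoders computationally heavy, motivating the low-complexity encoder proposed below.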
To date, a number of specialized DNA sequence compression algorithms have been proposed. Following Ziv and Lempel's ideas, 2 Grumbach and Tahi 3 proposed the first DNA sequence compressor, Biocompress, which compresses exact repeating patterns with a specially designed Fibonacci coder. The algorithm was then improved in Biocompress-2 4 by introducing a Markov model for encoding the nonrepeated regions. Chen et al. 5,6 extended the earlier approach to cover approximate repeats by further exploiting the nature of DNA sequences. Meanwhile, the work in Ref. 7 introduced a combined CTW + LZ algorithm for searching approximate repeats and palindromes using hashing and dynamic programming. Behzadi and Fessant 8 proposed a dynamic programming approach for the optimal selection of approximate repeats, with promising compression efficiency. However, such methods are heuristic, as the underlying statistics of the sequence patterns are generally ignored. The authors in Refs. 9–11 proposed to combine the matching and substitution of approximate repeats with a specific normalized maximum likelihood model, obtaining a much higher compression ratio. Subsequently, statistical modeling for predicting the generation of symbols in DNA sequences, together with arithmetic coding of those symbols, was proposed for more efficient compression. Cao et al. 12 proposed to estimate the probability distribution of symbols with a panel of experts to tackle the approximate repeat problem. Alternatively, finite context models have been proposed to capture different aspects of statistical information along the sequence. 13,14 However, such reference-free methods are plagued by their low compression ratios (not >6:1) and prohibitive computational consumption for large DNA sets.
Recognizing that reference-free architectures do not fully utilize available information, a series of algorithms have been proposed to compress sequences by matching approximate repeats against a reference sequence. The RLZ algorithm proposed by Kuruppu et al. 15 performs relative Lempel–Ziv compression of DNA sequences against a collection of related sequences. Wang and Zhang 16 proposed the GRS compressor, which is able to compress a sequence using a reference without any additional information. Applying the copy model to the matching of exact repeats in reference sequences, GReEn 17 achieved even larger gains than those in Refs. 15 and 16. Recently, reference-based algorithms 18,19 achieved highly efficient compression for the FASTQ data format by matching and comparing repeated subsequences in the reference sequences. Although reference-based architectures can achieve hundreds-fold compression, the requirement of reference sequences makes them impractical for miniaturized devices, which have very limited storage space and communication bandwidth.
In this study, we propose a novel and pioneering architecture for the genome compression application in miniaturized devices with limited processing capabilities, power budget, storage space, and communication bandwidth. The contribution of this paper is threefold.
First, to the best of our knowledge, the proposed architecture is the first practical one to meet the demands of miniaturized devices. Motivated by distributed source coding (DSC) for sensor networks, 20 the proposed scheme includes a simplified encoder, which has no access to reference sequences and does not communicate with other encoders, and a complex decoder, which detects repeated subsequences in the stored reference sequences and decompresses the received encoded bits with a specifically designed graphical model. Hence, the proposed compression system can meet the constraints and requirements of miniaturized sequencing devices.
Second, a flexible encoding and decoding mechanism is proposed. Using feedback from the decoder, the encoder transmits either hashes, which drive the detection of variable-size exact repeats at the decoder, or syndromes obtained with low-complexity Slepian–Wolf (SW) coding 21 of the nonrepeated subsequences. The proposed encoder and decoder perform efficiently by handling both exact repeats and approximately repeated subsequences (eg, with insertions, deletions, and substitutions).
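As a rough sketch of the hash-based side of this mechanism (the hash function, window length, and matching policy here are illustrative assumptions, not the paper's exact design): the encoder transmits only a short hash of a length-L window, and the decoder scans its locally stored reference for a window whose hash matches, falling back to a feedback request when no match is found.

```python
import hashlib

def window_hash(seq, nbytes=4):
    """Short hash of a candidate subsequence (truncated SHA-1 here as a
    stand-in for whatever lightweight hash a real encoder would use)."""
    return hashlib.sha1(seq.encode()).digest()[:nbytes]

def encode_window(source, start, length):
    """Encoder side: emit only the hash of the next length-`length` window."""
    return window_hash(source[start:start + length])

def decode_window(ref, h, length):
    """Decoder side: scan the stored reference for a window with a matching
    hash; return its position, or None (which would trigger a feedback
    request for syndromes or a shorter window length)."""
    for pos in range(len(ref) - length + 1):
        if window_hash(ref[pos:pos + length]) == h:
            return pos
    return None

ref    = "TTACGTACGGATCCA"
source = "GGATCCAACGT"       # its first 7 symbols repeat inside the reference
h = encode_window(source, 0, 7)
print(decode_window(ref, h, 7))  # 8
```

Note the asymmetry that motivates the DSC architecture: the encoder does one hash computation per window, while the full scan of the reference is pushed entirely to the decoder.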
Third, given the syndromes and the reference sequences at the decoder, we construct a novel factor graph model to tackle the challenge of detecting insertions, deletions, and substitutions between the reference and the original source. Experimental results show that the proposed architecture achieves efficient compression performance with significantly lower encoding complexity when compared to the benchmark compressor GRS.
The rest of this study is organized as follows. In the System Architecture section, we introduce the proposed DSC-based genome compression system, including the implementation details of the proposed hash-based coding of (exactly) repeated sequences with adaptive length and an overview of syndrome-based nonrepeated sequence coding. Then, in the Syndrome-based Nonrepeated Sequence Coding section, we focus on the design of the coding scheme, which can handle insertions, deletions, and substitutions between the sources and the reference. The experimental results and concluding remarks can be found in the last two sections.
System Architecture
The block diagram of the proposed genome compression framework is depicted in Figure 1. Suppose that there are two correlated DNA sequences (ie, a source sequence and a reference sequence) available at the encoder and the decoder, respectively. The variations between the two sequences are modeled by insertion, deletion, and substitution. The alphabet of the studied DNA sequences is confined within the set {"A", "T", "G", "C", "N"}.

Workflow of genome compression based on DSC.

At the encoder (see the left-hand side of Fig. 2), a streaming DNA sequence obtained from the portable sequencer will first be stored in the incoming data buffer for further processing. Second, a sub-sequence
At the decoder (see the right-hand side of Fig. 2), the received streaming data in the incoming data buffer will be processed by one of the following modules, based on the corresponding data compression mode (ie, either hash bits or syndromes).
For the received hash data
If a matched hash
The diagram of hash-based coding.
If no matched hash can be detected, the following two conditions will be checked.
if
Otherwise, the decoder also informs the hash matching failure to the encoder, but requests a shorter code length by setting
For the received syndromes, the decoder will pass the syndrome to the proposed factor graph-based LDPCA decoder with the capability of handling deletion, insertion, and substitution between the source and the reference. (See the Syndrome-based Nonrepeated Sequence Coding section for more implementation details). The following two conditions will be checked.
If the decoded source
Otherwise, the decoder will request additional LDPCA syndromes from the encoder.
Syndrome-based Nonrepeated Sequence Coding
As previously mentioned in our system architecture, if an exact repeat cannot be identified by hash coding, the decoder will request syndromes from the encoder through a feedback channel. In this section, we introduce the codec design of the proposed syndrome-based nonrepeated sequence coding.
Syndrome-based nonrepeated sequence encoding
The first step of the proposed syndrome-based nonrepeat encoder is to convert DNA data into a binary source, such that they can be compressed under a binary LDPCA encoder. Suppose the following mapping rule for the letters within the alphabet, ie, “
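The exact mapping rule is elided above; as a hypothetical example, assume the 2-bit assignment A→00, C→01, G→10, T→11. With the source binarized, the encoder's only output for a nonrepeated subsequence is the syndrome s = Hx over GF(2), sketched below with a toy parity-check matrix (a real system would use an LDPCA code):

```python
# Hypothetical 2-bit mapping (the paper's exact rule is not shown here):
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def dna_to_bits(seq):
    """Binarize a DNA subsequence so a binary LDPC(A) code can operate on it."""
    return [b for sym in seq for b in BITS[sym]]

def syndrome(H, x):
    """Syndrome s = H x over GF(2); this is all the encoder transmits
    for a nonrepeated subsequence."""
    return [sum(h * xi for h, xi in zip(row, x)) % 2 for row in H]

x = dna_to_bits("ACGT")             # 8 source bits
H = [[1, 0, 1, 0, 1, 0, 1, 0],      # toy 4x8 parity-check matrix, not a
     [0, 1, 0, 1, 0, 1, 0, 1],      # real LDPC code -- illustration only
     [1, 1, 0, 0, 1, 1, 0, 0],
     [0, 0, 1, 1, 0, 0, 1, 1]]
print(syndrome(H, x))  # -> [0, 0, 1, 1]: 8 source bits become 4 syndrome bits
```

Note that computing a syndrome is a handful of XORs per check equation, which is what keeps the encoder simple enough for a portable device; all the heavy inference happens at the decoder.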
Syndrome-based non-repeated sequence decoding
To perform syndrome-based decoding of a nonrepeated DNA subsequence x with the reference sequence as side information y, the key is to be able to estimate the variations between the source subsequence x and the reference sequence y, where the variations are modeled by insertions, deletions, and substitutions between the source and the reference. Moreover, a substitution can be expressed as an insertion in the source sequence followed by a deletion at the corresponding location in the reference sequence. In this section, we demonstrate that such variations can be effectively estimated through Bayesian inference on graphical models. The graphical model of our proposed syndrome-based decoding with variation is depicted in Figure 4. In Figure 4, the variable nodes (depicted by circles) denote variables such as the source symbols, binary source bits, local offsets introduced by variation, and syndromes. The factor nodes (depicted by squares) represent the relationships among the connected variable nodes. In the rest of this section, we describe how to construct the proposed factor graph for DNA sequence decoding with variations. We first study the parity check constraint imposed by the received syndromes, where

Factor graph of genome compression based on DSC.
Moreover,
Moreover, since the alphabet is not uniformly distributed in an arbitrary DNA sequence, the prior distribution for the alphabet is captured by the factor node
Now, we introduce an additional erasure variable node
For the existence of matches
The factor node
For
Finally, in the context of Bayesian inference, estimating unknown variables corresponds to the evaluations of their marginal distribution, which can be efficiently achieved by performing message passing on a factor graph. With the factor graph defined in Figure 4, the original DNA sequence can be recovered by the estimated posterior distribution of each source
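On a toy scale, the marginalization that message passing computes can be written out exhaustively. The sketch below (illustrative only; the bit-flip prior p and the tiny parity-check matrix are assumptions, and insertion/deletion offsets are omitted) combines the parity-check factors imposed by the syndrome with a simple source–reference correlation model, and recovers the source bits by hard-thresholding the posterior marginals:

```python
from itertools import product

def posterior_marginals(y, H, s, p=0.1):
    """Brute-force posterior P(x_i = 1 | y, syndrome s) for a toy code.
    y: side-information bits, H: parity-check matrix, s: received syndrome,
    p: prior probability that a source bit differs from the reference bit.
    A practical decoder obtains the same marginals via sum-product message
    passing on the factor graph instead of this exhaustive sum."""
    n = len(y)
    num = [0.0] * n
    Z = 0.0
    for x in product((0, 1), repeat=n):
        # Parity-check factors: x must reproduce the transmitted syndrome.
        if any(sum(h * xi for h, xi in zip(row, x)) % 2 != si
               for row, si in zip(H, s)):
            continue
        # Correlation factors: each bit flips relative to y with probability p.
        w = 1.0
        for xi, yi in zip(x, y):
            w *= p if xi != yi else 1 - p
        Z += w
        for i in range(n):
            num[i] += w * x[i]
    return [v / Z for v in num]

y = [0, 1, 1, 0]                        # side information at the decoder
H = [[1, 1, 0, 0], [0, 0, 1, 1]]        # toy parity checks
s = [1, 1]                              # syndrome of the true source
marg = posterior_marginals(y, H, s)
x_hat = [int(m > 0.5) for m in marg]    # hard decision on each marginal
print(x_hat)  # -> [0, 1, 1, 0]
```

Here the side information already satisfies both checks, so the posterior concentrates on y itself; when y and the syndrome disagree, the correlation factors let the decoder flip the cheapest set of bits that restores consistency.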
Results
Two genome sequence datasets, The Arabidopsis Information Resource (TAIR) 24 and The Institute for Genomic Research (TIGR), 25 are adopted in our experiments. These two databases are curated by professional groups and institutes, and have been widely used by the research community.
The TAIR maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. 24 In this experiment, we test the TAIR8 22 and TAIR9 23 datasets, where each dataset contains five chromosomes with over 238 million bases in total. Moreover, the genome of TAIR9 is used for testing our compression performance, with TAIR8 as the reference available only at the decoder. For this experiment, all the hyper-parameters are initialized as follows: the initial code length
First, the empirical marginal statistics of the DNA bases {"A", "T", "G", "C", "N"} and those of the local offsets

The empirical statistics of (
Figure 6A illustrates the relationship between the average code rates and different maximum local offsets in syndrome coding, based on all five chromosomes. In Figure 6A, we can see that the code rates decrease as the maximum local offset increases, because a larger maximum local offset offers a wider search region for exploring the reference. However, a larger maximum local offset may also result in higher decoding complexity. Figure 6B shows the overall compression performance (ie, hash bits + syndromes) for all five chromosomes in terms of compressed file size. Moreover, Figure 7 shows a side-by-side comparison of the compression rate and compression time. We can see that both the proposed method and the GRS algorithm achieve significant file size reductions (ie, up to 8252× file size reduction).

Compression performance of the proposed codec on TAIR dataset, (

Performance comparison between GRS and our proposed codec on TAIR dataset.
For the TIGR data, we tested chromosome 4 (35.8 MB) of the TIGR5 dataset using chromosome 4 of TIGR6 as the reference, varying the LDPC code length (ie, 528, 1056, 1584, 2112, and 2640), as shown in Table 1. Moreover, we also ran the GRS method on this dataset, and the result is listed in Table 2 for reference. We can see that the compression performance decreases as the LDPC code length increases. In this case, the proposed algorithm achieved better compression performance than GRS. This is because the reference chromosome 4 in TIGR6 includes a significant number of insertions compared with the same chromosome in TIGR5, and this insertion information in the reference chromosome contributes nothing to the size of the DSC-compressed data.
Performances of our proposed method on chromosome 4 of TIGR5 (35.8 MB).
Performance of GRS on chromosome 4 of TIGR5 (35.8 MB).
Our proposed encoder shows a significantly lower encoding complexity. It is worth mentioning that the proposed codec is implemented in MATLAB; a substantial performance boost can be expected from more efficient programming languages, eg, C/C++. To the best of our knowledge, this is the first study of DSC-based genome compression, and it opens many possibilities for portable miniaturized applications in which energy consumption and bandwidth usage are of paramount importance.
Conclusion
In this study, we presented a DSC-based genome compression architecture. To the best of our knowledge, the proposed framework is the first of its kind: it specifically targets low-complexity genome encoding for miniaturized devices, which have limited processing capabilities, power budget, storage space, and communication bandwidth. Compared with traditional reference-based DNA compression algorithms (eg, GRS), the proposed framework offers ultra-low encoding complexity: nonrepeated subsequences are encoded using low-complexity DSC encoding, while (exactly) repeated subsequences are compressed through adaptive-length hash coding based on decoder feedback. The customized factor graph-based decoder tackles the challenges of detecting insertions, deletions, and substitutions between the reference and the original source, and recovers the nonrepeated subsequences from the received syndromes. Last but not least, our proposed genome compression framework incorporates LDPCA codes for rate-adaptive decoding. Experimental results show that the proposed architecture achieves efficient compression performance with significantly lower encoding complexity when compared to the benchmark compressor GRS.
Author Contributions
Conceived and designed the experiments: SW, XJ, FC. Analyzed the data: FC. Wrote the first draft of the manuscript: SW, XJ, LC. Contributed to the writing of the manuscript: XJ, FC, SC. Agree with manuscript results and conclusions: XJ, FC, SC. Jointly developed the structure and arguments for the paper: SW, FC, LC, SC. Made critical revisions and approved final version: FC, XJ, SC. All authors reviewed and approved of the final manuscript.
