ABSTRACT
Given a collection of strings S = {s1,..., s
n
} over an alphabet Σ, a superstring α of S is a string containing each s
i
as a substring, that is, for each i, 1 ≤ i ≤ n, α contains a block of |s
i
| consecutive characters that match s
i
exactly. The shortest superstring problem is the problem of finding a superstring α of minimum length. The shortest superstring problem has applications in both computational biology and data compression. The shortest superstring problem is NP-hard (Gallant et al., 1980); in fact, it was recently shown to be MAX SNP-hard (Blum et al., 1994). Given the importance of the applications, several heuristics and approximation algorithms have been proposed. Constant factor approximation algorithms have been given in Blum et al. (1994) (factor of 3), Teng and Yao (1993) (factor of 28/9), Czumaj et al. (1994) (factor of 25/6), and Kosaraju et al. (1994) (factor of 250/63). Informally, the key to any algorithm for the shortest superstring problem is to identify sets of strings with large amounts of similarity, or overlap. Although the previous algorithms and their analyses have grown increasingly sophisticated, they reveal remarkably little about the structure of strings with large amounts of overlap. In this sense, they are solving a more general problem than the one at hand. In this paper, we study the structure of strings with large amounts of overlap and use our understanding to give an algorithm that finds a superstring whose length is no more than 23/4 times that of the optimal superstring. Our algorithm runs in 0(|S| + n3) time, which matches that of previous algorithms. We prove several interesting properties about short periodic strings, allowing us to answer questions of the following form: Given a string with some periodic structure, characterize all the possible periodic strings that can have a large amount of overlap with the first string.