
Compressing Suffix Arrays in Java

I have created a suffix array using the Princeton implementation. However, my input text document is very large, and the resulting suffix array is over 500 MB in size. Is there a way to compress the suffix array?

Thanks!

Contrary to what the other answer says, you can compress suffix arrays. In fact, compressed suffix trees are usually implemented by first emulating the tree with a suffix array and then compressing that array.

I am not aware of any ready-to-use Java implementation of suffix array compression, and the various existing algorithms are too involved to be described here in detail. There is a paper by Navarro and Mäkinen (DOI 10.1145/1216370.1216372) that provides detailed descriptions and comparisons.

But broadly speaking, there are two general approaches:

Approach A: Reducing the size of the array directly (see section 7.1 of the paper). This involves storing only some of the entries of the suffix array and reconstructing the missing ones when needed. The reconstruction is carried out using a function (called ψ in the paper), which is itself stored in the form of a large array (but not as large as the original suffix array) together with an indexed bit vector.
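To make the sampling idea concrete, here is a minimal sketch, not taken from the paper. It assumes ψ(i) is the rank of the suffix that starts one text position after the suffix at rank i (wrapping around at the end of the text), and that every text position divisible by a sample rate is stored. The class name SampledSuffixArray, the helper naiveSuffixArray and the sampleRate parameter are made up for illustration. Note that ψ is kept here as a plain int[], so this sketch saves no memory by itself; the real structures compress ψ, and they replace the linear-time bit counting below with a constant-time rank structure.

```java
import java.util.Arrays;
import java.util.BitSet;

final class SampledSuffixArray {
    private final int n;
    private final int[] psi;      // psi[i] = rank of the suffix starting at text position SA[i] + 1 (mod n)
    private final BitSet sampled; // bit i is set when SA[i] is stored explicitly
    private final int[] samples;  // the stored SA values, in rank order

    SampledSuffixArray(int[] sa, int sampleRate) {
        n = sa.length;
        int[] inverse = new int[n];
        for (int i = 0; i < n; i++) inverse[sa[i]] = i;
        psi = new int[n];
        sampled = new BitSet(n);
        int count = 0;
        for (int i = 0; i < n; i++) {
            psi[i] = inverse[(sa[i] + 1) % n];
            // Keep only text positions divisible by sampleRate; position 0 is always kept.
            if (sa[i] % sampleRate == 0) {
                sampled.set(i);
                count++;
            }
        }
        samples = new int[count];
        int next = 0;
        for (int i = 0; i < n; i++) {
            if (sampled.get(i)) samples[next++] = sa[i];
        }
    }

    /** Recovers SA[i] by following psi until a sampled rank is reached. */
    int get(int i) {
        int steps = 0;
        while (!sampled.get(i)) {
            i = psi[i];
            steps++;
        }
        // Stand-in for a rank query: count the sampled ranks before i (linear time here).
        int rank = sampled.get(0, i).cardinality();
        // Each psi step moved one text position forward, so walk the same distance back.
        return Math.floorMod(samples[rank] - steps, n);
    }

    /** Quadratic-space helper for small demo inputs only (hypothetical, not the Princeton code). */
    static int[] naiveSuffixArray(String text) {
        Integer[] order = new Integer[text.length()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) sa[i] = order[i];
        return sa;
    }

    public static void main(String[] args) {
        int[] sa = naiveSuffixArray("mississippi");
        SampledSuffixArray csa = new SampledSuffixArray(sa, 4);
        for (int i = 0; i < sa.length; i++) {
            System.out.println(i + ": stored " + sa[i] + ", recovered " + csa.get(i));
        }
    }
}
```

A larger sampleRate stores fewer entries but makes each lookup follow ψ for more steps, which is the space/time trade-off this approach relies on.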

Approach B: The FM approach (see section 9 of the paper). Here, the suffix array is basically replaced with a relatively short array C that indicates the starting positions (in the suffix array) of the main lexicographic buckets (i.e. groups of suffixes starting with the same initial character), combined with another relatively large data structure Occ that enables so-called backward search. Specifically, given a search pattern p = c_1..c_m, it makes it possible to iteratively narrow the bucket for character c_m to a smaller bucket for the string c_{m-1}c_m, then further to the bucket for c_{m-2}c_{m-1}c_m, and so forth, until the final range for the complete pattern p is found. The data structure Occ that enables this is large, but compressible using various techniques, most notably wavelet trees.
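The backward search step can be sketched as follows. This is only an illustration under some assumptions: the text is terminated with a '\0' sentinel, all characters are in the range 1..255, and Occ is stored as a plain, uncompressed table built from the Burrows-Wheeler transform of the text, which is exactly the part a real implementation would compress (for example with a wavelet tree). The class name BackwardSearchSketch and its methods are hypothetical.

```java
import java.util.Arrays;

final class BackwardSearchSketch {
    private final int[] c = new int[257]; // c[ch] = number of characters in the text smaller than ch
    private final int[][] occ;            // occ[ch][i] = occurrences of ch in bwt[0..i)
    private final int n;                  // text length including the sentinel

    BackwardSearchSketch(String text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == 0 || ch > 255) {
                throw new IllegalArgumentException("sketch supports characters 1..255 only");
            }
        }
        String t = text + '\0'; // assumed sentinel, smaller than every other character
        n = t.length();

        // Naive suffix sorting, fine for a demo but far too slow for large inputs.
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> t.substring(a).compareTo(t.substring(b)));

        // Occ table over the BWT: bwt[i] is the character preceding suffix sa[i].
        occ = new int[256][n + 1];
        for (int i = 0; i < n; i++) {
            char bwtChar = t.charAt((sa[i] + n - 1) % n);
            for (int ch = 0; ch < 256; ch++) occ[ch][i + 1] = occ[ch][i];
            occ[bwtChar][i + 1]++;
        }

        // C array: count each character, then turn the counts into bucket start positions.
        for (int i = 0; i < n; i++) c[t.charAt(i) + 1]++;
        for (int ch = 1; ch <= 256; ch++) c[ch] += c[ch - 1];
    }

    /** Returns {lo, hi}: rows [lo, hi) of the suffix array whose suffixes start with pattern. */
    int[] range(String pattern) {
        int lo = 0;
        int hi = n;
        // Process the pattern from its last character to its first.
        for (int k = pattern.length() - 1; k >= 0 && lo < hi; k--) {
            char ch = pattern.charAt(k);
            if (ch > 255) return new int[] {0, 0};
            lo = c[ch] + occ[ch][lo];
            hi = c[ch] + occ[ch][hi];
        }
        return new int[] {lo, hi};
    }

    /** Number of occurrences of a non-empty pattern in the text. */
    int count(String pattern) {
        int[] r = range(pattern);
        return r[1] - r[0];
    }

    public static void main(String[] args) {
        BackwardSearchSketch index = new BackwardSearchSketch("mississippi");
        System.out.println("occurrences of \"ssi\": " + index.count("ssi"));
    }
}
```

The suffix array itself is only needed while building the index in this sketch; the search uses C and Occ alone. Reporting the actual text positions of the matches would additionally need sampled suffix array entries, as in approach A.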

Effects on search performance
The paper cited above contains careful analyses and comparisons. But again broadly speaking, compressing the suffix array will cause the search for a pattern of length m (which can be O(m) in an uncompressed suffix array, if carefully implemented) to be slowed down by a factor that depends (usually logarithmically) on the length of the entire text. Furthermore, any approach making use of wavelet trees introduces an additional dependence on the size of the alphabet.

To my knowledge, you can't compress suffix arrays (maybe you can, I just don't know), but you can compress suffix trees. Thus you might consider changing your data structure. Just Google "compressed suffix tree".

They are used heavily in genetic sequencing and for common substring problems because they can store lots of data.

An explanation can be found here: http://bioinformatics.oxfordjournals.org/content/23/5/629.abstract
If you follow the link at the bottom it takes you to this page where you can download the code for a compressed suffix tree: http://www.cs.helsinki.fi/group/suds/cst/
