
Compressing Suffix Arrays in Java

Original post: 2012-02-23 03:50:32. Tags: java, data-structures, compression, suffix-array

I have created a suffix array using the Princeton implementation. However, my base text document is very large, and the resulting suffix array is over 500 MB in size. Is there a way to compress the suffix array?

Thanks!

2 answers

Contrary to what is said in the other answer, you can compress suffix arrays. In fact, compressed suffix trees are usually implemented by first emulating the tree with a suffix array and then compressing that.

I am not aware of any ready-to-use Java implementation of suffix array compression, and the various existing algorithms are too involved to be described here in detail. There is a paper by Navarro and Mäkinen (DOI 10.1145/1216370.1216372) which provides detailed descriptions and comparisons.

But broadly speaking, there are two general approaches:

Approach A: Reducing the size of the array directly (see section 7.1 of the paper). This involves storing only some of the entries of the suffix array and interpolating the missing ones when needed. The interpolation is carried out using a function (called ψ in the paper), which is itself stored in the form of a large array (but not as large as the original suffix array) and an indexed bit vector.
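To make the interpolation idea a bit more concrete, here is a minimal sketch of my own (not code from the paper). It assumes the text ends with a unique smallest terminator such as '$', keeps only the suffix array entries whose value is a multiple of a hypothetical sampleRate, and recovers the others by following ψ until a stored entry is reached. The class and method names are made up for illustration. Note that ψ is kept as a plain int array here, so this sketch saves no space at all; the actual saving comes from compressing ψ and indexing the bit vector, which the paper describes. The naive suffix sort is only suitable for small test strings.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative sketch only: names and layout are assumptions, not the paper's data structure.
final class SampledSuffixArraySketch {
    private final int n;
    private final int[] psi;       // psi[i] = suffix-array index of the suffix starting at SA[i] + 1
    private final BitSet sampled;  // bit i is set when SA[i] is stored explicitly
    private final int[] samples;   // the stored SA values, in suffix-array order

    SampledSuffixArraySketch(int[] sa, int sampleRate) {
        n = sa.length;
        int[] isa = new int[n];    // inverse suffix array, needed only during construction
        for (int i = 0; i < n; i++) isa[sa[i]] = i;
        psi = new int[n];
        for (int i = 0; i < n; i++) psi[i] = isa[(sa[i] + 1) % n];
        sampled = new BitSet(n);
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (sa[i] % sampleRate == 0) { sampled.set(i); count++; }
        }
        samples = new int[count];
        for (int i = 0, j = 0; i < n; i++) {
            if (sampled.get(i)) samples[j++] = sa[i];
        }
    }

    // Recovers SA[i]: each psi step moves one text position to the right,
    // so the answer is the sampled value minus the number of steps taken.
    int get(int i) {
        int steps = 0;
        while (!sampled.get(i)) { i = psi[i]; steps++; }
        // Naive rank query; a real index would answer this in constant time.
        int rank = sampled.get(0, i).cardinality();
        return Math.floorMod(samples[rank] - steps, n);
    }

    // Naive construction for small inputs; stands in for whatever builder you already use.
    static int[] naiveSuffixArray(String text) {
        Integer[] idx = new Integer[text.length()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        int[] sa = new int[idx.length];
        for (int i = 0; i < sa.length; i++) sa[i] = idx[i];
        return sa;
    }
}
```

Text position 0 is always sampled, so the loop in get always terminates, after at most sampleRate - 1 steps for most entries. A larger sampleRate stores fewer entries but makes each lookup slower.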

Approach B: The FM approach (see section 9 of the paper). Here, the suffix array is basically replaced with a relatively short array C that indicates the starting positions (in the suffix array) of the main lexicographic buckets (i.e., groups of suffixes starting with the same initial character), combined with another relatively large data structure Occ that enables so-called backward search. Specifically, given a search pattern p = c_1 .. c_m, it makes it possible to iteratively narrow the bucket for character c_m to a smaller bucket for the string c_{m-1} c_m, and then further to the bucket for c_{m-2} c_{m-1} c_m, and so forth, until the final range for the complete pattern p is found. The data structure Occ that enables this is large, but compressible using various techniques, most notably wavelet trees.
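Backward search itself is short enough to sketch. The following is my own illustration, not code from the paper, under these assumptions: the text ends with a unique smallest terminator such as '$', every character fits in one byte, and Occ is stored as a plain uncompressed table (which is exactly the large part that a wavelet tree would replace). The class name FmIndexSketch and its fields are hypothetical, and the naive suffix sort is only meant for small test strings.

```java
import java.util.Arrays;

// Illustrative sketch only: an uncompressed C array and Occ table with backward search.
final class FmIndexSketch {
    private static final int SIGMA = 256;        // assumption: all characters are below 256
    private final int n;
    private final int[] c = new int[SIGMA + 1];  // c[ch] = number of text characters smaller than ch
    private final int[][] occ;                   // occ[ch][i] = occurrences of ch in bwt[0..i)

    FmIndexSketch(String text) {
        n = text.length();
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        // Naive suffix sort, fine for a demo but not for a 500 MB input.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        occ = new int[SIGMA][n + 1];
        for (int i = 0; i < n; i++) {
            // The Burrows-Wheeler character is the one preceding suffix sa[i].
            char bwtChar = text.charAt((sa[i] + n - 1) % n);
            for (int ch = 0; ch < SIGMA; ch++) occ[ch][i + 1] = occ[ch][i];
            occ[bwtChar][i + 1]++;
            c[text.charAt(i) + 1]++;
        }
        for (int ch = 1; ch <= SIGMA; ch++) c[ch] += c[ch - 1];
    }

    // Counts occurrences of pattern by narrowing the suffix-array range [lo, hi)
    // from the last pattern character to the first.
    int count(String pattern) {
        int lo = 0, hi = n;
        for (int k = pattern.length() - 1; k >= 0 && lo < hi; k--) {
            char ch = pattern.charAt(k);
            if (ch >= SIGMA) return 0;
            lo = c[ch] + occ[ch][lo];
            hi = c[ch] + occ[ch][hi];
        }
        return Math.max(0, hi - lo);
    }
}
```

For example, new FmIndexSketch("mississippi$").count("ssi") narrows the range from the bucket for "i" to the one for "si" and then "ssi". Note that the search never touches the suffix array or the text after construction; locating the actual positions (rather than counting) additionally needs sampled suffix array entries.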

Effects on search performance
The paper cited above contains careful analyses and comparisons. But again broadly speaking, compressing the suffix array will cause the search for a pattern of length m (which can be O(m) in an uncompressed suffix array, if carefully implemented) to be slowed down by a factor that depends (usually logarithmically) on the length of the entire text. Furthermore, any approach making use of wavelet trees means an additional dependence on the size of the alphabet.
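As a point of reference for the uncompressed case, here is a sketch of the simplest search on a plain suffix array: two binary searches that find the range of suffixes starting with the pattern. This basic variant needs O(m log n) character comparisons; the O(m) bound mentioned above requires auxiliary tables that are not shown. The class name and helper methods are my own, and the naive suffix array construction is only for small examples.

```java
import java.util.Arrays;

// Illustrative baseline: plain binary search over an uncompressed suffix array.
final class PlainSuffixArraySearchSketch {
    // Naive construction for small inputs; stands in for whatever builder you already use.
    static int[] naiveSuffixArray(String text) {
        Integer[] idx = new Integer[text.length()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        int[] sa = new int[idx.length];
        for (int i = 0; i < sa.length; i++) sa[i] = idx[i];
        return sa;
    }

    // Compares the first pattern.length() characters of the suffix at start with the pattern.
    private static int comparePrefix(String text, int start, String pattern) {
        int end = Math.min(text.length(), start + pattern.length());
        return text.substring(start, end).compareTo(pattern);
    }

    // With upper == false: first index whose suffix prefix is >= pattern.
    // With upper == true: first index whose suffix prefix is > pattern.
    private static int bound(String text, int[] sa, String pattern, boolean upper) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = comparePrefix(text, sa[mid], pattern);
            if (cmp < 0 || (upper && cmp == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Number of occurrences of pattern in text.
    static int count(String text, int[] sa, String pattern) {
        return bound(text, sa, pattern, true) - bound(text, sa, pattern, false);
    }
}
```

Each of the O(log n) steps compares up to m characters against the text, which is why both the suffix array and the text itself must stay in memory for this variant.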

To my knowledge you can't compress suffix arrays (maybe you can; I just don't know), but you can compress suffix trees. Thus you might consider changing your data structures. Just Google "compressed suffix tree".

They are used heavily in genetic sequencing and for common substring problems because they can store lots of data.

An explanation can be found here: http://bioinformatics.oxfordjournals.org/content/23/5/629.abstract
If you follow the link at the bottom, it takes you to this page, where you can download the code for a compressed suffix tree: http://www.cs.helsinki.fi/group/suds/cst/