
Fast search in compressed text files

I need to be able to search for text in a large number of zipped files (.txt). The compression may be changed to something else, or may even become proprietary. I want to avoid unpacking every file; instead, I would like to compress (encode) the search string and search directly in the compressed files. This should be possible using Huffman compression with the same codebook for all files. I don't want to re-invent the wheel, so: does anyone know of a library that does something like this, an implemented and tested Huffman algorithm, or maybe a better idea?

Thanks in advance.

Most text files are compressed with one of the LZ family of algorithms, which combine a dictionary coder with an entropy coder such as Huffman.

Because the dictionary coder relies on a continuously updated "dictionary", its coding output depends on the history (all entries in the dictionary derived from the input data up to the current symbol), so it is not possible to jump to an arbitrary location and start decoding without first decoding all of the preceding data.

In my opinion, you can just use a zlib stream decoder, which returns decompressed data as it goes, without waiting for the entire file to be decompressed. This will not save execution time, but it will save memory.
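
A minimal sketch of that idea in Python (the file path, chunk size, and helper name are my assumptions): it streams a zlib/gzip file through zlib.decompressobj and searches each chunk as it arrives, keeping a small overlap so matches spanning chunk boundaries are not missed.

    import zlib

    def search_in_gzip_stream(path, needle):
        """Scan a zlib/gzip-compressed file for `needle` without
        holding the whole decompressed text in memory."""
        d = zlib.decompressobj(wbits=47)  # 32+15: accept zlib or gzip headers
        tail = b""   # overlap so matches spanning chunks are still found
        offset = 0   # absolute offset of `tail` in the decompressed stream
        with open(path, "rb") as f:
            while chunk := f.read(64 * 1024):
                data = tail + d.decompress(chunk)
                idx = data.find(needle)
                if idx != -1:
                    return offset + idx  # byte offset in decompressed text
                keep = len(needle) - 1
                offset += len(data) - keep
                tail = data[-keep:] if keep else b""
        return -1

    # usage: search_in_gzip_stream("big.txt.gz", b"search string")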

A second suggestion is to do Huffman coding on English words, and forget about the dictionary-coder part. Each English word gets mapped to a unique prefix-free code.
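
A rough sketch of that word-level Huffman coding in Python (assuming one shared codebook is built from all the files up front; the function name is mine): once the codebook is fixed, a search phrase always encodes to the same bit pattern, so it can be matched directly against the compressed data.

    import heapq
    from collections import Counter
    from itertools import count

    def word_huffman_codes(text):
        """Build a prefix-free bit code for every word, using word
        frequencies as Huffman weights. Returns {word: bitstring}."""
        freq = Counter(text.split())
        tie = count()  # tie-breaker so heapq never compares subtrees
        heap = [(f, next(tie), w) for w, f in freq.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, str):
                codes[node] = prefix or "0"  # one-word corpus edge case
            else:
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
        walk(heap[0][2], "")
        return codes

    codes = word_huffman_codes("the quick fox and the lazy dog and the cat")
    encoded = "".join(codes[w] for w in "the cat".split())
    # "the cat" encodes to the same bit pattern in every file,
    # because all files share one codebook

Note that since the codes are not byte-aligned, the actual search has to be done at the bit level, or the codes padded to whole bytes, trading space for a simpler search.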

Finally, @SHODAN gave the most sensible suggestion, which is to index the files, compress the index, and bundle it with the compressed text files. To do a search, decompress just the index file and look up the words. This is in fact an improvement over doing Huffman coding on words: once you have found the frequency of each word (in order to assign the prefix codes optimally), you have already built the index, so you might as well keep the index for searching.
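
A minimal sketch of that scheme (assuming the texts are gzip-compressed; the file names and whitespace tokenization are my assumptions): build an inverted index mapping each word to the files that contain it, store the index compressed alongside the texts, and decompress only the index at query time.

    import gzip
    import json
    import re
    from collections import defaultdict

    def build_index(paths, index_path="index.json.gz"):
        """Map each word to the files containing it, and store the
        index gzip-compressed next to the compressed text files."""
        index = defaultdict(set)
        for path in paths:
            with gzip.open(path, "rt", encoding="utf-8") as f:
                for word in re.findall(r"\w+", f.read().lower()):
                    index[word].add(path)
        with gzip.open(index_path, "wt", encoding="utf-8") as f:
            json.dump({w: sorted(ps) for w, ps in index.items()}, f)

    def lookup(word, index_path="index.json.gz"):
        """Decompress only the index and return the files to inspect."""
        with gzip.open(index_path, "rt", encoding="utf-8") as f:
            return json.load(f).get(word.lower(), [])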

It is unlikely you'll be able to search for uncompressed strings in a compressed file. I guess one of your best options is to index the files somehow. Using Lucene, perhaps?

Searching for text in compressed files can be faster than searching for the same thing in uncompressed text files.

One compression technique I've seen that sacrifices some space in order to do fast searches (a code sketch follows the list below):

  • Maintain a dictionary with 2^16 entries, one for every word in the text. Reserve the first 256 entries for literal bytes, in case you come upon a word that isn't in the dictionary (many large texts have fewer than 32,000 unique words, so in practice they never need to use those literal bytes).
  • Compress the original text by substituting the 16-bit dictionary index for each word.
  • (optional) In the normal case where two words are separated by a single space character, discard that space character; otherwise, put the run of bytes between two words into the dictionary as a special "word" (for example, ". ", ", ", and "\n"), tagged with a "no default space" attribute, and then "compress" those strings by replacing them with the corresponding dictionary index.
  • Search for words or phrases by compressing the phrase in the same way, and then searching for the compressed string of bytes in the compressed text, in exactly the same way you would search for the original string in the original text.
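
Here is a hedged sketch of the scheme above in Python (the helper names and the fixed big-endian code layout are my assumptions, and the optional space-elision step is omitted): a word search becomes a code-aligned 2-byte pattern match.

    import struct

    def build_word_dict(text):
        """Assign each distinct word a 16-bit code; codes 0-255 are
        reserved for literal bytes, as described above."""
        words = sorted(set(text.split()))
        assert len(words) <= 0x10000 - 256, "too many unique words"
        return {w: i + 256 for i, w in enumerate(words)}

    def compress(text, word_code):
        """Replace every word with its 16-bit code (big-endian)."""
        return b"".join(struct.pack(">H", word_code[w]) for w in text.split())

    def find_word(compressed, word, word_code):
        """Search for a single word as a 2-byte pattern. A match must
        be aligned to a code boundary, hence the even-offset check."""
        pat = struct.pack(">H", word_code[word])
        i = compressed.find(pat)
        while i != -1 and i % 2 != 0:   # odd offset: straddles two codes
            i = compressed.find(pat, i + 1)
        return i // 2 if i != -1 else -1  # word position, not byte offset

    text = "the quick fox jumps over the lazy dog"
    wc = build_word_dict(text)
    blob = compress(text, wc)
    print(find_word(blob, "lazy", wc))  # -> 6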

In particular, searching for a single word usually reduces to comparing 16-bit indices in the compressed text, which is faster than searching for that word in the original text, because

  • each comparison requires comparing fewer bytes (2, rather than however many bytes were in that word), and
  • we're doing fewer comparisons, because the compressed file is shorter.

Some kinds of regular expressions can be translated into another regular expression that directly finds items in the compressed file (and perhaps also finds a few false positives). Such a search also does fewer comparisons than using the original regular expression on the original text file, because the compressed file is shorter; but typically each regular-expression comparison requires more work, so it may or may not be faster than the original regex operating on the original text.
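
One simple way to realize that idea under the word-code scheme above (my assumption, not necessarily what the author had in mind): match the regex against the dictionary once, then scan the compressed bytes for the 2-byte code of any matching word.

    import re
    import struct

    def find_regex(compressed, pattern, word_code):
        """Translate a word-level regex into a search over 16-bit codes:
        match the pattern against the dictionary once, then scan for the
        2-byte code of any word that matched."""
        rx = re.compile(pattern)
        hits = {struct.pack(">H", c) for w, c in word_code.items()
                if rx.fullmatch(w)}
        for i in range(0, len(compressed) - 1, 2):  # code-aligned scan
            if compressed[i:i + 2] in hits:
                return i // 2  # word position of first match
        return -1

    # reusing wc/blob from the previous sketch:
    # find_regex(blob, r"colou?r|lazy", wc)  -> 6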

(In principle you could replace the fixed-length 16-bit codes with variable-length Huffman prefix codes, as rwong mentioned; the resulting compressed file would be smaller, but the software to deal with those files would be a little slower and more complicated.)

For more sophisticated techniques, you might look at the literature on compressed pattern matching and self-indexing.

I may be completely wrong here, but I don't think there would be a reliable way to search for a given string without decoding the files. My understanding of compression algorithms is that the bit stream corresponding to a given string depends greatly on what comes before that string in the uncompressed file. You may be able to find a given encoding for a particular string in a particular file, but I'm pretty sure it wouldn't be consistent between files.

This is possible, and can be done quite efficiently. There is a lot of exciting research on this topic, more formally known as succinct data structures. Some topics I would recommend looking into: wavelet trees, the FM-index/RRR, and succinct suffix arrays. You can also efficiently search Huffman-encoded strings, as a number of publications have demonstrated.
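
To make the FM-index idea concrete, here is a toy sketch (naive rotation-sort BWT and a full occ table, so it is only suitable for small inputs; real implementations use compressed rank structures such as RRR instead): it counts pattern occurrences by backward search over the Burrows-Wheeler transform.

    from collections import Counter

    def bwt(text):
        """Burrows-Wheeler transform via naive rotation sort (toy sizes)."""
        s = text + "\0"                      # unique sentinel, sorts first
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    def fm_count(text, pattern):
        """Count occurrences of `pattern` in `text` by backward search."""
        L = bwt(text)
        # C[c] = number of characters in the text strictly smaller than c
        freq = Counter(L)
        C, total = {}, 0
        for c in sorted(freq):
            C[c] = total
            total += freq[c]
        # occ[c][i] = occurrences of c in L[:i] (a real FM-index compresses this)
        occ = {c: [0] * (len(L) + 1) for c in freq}
        for i, ch in enumerate(L):
            for c in occ:
                occ[c][i + 1] = occ[c][i] + (ch == c)
        top, bot = 0, len(L)
        for ch in reversed(pattern):
            if ch not in C:
                return 0
            top = C[ch] + occ[ch][top]
            bot = C[ch] + occ[ch][bot]
            if top >= bot:
                return 0
        return bot - top

    print(fm_count("abracadabra", "abra"))  # -> 2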
