简体繁体中英

Fast search in compressed text files

原文 2011-04-06 06:32:32 1 5 c++/ algorithm/ full-text-search/ compression/ huffman-code

I need to be able to search for text in a large number of files (.txt) that are zipped. Compression may be changed to something else or even became proprietary. I want to avoid unpacking all files and compress (encode) the search string and search in compressed files. This should be possible using Huffman compression with the same codebook for all files. I don't want to re-invent the wheel, so .. anyone knows a library that does something like this or Huffman algorithm that is implemented and tested, or maybe a better idea ?

thanks in advance

5 answers

Most text files are compressed with one of the LZ-family of algorithms, which combine a Dictionary Coder together with an Entropy Coder such as Huffman.

Because the Dictionary Coder relies on a continuously-updated "dictionary", its coding result is dependent on the history (all codes in the dictionary that is derived from the input data up to the current symbol), so it is not possible to jump into a certain location and start decoding, without first decoding all of the previous data.

In my opinion, you can just use a zlib stream decoder which returns decompressed data as it goes without waiting for the entire file to be decompressed. This will not save execution time but will save memory.

A second suggestion is to do Huffman coding on English words, and forget about the Dictionary Coder part. Each English word gets mapped to a unique prefix-free code.

Finally, @SHODAN gave the most sensible suggestion, which is to index the files, compress the index and bundle with the compressed text files. To do a search, decompress just the index file and look up the words. This is in fact an improvement over doing the Huffman coding on words - once you found the frequency of words (in order to assign the prefix code optimally), you have already built the index, so you can keep the index for searching.

It is unlikely you'll be able to search for uncompressed strings in a compressed file. I guess one for your best options is to index the files somehow. Using Lucene perhaps?

Searching for text in compressed files can be faster than searching for the same thing in uncompressed text files.

One compression technique I've seen that sacrifices some space in order to do fast searches:

maintain a dictionary with 2^16 entries of every word in the text. Reserve the first 256 entries for literal bytes, in case you come upon a word that isn't in the dictionary -- even though many large texts have fewer than 32,000 unique words, so they never need to use those literal bytes.
Compress the original text by substituting the 16-bit dictionary index for each word.
(optional) In the normal case that two words are separated by a single space character, discard that space character; otherwise put all the bytes in the string between words into the dictionary as a special "word" (for example, ". " and ", " and "\\n") tagged with the "no default spaces" attribute, and then "compress" those strings by replacing them with the corresponding dictionary index.
Search for words or phrases by compressing the phrase in the same way, and searching for the compressed string of bytes in the compressed text in exactly the same way you would search for the original string in the original text.

In particular, searching for a single word usually reduces to comparing the 16-bit index in the compressed text, which is faster than searching for that word in the original text, because

each comparison requires comparing fewer bytes -- 2, rather than however many bytes were in that word, and
we're doing fewer comparisons, because the compressed file is shorter.

Some kinds of regular expressions can be translated to another regular expression that directly finds items in the compressed file (and also perhaps also finds a few false positives). Such a search also does fewer comparisons than using the original regular expression on the original text file, because the compressed file is shorter, but typically each regular expression comparison requires more work, so it may or may not be faster than the original regex operating on the original text.

(In principle you could replace the fixed-length 16-bit codes with variable-length Huffman prefix codes, as rwong mentioned -- the resulting compressed file would be smaller, but the software to deal with those files would be a little slower and more complicated).

For more sophisticated techniques, you might look at

MG4J: Managing Gigabytes for Java
"Managing Gigabytes: Compressing and Indexing Documents and Images" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell

I may be completely wrong here, but I don't think there'd be a reliable way to search for a given string without decoding the files. My understanding of compressions algorithms is that the bit-stream corresponding to a given string would depend greatly on what comes before the string in the uncompressed file. You may be able to find a given encoding for a particular string in a given file, but I'm pretty sure it wouldn't be consistent between files.

This is possible, and can be done quite efficiently. There's a lot of exciting research on this topic, more formally known as a Succinct data structure. Some topics I would recommend looking into: Wavelet tree, FM-index/RRR, succinct suffix arrays. You can also efficiently search Huffman encoded strings, as a number of publications have demonstrated.

parsing compressed files with flex & bison?

In NTFS Compressed Directory, How to read Files compressed and uncompressed size?

Fast string search?

Fast in string search

How to search and display text files in c++

Search words in two text files c++

Correct data structure for fast insert and fast search?

Parse files the fast way?

Send and receiving compressed files with boost over socket

How to read compressed files c++

暂无

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

Related Question parsing compressed files with flex & bison? In NTFS Compressed Directory, How to read Files compressed and uncompressed size? Fast string search? Fast in string search How to search and display text files in c++ Search words in two text files c++ Correct data structure for fast insert and fast search? Parse files the fast way? Send and receiving compressed files with boost over socket How to read compressed files c++

Related Tags

Fast search in compressed text files

Question

5 answers

solution1
8 ACCPTED 2011-04-06 06:51:56

solution2
3 2011-04-06 06:42:36

solution3
3 2011-07-22 21:17:06

solution4
2 2011-04-06 06:40:31

solution5
0 2017-12-28 06:52:45

Fast search in compressed text files

Question

5 answers

solution1 8 ACCPTED 2011-04-06 06:51:56

solution2 3 2011-04-06 06:42:36

solution3 3 2011-07-22 21:17:06

solution4 2 2011-04-06 06:40:31

solution5 0 2017-12-28 06:52:45

solution1
8 ACCPTED 2011-04-06 06:51:56

solution2
3 2011-04-06 06:42:36

solution3
3 2011-07-22 21:17:06

solution4
2 2011-04-06 06:40:31

solution5
0 2017-12-28 06:52:45