
Indexing zip files with Lucene

Is it possible to index zipped folders in Lucene? If I unzip them, the content is too large. If I just index the zipped folders containing text files, the search does not work properly. Can Lucene index the contents without extracting the zip files?

Lucene is just a search library, and there is no way it can "know" every possible scenario, e.g. how to index XML documents, Word files, files inside a .zip archive, files created by the Chernobyl power plant, etc.

What Lucene does is provide an API that lets you hook your own data into it.

If unzipping the contents of the archive is not an option, you could write a class that reads the zip file entry by entry (without extracting anything to disk) and feeds the text into Lucene.
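A minimal sketch of such a class, using only `java.util.zip` from the JDK (Java 9+ for `readAllBytes`). The class name `ZipTextReader`, the method `forEachTextEntry`, and the handler callback are hypothetical names chosen for illustration; the sketch also assumes the entries are UTF-8 text files small enough to hold in memory one at a time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.function.BiConsumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipTextReader {

    /**
     * Reads every file entry of the archive straight from the zip and passes
     * (entry name, entry text) to the handler. Nothing is extracted to disk.
     * Returns the number of entries handed to the handler.
     */
    public static int forEachTextEntry(Path zipPath, BiConsumer<String, String> handler)
            throws IOException {
        int count = 0;
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directories carry no text
                }
                // The entry is decompressed in memory only.
                try (InputStream in = zip.getInputStream(entry)) {
                    // Assumption: entries are UTF-8 encoded text.
                    String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    // Hypothetical hook: this is where a Lucene document
                    // would be built and added to the index.
                    handler.accept(entry.getName(), text);
                    count++;
                }
            }
        }
        return count;
    }
}
```

Inside the handler you would typically create a Lucene `Document`, add the entry name and the text as fields, and pass it to `IndexWriter.addDocument`. For very large entries, handing Lucene a `Reader` over the entry stream instead of a `String` is probably the better choice.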

If your primary concern is the size of the index, there is not much you can do to shrink it drastically, but a few tips help:

  • try indexing without stopwords
  • do not store the fields, only index them (hint: Field.Store.NO)
  • always lowercase all terms to reduce term count
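
To see why the first and last tips reduce the term count, here is a small plain-Java illustration. It does not use Lucene's analyzers; the class `TermCountDemo`, the tiny stopword list, and the whitespace/punctuation tokenizer are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class TermCountDemo {

    // Assumption: a tiny illustrative stopword list, not Lucene's default set.
    static final Set<String> STOPWORDS = Set.of("the", "a", "of", "is", "and");

    /** Returns the distinct terms of the text under the chosen normalization. */
    public static Set<String> distinctTerms(String text, boolean lowercase, boolean dropStopwords) {
        Set<String> terms = new TreeSet<>();
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            String term = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            // Stopword matching is case-insensitive in this sketch.
            if (dropStopwords && STOPWORDS.contains(term.toLowerCase(Locale.ROOT))) {
                continue;
            }
            terms.add(term);
        }
        return terms;
    }
}
```

For the text "The index of the Index is the index", the raw tokens give 6 distinct terms, lowercasing brings that down to 4, and dropping stopwords as well leaves only "index". Fewer distinct terms means a smaller term dictionary, which is the effect the tips above aim for.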


 