
Indexing zip files with Lucene

Original post: 2013-02-15 05:46:20 0 1 java/lucene

Is it possible to index zipped folders in Lucene? If I unzip them, the content is too large. If I just index the bunch of zipped folders containing text files, the search does not work properly. Is it possible for Lucene to index without extracting the zip file?

1 Answer

Lucene is just a search library, and there's no way it can "know" every possible scenario, e.g. how to index XML documents, Word files, files inside a .zip, files created by the Chernobyl power plant, etc.

What Lucene does do is provide an API for you to hook your data into it.

If unzipping the contents of the archive file is not an option, you could write a class that reads the zip file (without unzipping it to disk) and feeds this data into Lucene.
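A minimal sketch of such a reader, assuming the archives contain UTF-8 text files and that each entry fits in memory. The class name `ZipTextReader` and the `sink` callback are hypothetical names for illustration; only `java.util.zip` is used, so the Lucene calls appear as comments showing where a document would be built.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.BiConsumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ZipTextReader {

    /**
     * Streams every file entry of a zip archive and passes (entryName, text)
     * to the sink. Nothing is extracted to disk; only one entry at a time
     * is held in memory.
     */
    static void forEachTextEntry(Path zipPath, BiConsumer<String, String> sink) throws IOException {
        try (InputStream in = Files.newInputStream(zipPath);
             ZipInputStream zip = new ZipInputStream(in, StandardCharsets.UTF_8)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content to index
                }
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                // Assumption: entries are UTF-8 encoded text files.
                String text = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                sink.accept(entry.getName(), text);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // With Lucene on the classpath, the sink would build a Document, e.g.:
        //   Document doc = new Document();
        //   doc.add(new StringField("path", name, Field.Store.YES));
        //   doc.add(new TextField("contents", text, Field.Store.NO));
        //   indexWriter.addDocument(doc);
        forEachTextEntry(Paths.get(args[0]), (name, text) ->
                System.out.println(name + ": " + text.length() + " chars"));
    }
}
```

Storing the entry path alongside the indexed content lets a search hit be traced back to the file inside the archive, which can then be read on demand in the same streaming fashion.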

If your primary concern is the size of the index, there's not much you can do to reduce it. There are a few tips, though:

try indexing without stopwords
do not store the fields, only index them (hint: Field.Store.NO)
always lowercase all terms to reduce the term count
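
To see why the stopword and lowercasing tips shrink the term dictionary, here is a small plain-Java illustration. It does not use Lucene; in a real index this normalization would be done by the analyzer (for example, one built with `LowerCaseFilter` and `StopFilter`). The class name, the tokenizing regex, and the tiny stopword list are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class TermCountSketch {

    // Hypothetical, deliberately tiny stopword list for illustration only.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "is", "of", "and", "to", "in"));

    /** Unique terms when the text is only split on non-letter, non-digit characters. */
    static Set<String> rawTerms(String text) {
        Set<String> terms = new TreeSet<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Unique terms after lowercasing and dropping stopwords. */
    static Set<String> normalizedTerms(String text) {
        Set<String> terms = new TreeSet<>();
        for (String token : rawTerms(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "The index is the heart of Lucene. The Index stores terms, and Lucene searches the index.";
        System.out.println("raw terms:        " + rawTerms(text));
        System.out.println("normalized terms: " + normalizedTerms(text));
    }
}
```

Without normalization, "The", "the", "Index" and "index" each count as a distinct term; after it, only "index" remains from that group.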