

lucene: how to perform an incremental indexing and avoid 'delete and redo'

I have a folder (MY_FILES) that contains around 500 files, and each day one new file arrives and is placed there. Each file is around 4 MB.

I've just developed a simple 'void main' to test whether I can search those files for a specific wildcard. It works just fine.

The problem is that I'm currently deleting the old indexed_folder and reindexing from scratch. This takes a lot of time and is obviously inefficient. What I'm looking for is 'incremental indexing': if the index already exists, just add the new files to it.

I was wondering whether Lucene has some kind of mechanism to check if a 'doc' has already been indexed before trying to index it. Something like writer.isDocExists?

Thanks!

My code looks like this:

    // build the writer
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter writer = new IndexWriter(fsDir, config);
    writer.deleteAll();  // must - otherwise searches return duplicated results

    // build the docs and add them to the writer
    File dir = new File(MY_FILES);
    File[] files = dir.listFiles();
    int counter = 0;
    for (File file : files)
    {
        String path = file.getCanonicalPath();
        Document doc = new Document();
        doc.add(new Field("filename", file.getName(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("path", path, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("content", new FileReader(file)));  // tokenized stream, not stored

        writer.addDocument(doc);
        System.out.println("indexing " + file.getName() + " " + ++counter + "/" + files.length);
    }

First, you should use IndexWriter.updateDocument(Term, Document) instead of IndexWriter.addDocument to update documents; this will prevent your index from containing duplicate entries.
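For instance (a sketch against the Lucene 3.6 API; it assumes the path field is indexed as NOT_ANALYZED, since the Term must match the stored value exactly and an analyzed field gets tokenized):

```java
// Delete any previously indexed version of this file's document, then add
// the new one, in a single call. The "path" field must be indexed as
// Field.Index.NOT_ANALYZED for the Term to match it exactly.
Term id = new Term("path", file.getCanonicalPath());
writer.updateDocument(id, doc);  // delete-then-add; no duplicates
```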

To perform incremental indexing, you should add the last-modified time stamp to the documents in your index, and only index files that are newer.

EDIT: more details on incremental indexing

Your documents should have at least two fields:

  • the path of the file
  • the time stamp of the last time the file was modified.
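With the Lucene 3.6 API, those two fields might be built like this (a sketch; the field names "path" and "modified" are just illustrative choices):

```java
Document doc = new Document();
// untokenized path, so it can later be used as an exact Term for updateDocument
doc.add(new Field("path", file.getCanonicalPath(),
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
// numeric last-modified stamp, stored and indexed so it can be sorted
// and range-queried
doc.add(new NumericField("modified", Field.Store.YES, true)
            .setLongValue(file.lastModified()));
```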

Before starting to index, just search your index for the latest time stamp, then crawl your directory to find all files whose time stamp is newer than the newest one in the index.
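Looking up the newest time stamp could be done by sorting a match-all query on the numeric field (a sketch, assuming the "modified" field was written as a NumericField as above):

```java
// Find the newest "modified" stamp currently in the index.
IndexReader reader = IndexReader.open(fsDir);
IndexSearcher searcher = new IndexSearcher(reader);
Sort byModifiedDesc = new Sort(new SortField("modified", SortField.LONG, true));
TopDocs top = searcher.search(new MatchAllDocsQuery(), 1, byModifiedDesc);
long newest = 0L;  // empty index: treat everything as new
if (top.totalHits > 0) {
    Document newestDoc = searcher.doc(top.scoreDocs[0].doc);
    newest = Long.parseLong(newestDoc.get("modified"));
}
searcher.close();
```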

This way, your index is updated every time a file changes.
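The directory-crawling half is plain Java; a minimal sketch (the class and method names are made up for illustration):

```java
import java.io.File;
import java.io.FileFilter;

public class IncrementalCrawler {

    // Return the files in dir whose last-modified stamp is strictly newer
    // than sinceMillis (e.g. the newest stamp found in the index).
    static File[] newerThan(File dir, final long sinceMillis) {
        return dir.listFiles(new FileFilter() {
            public boolean accept(File f) {
                return f.isFile() && f.lastModified() > sinceMillis;
            }
        });
    }
}
```

Each file returned would then be handed to updateDocument, so files that changed (rather than newly arrived) replace their old entries instead of duplicating them.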

If you want to check whether a document is already present in the index, one method is to generate the associated Lucene query and run it with an IndexSearcher against the index.

For instance, here you could build a query over the fields filename, path and content to check whether the document is already present in the index.

You will need an IndexSearcher besides your IndexWriter, and you can follow the Lucene query syntax to generate the full-text query you provide to Lucene, e.g.:

 filename:myfile path:mypath content:mycontent

    IndexSearcher indexSearcher = new IndexSearcher(IndexReader.open(directory));
    QueryParser parser = new QueryParser(Version.LUCENE_36, "content", analyzer);
    Query query = parser.parse("filename:myfile path:mypath content:mycontent");
    indexSearcher.search(query, collector);

In the code above, collector provides a callback method collect, which is called with a document id for every document in the index that matches the query.
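A minimal collector for this existence check could look like the following sketch against the Lucene 3.6 Collector API (it only records whether anything matched at all):

```java
// Records whether the query matched at least one document.
final AtomicBoolean found = new AtomicBoolean(false);
Collector collector = new Collector() {
    private int docBase;
    public void setScorer(Scorer scorer) { /* scores not needed */ }
    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;  // per-segment offset
    }
    public void collect(int doc) {
        found.set(true);  // absolute id would be docBase + doc
    }
    public boolean acceptsDocsOutOfOrder() { return true; }
};
```

After the search, found.get() tells you whether the document already exists in the index.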
