
Incremental indexing with Lucene

I'm building an application in Java using Lucene 3.6 and want to make the indexing incremental. I have already created the index, and I have read that what you have to do is open the existing index and, for each document, compare the modification date stored in the index with the file's modification date; if they differ, delete the document from the index and re-add it. My problem is that I do not know how to do that with Lucene in Java.

Thanks

My code is:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class Indexer {

    public static void main(String[] args)
        throws CorruptIndexException, LockObtainFailedException,
               IOException {

        File docDir = new File("D:\\PRUEBASLUCENE");
        File indexDir = new File("C:\\PRUEBA");

        Directory fsDir = FSDirectory.open(indexDir);
        Analyzer an = new StandardAnalyzer(Version.LUCENE_36);
        IndexWriter indexWriter
            = new IndexWriter(fsDir, an, MaxFieldLength.UNLIMITED);

        long numChars = 0L;
        for (File f : docDir.listFiles()) {
            String fileName = f.getName();
            Document d = new Document();
            d.add(new Field("Name", fileName,
                            Store.YES, Index.NOT_ANALYZED));
            d.add(new Field("Path", f.getPath(), Store.YES, Index.ANALYZED));
            long tamano = f.length();
            d.add(new Field("Size", "" + tamano, Store.YES, Index.ANALYZED));
            long fechalong = f.lastModified();
            d.add(new Field("Modification_Date", "" + fechalong, Store.YES, Index.ANALYZED));
            indexWriter.addDocument(d);
        }

        indexWriter.optimize();
        // numDocs() must be read before close(); calling it on a closed
        // writer throws AlreadyClosedException
        int numDocs = indexWriter.numDocs();
        indexWriter.close();

        System.out.println("Index Directory=" + indexDir.getCanonicalPath());
        System.out.println("Doc Directory=" + docDir.getCanonicalPath());
        System.out.println("num docs=" + numDocs);
        System.out.println("num chars=" + numChars);
    }
}


Thanks Edmondo1984, you are helping me a lot.

Finally I wrote the code shown below: it stores a hash for the file and then checks the modification date.

Indexing 9300 files takes 15 seconds, and re-indexing (when no file has changed, so nothing should need updating) also takes 15 seconds. Am I doing something wrong, or can the code be optimized to take less time?


Thanks jtahlborn: by doing that I managed to get the IndexReader update time down to the same as creating the index from scratch. Isn't updating an existing index supposed to be faster than recreating it? Is it possible to optimize the code further?

if (IndexReader.indexExists(dir))
            {
                // reader is an IndexReader and is passed as a parameter to the function
                // searcher is an IndexSearcher and is passed as a parameter to the function
                // NOTE: File.hashCode() hashes the path, not the contents, so this
                // term identifies the file but does not detect content changes
                term = new Term("Hash", String.valueOf(file.hashCode()));
                Query termQuery = new TermQuery(term);
                TopDocs topDocs = searcher.search(termQuery, 1);
                if (topDocs.totalHits == 1)
                {
                    int docId = topDocs.scoreDocs[0].doc;
                    Document doc = reader.document(docId);
                    // the field was stored as raw File.lastModified() millis
                    // (field name must match the one used at indexing time),
                    // so parse it back to a long and compare longs directly;
                    // the original code compared epoch millis against a
                    // DateTools.timeToString() value, i.e. different units
                    long dateIndLong = Long.parseLong(doc.get("Modification_Date"));
                    long dateFichLong = file.lastModified();
                    int comparedate = Long.valueOf(dateFichLong).compareTo(Long.valueOf(dateIndLong));
                    if (comparedate == 0)
                    {
                        // dates are equal: do nothing
                        flag = 2;
                    }
                    else if (comparedate > 0)
                    {
                        // file is newer than the index: updateDocument
                        flag = 1;
                    }
                }

According to the Lucene data model, you store documents inside the index. Inside each document you have the fields you want to index, the so-called "analyzed" fields, and the fields that are not "analyzed", where you can store a timestamp and any other information you might need later.
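One practical detail when storing a timestamp in a not-analyzed field: if you want lexicographic term order to match chronological order (which is what Lucene's DateTools.timeToString() achieves), the value needs a fixed width. A minimal sketch of that idea, with a hypothetical helper class that is not part of the thread's code or of Lucene's API:

```java
// Sketch (assumption): encode epoch millis as a fixed-width string so that
// String comparison of the stored field values matches numeric order,
// the same idea behind Lucene's DateTools. Hypothetical helper names.
public class TimestampField {

    // 13 digits is enough for current epoch-millis values
    public static String encode(long millis) {
        return String.format("%013d", millis);
    }

    public static long decode(String stored) {
        return Long.parseLong(stored);
    }
}
```

The encoded string is what you would pass to the Field constructor with Store.YES and Index.NOT_ANALYZED, and decode it again when comparing against File.lastModified().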

I have the feeling you are confusing files and documents: in your first post you speak about documents, and now you are trying to call IndexFileNames.isDocStoreFile(file.getName()), which only tells you whether a file is one of the files making up a Lucene index itself.

If you understand the Lucene object model, writing the code you need takes approximately three minutes:

  • You have to check whether the document already exists in the index (for example by storing a non-analyzed field containing a unique identifier), simply by querying Lucene.
  • If your query returns 0 documents, you add the new document to the index.
  • If your query returns 1 document, you read its "timestamp" field and compare it to that of the new document you are trying to store. Then, if necessary, you can use the document's docId to delete it from the index and add the new one.
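The decision in those three steps can be sketched as plain Java, independent of the Lucene calls (the class, method, and enum names here are hypothetical, introduced only for illustration):

```java
// Sketch of the add/update/skip decision for one file.
// storedModified is the timestamp read back from the index
// (-1 meaning the query returned 0 documents); currentModified
// is File.lastModified() for the file on disk.
public class ReindexDecision {

    public enum Action { ADD, UPDATE, SKIP }

    public static Action decide(long storedModified, long currentModified) {
        if (storedModified < 0) {
            return Action.ADD;            // not in the index yet
        }
        if (currentModified > storedModified) {
            return Action.UPDATE;         // file changed since it was indexed
        }
        return Action.SKIP;               // index is up to date
    }
}
```

ADD maps to IndexWriter.addDocument(), UPDATE to updateDocument() (or a delete followed by an add), and SKIP to doing nothing.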

If, on the other hand, you are sure you always want to overwrite the previous value, you can refer to this snippet from Lucene in Action:

public void testUpdate() throws IOException {
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    doc.add(new Field("id", "1",
                      Field.Store.YES,
                      Field.Index.NOT_ANALYZED));
    doc.add(new Field("country", "Netherlands",
                      Field.Store.YES,
                      Field.Index.NO));
    doc.add(new Field("contents",
                      "Den Haag has a lot of museums",
                      Field.Store.NO,
                      Field.Index.ANALYZED));
    doc.add(new Field("city", "Den Haag",
                      Field.Store.YES,
                      Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "1"), doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}

As you can see, the snippet uses a non-analyzed ID, as I was suggesting, to keep a simple queryable attribute, and the updateDocument method, which first deletes the document and then re-adds it.

You might want to check the javadoc directly at:

http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,org.apache.lucene.document.Document)
