
How to iterate over and update Lucene documents?

I have simple code that iterates over documents and updates them. The index is large: millions of documents, 10-20 GB. This is pseudocode:

Bits liveDocs = MultiFields.getLiveDocs(reader);
DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, liveDocs, field, bytesRef);
int doc;
while ((doc = docsEnum.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
  Document oldDocument = reader.document(doc);
  // ... build newDocument from oldDocument ...
  writer.updateDocument(term, newDocument, analyzer);
  // simple flush policy
  if (doc % 10000 == 0) {
    writer.commit();
  }
}

DocsEnum iterates correctly over the reader it was initialised with. But the segment files referenced by that reader are not deleted while the reader stays open, so the index size roughly doubles on each update pass. After a day of work, the index had grown to terabytes! If I close all readers and writers and reopen the index, the old segments are removed. How can I correctly iterate over and update documents without leaking disk files?
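The "close everything and reopen" step described above does not require a full restart: in Lucene 4.x, DirectoryReader.openIfChanged can swap in a fresh reader so the old one can be closed and its segment files released. A minimal sketch (the helper class ReaderRefresh is my own naming, not from the original post):

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

final class ReaderRefresh {
    /**
     * Swap in a fresh near-real-time reader so the writer can delete
     * obsolete segment files. Returns the (possibly unchanged) reader.
     */
    static DirectoryReader refresh(DirectoryReader reader, IndexWriter writer)
            throws IOException {
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
        if (newReader == null) {
            return reader;      // nothing changed; keep the current reader
        }
        reader.close();         // release the old segments on disk
        return newReader;
    }
}
```

Calling this periodically (e.g. after each commit) keeps at most one reader generation pinned on disk instead of accumulating stale segments.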

I use Java 1.7 and Lucene 4.8.

The best solution I have found is to use IndexSearcher.search() and IndexSearcher.searchAfter().

Something like this:

// inside the iterator
TopDocs docs;
if (lastScore == null) {
    docs = searcher.search(query, filter, limit, Sort.INDEXORDER, false, false);
} else {
    docs = searcher.searchAfter(lastScore, query, filter, limit, Sort.INDEXORDER, false, false);
}
if (docs.scoreDocs.length > 0) {
    lastScore = docs.scoreDocs[docs.scoreDocs.length - 1];
}
for (ScoreDoc scoreDoc : docs.scoreDocs) {
    Document document = searcher.doc(scoreDoc.doc, fields);
}
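Putting the pieces together, here is a self-contained sketch of one safe variant of this pattern. Everything here beyond the search()/commit()/openIfChanged() calls is an assumption of mine, not from the original post: the field names "id" and "state", the batch size, and the use of a marker query ("state":"old") so that updated documents drop out of the result set. Because doc IDs are not stable across reopens, this sketch simply re-runs the search after each reopen rather than carrying a ScoreDoc across readers; searchAfter is what lets you page within a single reader without reopening.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

final class BatchUpdater {
    private static final int BATCH = 1000;   // assumed batch size

    /**
     * Rewrite every document whose "state" field is "old", in batches.
     * After each batch the reader is reopened so the writer can delete
     * obsolete segment files; updated documents no longer match the
     * query, so re-running the search never revisits them.
     */
    static int updateAll(IndexWriter writer) throws IOException {
        DirectoryReader reader = DirectoryReader.open(writer, true);
        int updated = 0;
        try {
            while (true) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs docs = searcher.search(new TermQuery(new Term("state", "old")),
                                               null, BATCH, Sort.INDEXORDER, false, false);
                if (docs.scoreDocs.length == 0) {
                    break;                               // everything updated
                }
                for (ScoreDoc sd : docs.scoreDocs) {
                    Document old = searcher.doc(sd.doc);
                    Document doc = new Document();
                    doc.add(new StringField("id", old.get("id"), Field.Store.YES));
                    doc.add(new StringField("state", "new", Field.Store.YES));
                    writer.updateDocument(new Term("id", old.get("id")), doc);
                    updated++;
                }
                writer.commit();
                // Swap in a fresh reader: the old one is the only thing
                // pinning the stale segment files on disk.
                DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
                if (newReader != null) {
                    reader.close();
                    reader = newReader;
                }
            }
        } finally {
            reader.close();
        }
        return updated;
    }
}
```

With this shape the disk usage stays bounded at roughly one extra reader generation, instead of accumulating every superseded segment for the whole run.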
