I have a method that searches for and deletes documents from my Lucene index.
However, when I run the code twice, it still finds the documents that were marked for deletion in the previous run, and indexReader.hasDeletions() evaluates to true.
    public void duplicatesRemover(String currentIndex) throws Exception {
        Directory directory = FSDirectory.open(new File(currentIndex));
        IndexReader indexReader = IndexReader.open(directory, false);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        int dups = 0;
        for (int i = 0; i < indexReader.numDocs(); i++) {
            Document doc = indexReader.document(i);
            int articleId = Integer.parseInt(doc.get("articleId"));
            Query q = NumericRangeQuery.newIntRange("articleId", articleId, articleId, true, true);
            TopDocs topDocs = indexSearcher.search(q, 10);
            if (topDocs.totalHits > 1) {
                indexReader.deleteDocument(i);
                System.out.print("Total matches from search found: " + topDocs.totalHits + " articleId = " + articleId);
                System.out.println(" total dups found " + ++dups + "/" + i);
            }
        }
        if (indexReader.hasDeletions()) {
            System.out.println("Has deletions");
            Map<String, String> commitUserData = new HashMap<String, String>();
            commitUserData.put("foo", "fighter");
            indexReader.commit(commitUserData);
        }
        indexSearcher.close();
        indexReader.close();
        directory.close();
    }
Many thanks yogi
What Lucene version are you using? The deleteDocument and commit methods on IndexReader are deprecated. Those actions should be done through an IndexWriter, as mentioned here.
Regarding your problem, I don't think it is good practice to manipulate the index while an IndexSearcher is open. I would start by looking in this direction.
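One way to avoid modifying the index while a searcher is open is to split the work into two phases: first read the articleId values and decide which documents are duplicates, then close the searcher and apply the deletions afterwards. The sketch below covers only the first phase, in plain Java with the Lucene calls deliberately left out. The class DuplicateFinder and the method findDuplicatePositions are hypothetical names, and the list of ids is an assumed stand-in for whatever values are read from the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: decides which documents are duplicates.
class DuplicateFinder {

    // articleIds.get(i) stands in for the articleId stored in document i.
    // Returns the positions of every document whose articleId was already seen,
    // so the first occurrence of each id is kept.
    static List<Integer> findDuplicatePositions(List<Integer> articleIds) {
        Set<Integer> seen = new HashSet<Integer>();
        List<Integer> duplicates = new ArrayList<Integer>();
        for (int i = 0; i < articleIds.size(); i++) {
            // Set.add returns false when the value is already present.
            if (!seen.add(articleIds.get(i))) {
                duplicates.add(i);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Integer> articleIds = Arrays.asList(7, 3, 7, 9, 3, 7);
        System.out.println(findDuplicatePositions(articleIds)); // prints [2, 4, 5]
    }
}
```

The returned positions would then be passed to the deletion step once no searcher is open on the index. How they map onto IndexWriter calls depends on the Lucene version and is not shown here.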