

How to keep Lucene index without deleted documents

This is my first question on Stack Overflow, so wish me luck.

I am doing a classification process over a Lucene index with Java, and I need to update a document field named category. I have been using Lucene 4.2 with the IndexWriter updateDocument() function for that purpose, and it works very well, except for the deletion part. Even if I use the forceMergeDeletes() function after the update, the index still shows me some already deleted documents. For example, if I run the classification over an index with 1000 documents, the final number of documents in the index stays the same and everything works as expected, but when I increase the index to 10000 documents, the index shows some already deleted documents, though not all of them. So, how can I actually erase those deleted documents from the index?

Here are some snippets of my code:

public static void main(String[] args) throws IOException, ParseException {
    ///////////////////////Preparing config data////////////////////////////
    File indexDir = new File("/indexDir");
    Directory fsDir = FSDirectory.open(indexDir);

    IndexWriterConfig iwConf = new IndexWriterConfig(Version.LUCENE_42, new WhitespaceSpanishAnalyzer());
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    IndexWriter indexWriter = new IndexWriter(fsDir, iwConf);

    IndexReader reader = DirectoryReader.open(fsDir);
    IndexSearcher indexSearcher = new IndexSearcher(reader);
    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(100);
    AtomicReader ar = new SlowCompositeReaderWrapper((CompositeReader) reader);

    classifier.train(ar, "text", "category", new WhitespaceSpanishAnalyzer());

    System.out.println("***Before***");
    showIndexedDocuments(reader);
    System.out.println("***Before***");

    int maxdoc = reader.maxDoc();
    int j = 0;
    // Walk every document in the point-in-time reader and re-classify its text.
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String clusterClasif = doc.get("category");
        String text = doc.get("text");
        String docid = doc.get("doc_id");
        ClassificationResult<BytesRef> result = classifier.assignClass(text);
        String classified = result.getAssignedClass().utf8ToString();

        // When the classifier assigns a different category, rewrite the document:
        // updateDocument() deletes the old document matched by doc_id and adds the new one.
        if (!classified.isEmpty() && clusterClasif.compareTo(classified) != 0) {
            Term term = new Term("doc_id", docid);
            doc.removeField("category");
            doc.add(new StringField("category",
                    classified, Field.Store.YES));
            indexWriter.updateDocument(term, doc);
            j++;
        }
    }
    // Ask the writer to merge away segments that contain deletions before closing.
    indexWriter.forceMergeDeletes(true);
    indexWriter.close();
    System.out.println("Classified documents count: " + j);        
    System.out.println();
    reader.close();

    reader = DirectoryReader.open(fsDir);
    System.out.println("Deleted docs: " + reader.numDeletedDocs());
    System.out.println("***After***");
    showIndexedDocuments(reader);
}

private static void showIndexedDocuments(IndexReader reader) throws IOException {
    int maxdoc = reader.maxDoc();
    for (int i = 0; i < maxdoc; i++) {
        Document doc = reader.document(i);
        String idDoc = doc.get("doc_id");
        String text = doc.get("text");
        String category = doc.get("category");

        System.out.println("Id Doc: " + idDoc);
        System.out.println("Category: " + category);
        System.out.println("Text: " + text);
        System.out.println();
    }
    System.out.println("Total: " + maxdoc);
}

I have spent many hours looking for a solution to this. Some say that the deleted documents in the index are not important and that they will eventually be erased as we keep adding documents to the index, but I need to control that process so that I can iterate over the index documents at any time and be sure the documents I retrieve are actually the live ones. Lucene versions prior to 4.0 had a function in the IndexReader class named isDeleted(docId) that indicated whether a document had been marked as deleted. That could be half of the solution to my problem, but I have not found a way to do the same with version 4.2 of Lucene. If you know how to do that, I would really appreciate it if you shared it.

You can check whether a document is deleted using the MultiFields class, like:

Bits liveDocs = MultiFields.getLiveDocs(reader);
if (!liveDocs.get(docID)) ...
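
Note that MultiFields.getLiveDocs(reader) can return null when the reader has no deletions at all, so it is worth guarding against a null Bits before calling get(); the adapted snippet below includes that check.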

So, working this into your code, perhaps something like:

int maxdoc = reader.maxDoc();
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < maxdoc; i++) {
    // Skip documents that are marked as deleted (liveDocs is null when there are none).
    if (liveDocs != null && !liveDocs.get(i)) continue;
    Document doc = reader.document(i);
    String idDoc = doc.get("doc_id");
    ....
}
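
If you just want to report how many documents are actually live, the reader exposes that directly. A minimal sketch (reusing the same reader variable; the print statements are only illustrative): numDocs() counts only live documents, while maxDoc() and numDeletedDocs() also account for documents that are merely marked as deleted and not yet merged away.

// A minimal sketch: numDocs() excludes deleted documents, while maxDoc()
// still includes documents that are marked deleted but not yet merged away.
System.out.println("Live docs:    " + reader.numDocs());
System.out.println("Max doc:      " + reader.maxDoc());
System.out.println("Deleted docs: " + reader.numDeletedDocs());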

By the way, it sounds like you have previously been working with 3.x and are now on 4.x. The Lucene Migration Guide is very helpful for understanding these sorts of changes between versions, and how to resolve them.
