
Lucene Index Query does not find document if too many documents/similar documents present

If I create documents as such:

{
    Document document = new Document();
    document.add(new TextField("id", "10384-10735", Field.Store.YES));
    submitDocument(document);
}
{
    Document document = new Document();
    document.add(new TextField("id", "10735", Field.Store.YES));
    submitDocument(document);
}

for (int i = 20000; i < 80000; i += 123) {
    Document otherDoc1 = new Document();
    otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
    submitDocument(otherDoc1);

    Document otherDoc2 = new Document();
    otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
    submitDocument(otherDoc2);
}
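Here, submitDocument is a small helper that is not shown in full; presumably it collects each document into the indexDocuments map that is written to the index below. A sketch of what it is assumed to look like:

private final Map<String, Document> indexDocuments = new HashMap<>();

// Assumed helper (not shown in the question): stores each document in the map
// that is later written to the index, keyed by the document's "id" value.
private void submitDocument(Document document) {
    indexDocuments.put(document.get("id"), document);
}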

meaning:

  • one with an id of 10384-10735
  • one with an id of 10735 (which is the last part of the previous document ID)
  • and 976 other documents (from the loop) with pretty much arbitrary IDs

and then write them using:

final IndexWriterConfig luceneWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
luceneWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

final IndexWriter luceneDocumentWriter = new IndexWriter(luceneDirectory, luceneWriterConfig);

for (Map.Entry<String, Document> indexDocument : indexDocuments.entrySet()) {
    // Index.UNIQUE_LUCENE_DOCUMENT_ID is the "uldid" field seen in the outputs below.
    final Term term = new Term(Index.UNIQUE_LUCENE_DOCUMENT_ID, indexDocument.getKey());
    indexDocument.getValue().add(new TextField(Index.UNIQUE_LUCENE_DOCUMENT_ID, indexDocument.getKey(), Field.Store.YES));

    luceneDocumentWriter.updateDocument(term, indexDocument.getValue());
}

luceneDocumentWriter.close();

Now that the index is written, I want to perform a query, searching for the document with the ID 10384-10735.

I will be doing this in two ways, using the TermQuery and a QueryParser with the StandardAnalyzer:

System.out.println("term query:   " + index.findDocuments(new TermQuery(new Term("id", "10384-10735"))));

final QueryParser parser = new QueryParser(Index.UNIQUE_LUCENE_DOCUMENT_ID, new StandardAnalyzer());
System.out.println("query parser: " + index.findDocuments(parser.parse("id:\"10384 10735\"")));

In both cases, I would expect the document to appear. However, this is the result when I run the queries:

term query:   []
query parser: []

which seems odd. I experimented a bit further and found that if I either reduce the number of documents OR remove the 10735 entry, the query parser query successfully finds the document:

term query:   []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]

meaning this works:

{
    Document document = new Document();
    document.add(new TextField("id", "10384-10735", Field.Store.YES));
    submitDocument(document);
}

for (int i = 20000; i < 80000; i += 123) {
    Document otherDoc1 = new Document();
    otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
    submitDocument(otherDoc1);

    Document otherDoc2 = new Document();
    otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
    submitDocument(otherDoc2);
}

and this works (490 documents):

{
    Document document = new Document();
    document.add(new TextField("id", "10384-10735", Field.Store.YES));
    submitDocument(document);
}
{
    Document document = new Document();
    document.add(new TextField("id", "10735", Field.Store.YES));
    submitDocument(document);
}

for (int i = 20000; i < 50000; i += 123) {
    Document otherDoc1 = new Document();
    otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
    submitDocument(otherDoc1);

    Document otherDoc2 = new Document();
    otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
    submitDocument(otherDoc2);
}

Does somebody know what causes this? I really need the index to find the documents consistently. I'm fine with using the QueryParser rather than the TermQuery.

I am using lucene-core and lucene-queryparser 9.3.0.

Thank you for your help in advance.

Edit 1: This is the code in findDocuments():

final TopDocs topDocs = getIndexSearcher().search(query, Integer.MAX_VALUE);

// Collect the stored form of every matching document.
final List<Document> documents = new ArrayList<>((int) topDocs.totalHits.value);
for (int i = 0; i < topDocs.totalHits.value; i++) {
    documents.add(getIndexSearcher().doc(topDocs.scoreDocs[i].doc));
}

return documents;

Edit 2: Here is a working example: https://pastebin.com/Ft0r8pN5

For some reason, the "too many documents" issue does not occur in this example, which I will look into; I left the toggle in anyway. This is my output:

[similar id: true, many documents: true]
Indexing [3092] documents
term query:   []
query parser: []

[similar id: true, many documents: false]
Indexing [654] documents
term query:   []
query parser: []

[similar id: false, many documents: true]
Indexing [3091] documents
term query:   []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]

[similar id: false, many documents: false]
Indexing [653] documents
term query:   []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]

As you can see, as soon as the document with the ID 10735 is present, the document with the ID 10384-10735 can no longer be found.

At first glance, a possible solution presents itself: the index is currently built using the updateDocument() method with a term passed as the first parameter. When either passing null as the term or using the addDocument() method instead, the query successfully returns the correct values. So the cause must have something to do with the Term:

luceneDocumentWriter.addDocument(indexDocument.getFields());
// or
luceneDocumentWriter.updateDocument(null, indexDocument);

Playing around a bit further: the field name used in the term under which the document is stored cannot be reused as a field name inside the document itself, otherwise the document becomes unsearchable:

final Term term = new Term("uldid", indexDocument.get("id"));

// would work, different key from term...
indexDocument.add(new TextField("uldid2", indexDocument.get("id"), Field.Store.YES));

// would not work...
indexDocument.add(new TextField("uldid", indexDocument.get("id"), Field.Store.YES));

// ...when adding to index using term
luceneDocumentWriter.updateDocument(term, indexDocument);

Another way to circumvent this is to store, in that same field (uldid in this case), a value that is different from the ID being searched in the index:

final Term term = new Term("uldid", indexDocument.get("id").hashCode() + "");
// or
indexDocument.add(new TextField("uldid", indexDocument.get("id").hashCode() + "", Field.Store.YES));

This seems rather odd. I don't have a final explanation for why it behaves this way, but from now on I will use the second option: using the hash of whatever key I want to store the document under as the Term.

Summary

The problem is caused by a combination of (a) the order in which your documents are processed; and (b) the fact that updateDocument first deletes and then inserts data in the index.

When you use writer.updateDocument(term, document), Lucene performs an atomic delete-then-add:

Updates a document by first deleting the document(s) containing term and then adding the new document.
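Conceptually (setting aside the atomicity guarantee), the call behaves like this pair of operations:

// Rough equivalent of writer.updateDocument(term, document):
writer.deleteDocuments(term);   // delete every document containing the exact term
writer.addDocument(document);   // then add the replacement document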

In your case, the order in which documents are processed is based on how they are retrieved from your Java Map - and that is based on how the entries are hashed by the map.

As you note in your answer, you already have a way to avoid this by using your Java object hashes as the updateDocument terms (as long as you don't get any hash collisions).

This answer attempts to explain the "why" behind the results you are seeing.


Basic Demonstration

This is a highly simplified version of your code.

Consider the following two Lucene documents:

final Document documentA = new Document();
documentA.add(new TextField(FIELD_NAME, "10735", Field.Store.YES));
final Term termA = new Term(FIELD_NAME, "10735");
writer.updateDocument(termA, documentA);
            
final Document documentB = new Document();
documentB.add(new TextField(FIELD_NAME, "10384-10735", Field.Store.YES));
final Term termB = new Term(FIELD_NAME, "10384-10735");
writer.updateDocument(termB, documentB);
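The postings listings below come from inspecting the indexed data. One convenient way to reproduce such human-readable dumps is to write the index with the SimpleTextCodec (the line commented out in the longer example further down); this requires the lucene-codecs dependency and is only meant for debugging, not production:

import org.apache.lucene.codecs.simpletext.SimpleTextCodec;

IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setCodec(new SimpleTextCodec()); // index files become plain, readable text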

documentA then documentB:

Lucene has nothing to delete when documentA is added. After the doc is added, the index contains the following:

field id
  term 10735
    doc 0
      freq 1
      pos 0

So we have only one token, 10735.

For documentB, there are no documents in the index containing the exact term 10384-10735 - and therefore nothing is deleted prior to documentB being added to the index.

We end up with the following final indexed data:

field id
  term 10384
    doc 1
      freq 1
      pos 0
  term 10735
    doc 0
      freq 1
      pos 0
    doc 1
      freq 1
      pos 1

When we search for 10384, we get one hit, as expected.

documentB then documentA:

If we swap the order in which the two documents are processed, we see the following after documentB is indexed:

field id
  term 10384
    doc 0
      freq 1
      pos 0
  term 10735
    doc 0
      freq 1
      pos 1

When documentA is indexed, Lucene finds that doc 0 (above) does contain the term 10735 used by termA. Therefore all of the doc 0 entries are deleted from the index before documentA is added.

We end up with the following indexed data (basically a new doc 0, after the original doc 0 was deleted):

field id
  term 10735
    doc 0
      freq 1
      pos 0

Now when we search for 10384, we get zero hits - not what we expected.
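For reference, a sketch of how such a check might look (reader and searcher setup assumed; FIELD_NAME = "id"):

try (IndexReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    Query query = new QueryParser(FIELD_NAME, new StandardAnalyzer()).parse("10384");
    TopDocs topDocs = searcher.search(query, 10);
    System.out.println(topDocs.totalHits); // one hit in the first ordering, zero in the second
}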


More Complicated Demonstration

Things are made more complicated in your scenario by your use of a Java Map to collect the documents to be indexed. This causes the order in which your Lucene documents are indexed to differ from the order in which they are created, due to hashing performed by the map.

Here is another simplified version of your code, but this time it uses a map:

public class MyIndexBuilder {

    private static final String INDEX_PATH = "index";
    private static final String FIELD_NAME = "id";

    private static final Map<String, Document> indexDocuments = new HashMap<>();

    public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
        final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));

        Analyzer analyzer = new StandardAnalyzer();

        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        //iwc.setCodec(new SimpleTextCodec());

        try ( IndexWriter writer = new IndexWriter(dir, iwc)) {
            
            String suffix = "10429";
            
            Document document1 = new Document();
            document1.add(new TextField("id", "10001-" + suffix, Field.Store.YES));
            indexDocuments.put("10001-" + suffix, document1);
            
            Document document2 = new Document();
            document2.add(new TextField("id", suffix, Field.Store.YES));
            indexDocuments.put(suffix, document2);
            
            int max = 10193; // OK
            //int max = 10192; // not OK
            
            for (int i = 10003; i <= max; i += 1) {
                Document otherDoc1 = new Document();
                otherDoc1.add(new TextField(FIELD_NAME, String.valueOf(i), Field.Store.YES));
                indexDocuments.put(String.valueOf(i), otherDoc1);
            }

            System.out.println("Total docs: " + indexDocuments.size());
            for (Map.Entry<String, Document> indexDocument : indexDocuments.entrySet()) {
                if (indexDocument.getKey().contains(suffix)) {
                    // show the order in which the document1 and document2 are indexed:
                    System.out.println(indexDocument.getKey());
                }
                final Term term = new Term(FIELD_NAME, indexDocument.getKey());
                writer.updateDocument(term, indexDocument.getValue());
            }
            
        }
    }

}

In addition to the two documents we are interested in, I add 191 additional (completely unrelated) documents to the index.

When I process the map, I see the following output:

Total docs: 193
10429
10001-10429

So, document2 is indexed before document1 - and our search for 10001 finds one hit.

But if I process fewer of these "extra" documents (190 instead of 191):

int max = 10192; // not OK

...then I get this output:

Total docs: 192
10001-10429
10429

You can see that the order in which document1 and document2 are processed has been flipped - and now that same search for 10001 finds zero hits.

A seemingly unrelated change (processing one fewer document) has caused the retrieval order from the map to change, causing the indexed data to be different.

(I was incorrect in one of my comments in the question, when I noted that the indexed data was apparently identical. It is not the same. I missed that when I was first looking at the indexed data.)
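As an aside: a LinkedHashMap iterates in insertion order, which would at least make the indexing order deterministic while experimenting. It only removes the randomness, though; it does not fix the underlying delete-then-add collision:

// Iterates in insertion order, so documents are indexed in creation order.
private static final Map<String, Document> indexDocuments = new LinkedHashMap<>();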


Recommendation

Consider adding a new field to your Lucene documents, for storing each document's unique identifier.

You could call it doc_id and it would be created as a StringField, not as a TextField.

This would ensure that the contents of this field are never processed by the Standard Analyzer and are stored in the index as a single (presumably unique) token. A StringField is indexed but not tokenized.
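For contrast, the analyzed id field never contains the single token 10384-10735 at all, because the StandardAnalyzer splits the value on the hyphen. This is also why the TermQuery in the question never finds anything: a TermQuery is not analyzed, and the exact term 10384-10735 does not exist in the id field. A small diagnostic sketch:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("id", "10384-10735")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term); // prints "10384", then "10735"
            }
            stream.end();
        }
    }
}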

You can then use this field when building your term to use in the updateDocument() method. And you can use the existing id field for searches.
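A minimal sketch of that setup (the doc_id field name is just an example):

// Analyzed, searchable content - queried as before via the id field:
Document document = new Document();
document.add(new TextField("id", "10384-10735", Field.Store.YES));

// Exact-match identifier - indexed as a single, un-analyzed token:
document.add(new StringField("doc_id", "10384-10735", Field.Store.YES));

// The update term can now only ever match this one document, because no
// analyzer splits "10384-10735" apart in the doc_id field:
writer.updateDocument(new Term("doc_id", "10384-10735"), document);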
