Lucene external document Id deviating from internal index docId

Using Lucene, I am doing some evaluation on rather classic test collections containing documents, queries and relevance files (qrels). The qrels tell us which documents Lucene should return as relevant to a specific query, so Lucene's search quality can be measured (with some parameters, but that's not important right now).

My problem is: the documents in the test collections (e.g. the TIME collection) have their own document IDs. However, these can have gaps (for example, the TIME collection contains 423 documents, but starts with document ID 17 and ends with ID 563). The document ID is indexed and stored as an IntField.

document.add(new IntField(Constants.INDEX_ID_FIELD, testDocument.getId(),Field.Store.YES));

However, I cannot (and probably should not) use the IndexReader.getTermVectors() method to access documents by their external IDs, because the internal docId that Lucene expects in that method does not match the external ID (because of the gaps). I get an error saying "docID must be >= 0 and < maxDoc=423 (got docID=520)".

What would be the preferred way to make Lucene resolve the external document ID 520 to its internal docId, so that I can invoke the getTermVectors method for that document? I tried to get the correct document this way:

IndexSearcher searcher = myTestRunner.indexSearcher;
TermQuery query = new TermQuery(new Term(Constants.INDEX_ID_FIELD, String.valueOf(docIdx)));
TopDocs topdocs = searcher.search(query, 1);
ScoreDoc[] treffer = topdocs.scoreDocs;
int docId = treffer[0].doc;
Terms vector = myTestRunner.indexReader.getTermVector(docId, "content");
// ... some more code follows

However, the Document doesn't seem to be found (but it is in the index - checked using Luke). I always get:

2015-03-19 12:23:25 ERROR ControlView:1002 - 0 java.lang.ArrayIndexOutOfBoundsException: 0
at de.janjan.irtool.querygenerator.QueryGenerator.getFrequencies(QueryGenerator.java:335)

My next idea would be to make the IntField a normal Field, but maybe I'm completely on the wrong track here? Any help would be greatly appreciated.

Thanks a lot in advance! Jan

Regarding Lucene's internal docID (that is, the one you see in ScoreDoc.doc), you shouldn't use it as an external ID. It can change without warning (especially if you ever update documents).
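
To make the point about shifting internal IDs concrete, here is a toy model in plain Java. It is not Lucene code and does not reflect how Lucene actually stores documents. The class DocIdToyModel and its methods are hypothetical names made up for this illustration. The only assumption it mimics is that internal IDs are dense positions from 0 to maxDoc-1, so they can move when documents are removed and the index is compacted, while the external ID stored with the document stays the same.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: mimics dense internal doc IDs, not Lucene's real storage. */
public class DocIdToyModel {
    // The position in this list plays the role of the internal docID (0 .. maxDoc-1).
    private final List<Integer> externalIds = new ArrayList<>();

    /** Adds a document and returns the internal ID it received. */
    public int add(int externalId) {
        externalIds.add(externalId);
        return externalIds.size() - 1;
    }

    /** Removes a document and closes the gap, so later positions shift down. */
    public void deleteAndCompact(int externalId) {
        externalIds.remove(Integer.valueOf(externalId));
    }

    /** Resolves an external ID to the current internal ID, or -1 if absent. */
    public int internalIdOf(int externalId) {
        return externalIds.indexOf(externalId);
    }

    /** Number of documents; valid internal IDs are 0 .. maxDoc()-1. */
    public int maxDoc() {
        return externalIds.size();
    }

    public static void main(String[] args) {
        DocIdToyModel index = new DocIdToyModel();
        index.add(17);
        index.add(520);
        index.add(563);
        System.out.println(index.internalIdOf(520)); // internal ID before the delete
        index.deleteAndCompact(17);
        System.out.println(index.internalIdOf(520)); // internal ID after the delete
        System.out.println(index.maxDoc());
    }
}
```

In this model, external ID 520 is never a valid internal ID because it is larger than maxDoc, which is the same situation as in the error message from the question. The takeaway is to look the document up by its stored ID field each time rather than remembering a ScoreDoc.doc value.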

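Numeric fields (such as IntField) are not indexed as plain text, but rather encoded into a form that makes searching on numeric ranges efficient. That is why your TermQuery built from String.valueOf(docIdx) matches nothing, and reading treffer[0] from the empty result array throws the ArrayIndexOutOfBoundsException. To search for an exact value in such a field, you should use a NumericRangeQuery with the same lower and upper bound, such as: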

Query query = NumericRangeQuery.newIntRange(Constants.INDEX_ID_FIELD, docIdx, docIdx, true, true);

However, if this is a typical id field, I wouldn't use an IntField . Most of the time identifiers like this are composed of digits for convenience, rather than because they represent meaningful numbers. Generally, if it doesn't make sense to search that field with a numeric range, you might be best served by using a StringField instead.
