I have an index whose documents have two fields (actually more like 800, but the other fields don't concern us here):
The contents field contains the analyzed/tokenized text of the document. The query string is searched for in this field.
The category field contains a single category identifier of the document. There are about 2,500 different categories, and a document may occur in several of them (i.e. a document may have multiple category entries). The results are filtered by this field.
The index contains about 20 million documents and is 5 GB in size.
The index is queried with a user-provided query string, plus an optional set of a few categories the user is not interested in. The question is: how can I exclude documents that match the query string but belong to one of the unwanted categories?
I could use a BooleanQuery with a MUST_NOT clause, i.e. something like this:
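import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery q = new BooleanQuery();
q.add(contentQuery, BooleanClause.Occur.MUST);
for (String unwanted : unwantedCategories) {
    q.add(new TermQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
}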
Is there a way to do this with Lucene filters? Performance is an issue here, and there will only be a few recurring variants of unwantedCategories, so a CachingWrapperFilter would probably help a lot. Also, due to the way the Lucene queries are generated in the existing code base, it is difficult to fit this in, whereas an extra Filter could be introduced easily.
In other words, how do I create a Filter based on what terms must _not_ occur in a document?
One-word answer: BooleanFilter. I found it minutes after formulating the question:
BooleanFilter f = new BooleanFilter();
for (String unwanted : unwantedCategories) {
    // one MUST_NOT clause per unwanted category
    TermsFilter tf = new TermsFilter(new Term("category", unwanted));
    f.add(new FilterClause(tf, BooleanClause.Occur.MUST_NOT));
}
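Conceptually, a filter resolves to a set of permitted document ids, and each MUST_NOT clause clears the ids matched by the filter it wraps. The following is only a minimal sketch of that idea using java.util.BitSet; it is not Lucene's implementation, and the class name ExclusionSketch, the method allowedDocs, and the toy category-to-documents map are made up for illustration.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExclusionSketch {

    /**
     * Returns the ids of documents that belong to none of the unwanted categories.
     * docsByCategory maps a category id to the bit set of documents carrying it.
     */
    public static BitSet allowedDocs(int maxDoc,
                                     Map<String, BitSet> docsByCategory,
                                     List<String> unwantedCategories) {
        BitSet allowed = new BitSet(maxDoc);
        allowed.set(0, maxDoc);                 // start with every document permitted
        for (String unwanted : unwantedCategories) {
            BitSet hits = docsByCategory.get(unwanted);
            if (hits != null) {
                allowed.andNot(hits);           // MUST_NOT: clear documents in this category
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        // Toy index with documents 0..4 and two hypothetical categories.
        BitSet sports = new BitSet();
        sports.set(0);
        sports.set(2);
        BitSet politics = new BitSet();
        politics.set(2);
        politics.set(3);

        Map<String, BitSet> docsByCategory = new HashMap<>();
        docsByCategory.put("sports", sports);
        docsByCategory.put("politics", politics);

        // Exclude everything tagged "sports"; documents 1, 3 and 4 remain.
        System.out.println(allowedDocs(5, docsByCategory, Collections.singletonList("sports")));
    }
}
```

Because the result depends only on the set of unwanted categories and not on the user's query string, it is a natural candidate for caching.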
You can use a QueryWrapperFilter to turn an arbitrary query into a filter. And you can use a CachingWrapperFilter to cache any filter. So something like:
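BooleanQuery bq = new BooleanQuery();
// set up bq; a BooleanQuery with only MUST_NOT clauses matches nothing,
// so include a positive clause as well (e.g. a MatchAllDocsQuery)
Filter myFilter = new CachingWrapperFilter(new QueryWrapperFilter(bq));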
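CachingWrapperFilter keeps its cached results inside the wrapper instance, so the cache only pays off if the same wrapper object is reused whenever the same set of unwanted categories comes up again. One way to arrange that, sketched here with plain Java collections, is a small registry keyed by the category set. The class FilterRegistry and its method filterFor are hypothetical names, and the type parameter F stands in for Lucene's Filter so that the sketch runs without Lucene on the classpath.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class FilterRegistry<F> {

    private final Map<Set<String>, F> filters = new ConcurrentHashMap<>();
    private final Function<Set<String>, F> factory;

    /** factory builds the (caching) filter for one set of unwanted categories. */
    public FilterRegistry(Function<Set<String>, F> factory) {
        this.factory = factory;
    }

    /** Returns the same filter instance for the same categories, regardless of order. */
    public F filterFor(Collection<String> unwantedCategories) {
        Set<String> key = new TreeSet<>(unwantedCategories); // order-insensitive key
        return filters.computeIfAbsent(key, factory);
    }

    public static void main(String[] args) {
        // With Lucene, the factory would build the exclusion filter and wrap it
        // in a CachingWrapperFilter; a plain Object stands in for it here.
        FilterRegistry<Object> registry = new FilterRegistry<>(cats -> new Object());
        Object first = registry.filterFor(Arrays.asList("sports", "politics"));
        Object second = registry.filterFor(Arrays.asList("politics", "sports"));
        System.out.println(first == second); // true: the instance is reused
    }
}
```

With only a few recurring variants of unwantedCategories the map stays small; if the variants were unbounded, some eviction policy would be needed.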