
Lucene: Filtering for documents NOT containing a Term

I have an index whose documents have two fields (actually more like 800 fields but the other fields won't concern us here):

  • The contents field contains the analyzed/tokenized text of the document. The query string is searched for in this field.
  • The category field contains a category identifier of the document. There are about 2500 different categories, and a document may occur in several of them (i.e. a document may have multiple category entries). The results are filtered by this field.

The index contains about 20 million documents and is 5 GB in size.

The index is queried with a user-provided query string, plus an optional set of a few categories the user is not interested in. The question is: how can I exclude documents that match the query string but belong to any of the unwanted categories?

I could use a BooleanQuery with MUST_NOT clauses, i.e. something like this:

BooleanQuery q = new BooleanQuery();
q.add(contentQuery, BooleanClause.Occur.MUST);
for (String unwanted : unwantedCategories) {
    // one MUST_NOT clause per unwanted category
    q.add(new TermQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
}

Is there a way to do this with Lucene filters? Performance is an issue here, and there will only be a few recurring variants of unwantedCategories, so a CachingWrapperFilter would probably help a lot. Also, because of the way the Lucene queries are generated in the existing code base, it is difficult to fit extra clauses in, whereas an extra Filter could be introduced easily.

In other words: how do I create a Filter based on what terms must _not_ occur in a document?

One-word answer: BooleanFilter. I found it minutes after formulating the question:

BooleanFilter f = new BooleanFilter();
for (String unwanted : unwantedCategories) {
    // exclude every document that carries this category term
    TermsFilter tf = new TermsFilter(new Term("category", unwanted));
    f.add(new FilterClause(tf, BooleanClause.Occur.MUST_NOT));
}
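
To make the semantics of a filter built only from MUST_NOT clauses concrete, here is a minimal plain-Java model using java.util.BitSet. It is not Lucene API: the class name MustNotFilterSketch, the postings map (category to doc ids) and the method names are all made up for illustration. The idea it sketches is: start with every document allowed, then clear the documents that carry any unwanted category.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;

/** Plain-Java model of a MUST_NOT-only filter; hypothetical names, not Lucene API. */
final class MustNotFilterSketch {

    /**
     * @param maxDoc   number of documents in the (hypothetical) index
     * @param postings category -> bit set of doc ids that carry this category
     * @param unwanted categories to exclude
     * @return bit set of doc ids that carry none of the unwanted categories
     */
    static BitSet allowedDocs(int maxDoc, Map<String, BitSet> postings, List<String> unwanted) {
        BitSet allowed = new BitSet(maxDoc);
        allowed.set(0, maxDoc);            // start with every document allowed
        for (String category : unwanted) {
            BitSet docs = postings.get(category);
            if (docs != null) {
                allowed.andNot(docs);      // clear documents carrying this category
            }
        }
        return allowed;
    }

    /** Intersects the query hits with the allowed set, the way a filter restricts a query. */
    static BitSet applyFilter(BitSet queryHits, BitSet allowed) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(allowed);
        return result;
    }
}
```

With five documents where category "a" is on docs 0 and 2, excluding "a" leaves docs 1, 3 and 4 allowed; intersecting that with the query hits gives the final result set.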

You can use a QueryWrapperFilter to turn an arbitrary query into a filter. And you can use a CachingWrapperFilter to cache any filter. So something like:

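BooleanQuery bq = new BooleanQuery();
// set up bq; note that a BooleanQuery with only MUST_NOT clauses matches nothing,
// so add a positive clause such as MatchAllDocsQuery alongside the exclusions
Filter myFilter = new CachingWrapperFilter(new QueryWrapperFilter(bq));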
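
Since only a few variants of unwantedCategories recur, caching pays off only if the same cached filter is reused for the same set of categories. Below is a rough plain-Java sketch of that idea, keyed by the set of unwanted categories. It is a model, not Lucene API: ExclusionCacheSketch, allowedDocs and buildCount are hypothetical names, and the builder function stands in for whatever actually computes the allowed documents. It also ignores that Lucene's CachingWrapperFilter caches per index reader.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Plain-Java model: one cached "allowed docs" bit set per set of unwanted categories. */
final class ExclusionCacheSketch {
    private final Map<Set<String>, BitSet> cache = new ConcurrentHashMap<>();
    private final Function<Set<String>, BitSet> builder;   // hypothetical: computes the allowed docs
    private final AtomicInteger builds = new AtomicInteger();

    ExclusionCacheSketch(Function<Set<String>, BitSet> builder) {
        this.builder = builder;
    }

    /** Returns the allowed docs for this category set, building them only on a cache miss. */
    BitSet allowedDocs(Set<String> unwanted) {
        // defensive, immutable copy so later changes by the caller cannot corrupt the key
        Set<String> key = Collections.unmodifiableSet(new HashSet<>(unwanted));
        BitSet cached = cache.computeIfAbsent(key, k -> {
            builds.incrementAndGet();
            return builder.apply(k);
        });
        return (BitSet) cached.clone();    // hand out a copy; the cached entry stays untouched
    }

    /** Number of times the builder actually ran (i.e. cache misses). */
    int buildCount() {
        return builds.get();
    }
}
```

Because the key is a Set, the order in which the user lists the unwanted categories does not matter: {"a", "b"} and {"b", "a"} hit the same cache entry.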
