Lucene, multi-term search, one term must be exact match

The code below runs as-is with Lucene 7.3.1.

You only need to change the path where the index is stored.

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;

public class Example {

    public static IndexWriter writer;
    public static RAMDirectory idxDir;
    public static SmartChineseAnalyzer analyzer;

    public static void makeIndex() throws IOException {

        FSDirectory fsDir = FSDirectory.open(Paths.get("C:\\Users\\gt\\Desktop\\example"));
        idxDir = new RAMDirectory(fsDir, IOContext.DEFAULT);
        analyzer = new SmartChineseAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setSimilarity(new BM25Similarity());

        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        writer = new IndexWriter(idxDir, iwc);

        List<String> listSent = new ArrayList<String>();
        listSent.add("金古江湖是最好玩的金庸游戏1");
        listSent.add("金古江湖是最好玩的金庸游戏2");
        int id = 0;
        for (String sent : listSent) {
            id++;
            Document doc = new Document();
            doc.add(new TextField("questionType", "A", Field.Store.YES));
            doc.add(new TextField("questionId", "62650ACA7FEB446B9140B088EE7C2FF0", Field.Store.YES));
            doc.add(new TextField("question", sent.trim(), Field.Store.YES));
            writer.addDocument(doc);
        }

        writer.commit();
        writer.close();
    }

    public static void main(String[] args) throws IOException, ParseException {
        makeIndex();

        String[] stringQuery = { "A", "62650ACA7FEB446B9140B088EE7C2FF0aaaa", "金古江湖" };
        String[] fields = { "questionType", "questionId", "question" };
        Occur[] occ = { Occur.MUST, Occur.MUST, Occur.MUST };

//        Query query = new TermQuery(new Term("questionId","1"));
        Query query = MultiFieldQueryParser.parse(stringQuery, fields, occ, analyzer);


        TopDocs results = null;
        IndexReader reader = DirectoryReader.open(idxDir);
        IndexSearcher searcher = new IndexSearcher(reader);
        results = searcher.search(query, 5);
        ScoreDoc[] hits = results.scoreDocs;
        for (int i = 0; i < hits.length; ++i) {
            Document doc = searcher.doc(hits[i].doc);
            String strDocSent = doc.get("question");
            System.out.println(strDocSent);
        }
    }
}

In the code, I add two documents and build an index for them.

Then I search the documents.

I want the questionId field to be matched exactly, but currently it is not.

How can I run a multi-term search in which one of the terms must match exactly, while the other terms can be matched loosely?

It's not so much that it's performing any sort of fuzzy search; rather, your analyzer is splitting the field into words. Your questionId 62650ACA7FEB446B9140B088EE7C2FF0aaaa is split into the following tokens:

  • 62650, aca, 7, feb, 446, b, 9140, b, 088, ee, 7, c, 2, ff, 0, aaaa
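To see for yourself how an analyzer splits a value, you can dump the tokens it emits. This is only a minimal sketch: the class name TokenDump and the helper tokens are made up for illustration, and it assumes the lucene-core and lucene-analyzers-smartcn jars are on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
public class TokenDump {

    // Returns the terms the given analyzer emits for one field value.
    public static List<String> tokens(Analyzer analyzer, String field, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                       // required before the first incrementToken()
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new SmartChineseAnalyzer()) {
            // Prints the tokens produced for the ID used in the question's query.
            System.out.println(tokens(analyzer, "questionId", "62650ACA7FEB446B9140B088EE7C2FF0aaaa"));
        }
    }
}
```

Running the same helper with a KeywordAnalyzer should return the whole value as a single token, which is the behaviour you want for an ID.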

Since you want this to be an exact match, and to behave like an ID in general, you should not apply your usual analyzer to it. IDs like this should generally be indexed with a StringField instead of a TextField, since StringFields are not analyzed.
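On the indexing side, that change could look roughly like the sketch below. The class name IdFieldExample and the method buildDoc are hypothetical; the field names are the ones from the question, and only questionId is switched to StringField.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Hypothetical helper class, not part of Lucene.
public class IdFieldExample {

    // Builds a document whose questionId is indexed as one untokenized term.
    public static Document buildDoc(String type, String id, String question) {
        Document doc = new Document();
        // Left as TextField, as in the question.
        doc.add(new TextField("questionType", type, Field.Store.YES));
        // StringField indexes the whole value verbatim; no analyzer is applied.
        doc.add(new StringField("questionId", id, Field.Store.YES));
        // Free text still goes through the IndexWriter's analyzer.
        doc.add(new TextField("question", question, Field.Store.YES));
        return doc;
    }
}
```

In makeIndex() you would then call writer.addDocument(IdFieldExample.buildDoc("A", "62650ACA7FEB446B9140B088EE7C2FF0", sent.trim())). An index already written with the old field type needs to be rebuilt.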

On the query side, you can just use a simple TermQuery, and combine it with the rest of your query via a BooleanQuery. Or, if you want to work it into the QueryParser, you'll want to use PerFieldAnalyzerWrapper, something like:

Map<String,Analyzer> analyzerlist = new HashMap<>();
analyzerlist.put("questionId", new KeywordAnalyzer());
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new SmartChineseAnalyzer(), analyzerlist);
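The first option, a TermQuery combined with the rest of the query through a BooleanQuery, might look like the following. Treat it as a sketch under assumptions: ExactIdQuery and build are names invented here, questionId is assumed to have been indexed with a StringField, and the other two fields are parsed with the classic QueryParser from the lucene-queryparser module.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper class, not part of Lucene.
public class ExactIdQuery {

    // Requires all three clauses; only the questionId clause bypasses analysis.
    public static Query build(String type, String id, String text, Analyzer analyzer)
            throws ParseException {
        // Exact match: the ID is looked up as a single term, untouched by the analyzer.
        Query idQuery = new TermQuery(new Term("questionId", id));
        // Analyzed clauses; escape() keeps user input from being read as query syntax.
        Query typeQuery = new QueryParser("questionType", analyzer).parse(QueryParser.escape(type));
        Query textQuery = new QueryParser("question", analyzer).parse(QueryParser.escape(text));

        return new BooleanQuery.Builder()
                .add(typeQuery, Occur.MUST)
                .add(idQuery, Occur.MUST)
                .add(textQuery, Occur.MUST)
                .build();
    }
}
```

In main() this would replace the MultiFieldQueryParser.parse(...) call, for example searcher.search(ExactIdQuery.build("A", "62650ACA7FEB446B9140B088EE7C2FF0", "金古江湖", analyzer), 5). With the aaaa suffix appended to the ID, the TermQuery clause should no longer match.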
