
Lucene search engine isn't accurate, can't figure out why

I am trying to create a search engine for the first time, using the Apache Lucene library. Everything works fine, but when I search for more than one word, for example "computer science", the results aren't accurate because I never get documents that contain both words. It searches the documents for each word separately (I get documents that contain either "computer" or "science", but never both).

I've been staring at my code for almost a week now and I can't figure out the problem. The query parsing seems to work perfectly, so I think the problem might be in the search, but I don't know what I'm doing wrong. If you can help me, I'll be grateful.

public static wikiPage[] index(String searchQuery) throws SQLException, IOException, ParseException {

    String sql = "select * from Record";
    ResultSet rs = db.runSql(sql);

    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);

    // 1. Indexer
    try (IndexWriter w = new IndexWriter(index, config)) {
        while (rs.next()) {
            String RecordID = rs.getString("RecordID");
            String URL = rs.getString("URL");
            String Title = rs.getString("Title");
            String Info = rs.getString("Info");

            addDoc(w, RecordID, URL, Info, Title);
        }
    } catch (Exception e) {
        System.out.print(e);
        index.close();
    }

    // 2. Query
    MultiFieldQueryParser multipleQueryParser = new MultiFieldQueryParser(new String[]{"Title", "Info"}, new StandardAnalyzer());
    Query q = multipleQueryParser.parse(searchQuery);

    // 3. Search
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs results = searcher.search(q, 10000);
    ScoreDoc[] hits = results.scoreDocs;

    // 4. Display results
    wikiPage[] resultArray = new wikiPage[hits.length];
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        resultArray[i] = new wikiPage(d.get("URL"), d.get("Title"));
        System.out.println((i + 1) + ". " + d.get("Title") + "\t" + d.get("URL"));
    }
    reader.close();
    return resultArray;
}

private static void addDoc(IndexWriter w, String RecordID, String URL, String Info, String Title) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("RecordID", RecordID, Field.Store.YES));
    doc.add(new TextField("Title", Title, Field.Store.YES));
    doc.add(new TextField("URL", URL, Field.Store.YES));
    doc.add(new TextField("Info", Info, Field.Store.YES));

    w.addDocument(doc);
}

This is the output of System.out.println(q.toString()):

  (Title:computer Info:computer) (Title:science Info:science)

If you want to search it as a phrase (that is, finding "computer" and "science" together), surround the query with quotes, so it looks like "computer science". In your code, you could do something like:

Query q = multipleQueryParser.parse("\"" + searchQuery + "\"");
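
Plain concatenation works for simple input, but a search string that already contains a double quote or a backslash would produce a malformed phrase. The sketch below is one way to guard against that. `PhraseQueryText` and `toPhrase` are hypothetical names, not part of Lucene, and the helper assumes the whole input is meant to be a single phrase.

```java
// Hypothetical helper, not part of Lucene: wraps raw user input in double
// quotes so the query parser treats it as one phrase.
final class PhraseQueryText {

    private PhraseQueryText() {}

    /** Returns the trimmed input as a quoted phrase, escaping embedded backslashes and quotes. */
    static String toPhrase(String userInput) {
        String trimmed = userInput.trim();
        StringBuilder sb = new StringBuilder(trimmed.length() + 2);
        sb.append('"');
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            // A backslash escapes the next character in the query syntax.
            if (c == '\\' || c == '"') {
                sb.append('\\');
            }
            sb.append(c);
        }
        sb.append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "computer science" including the surrounding quotes
        System.out.println(toPhrase("computer science"));
    }
}
```

The result would then be passed to the parser in place of the manual concatenation, for example multipleQueryParser.parse(PhraseQueryText.toPhrase(searchQuery)).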

If you just want to find docs that contain both terms somewhere in the document, but not necessarily together, the query should look like +computer +science. By default the parser joins terms with OR, which is why each word is matched separately. Probably the easiest way to do this is to change the default operator of your query parser:

multipleQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query q = multipleQueryParser.parse(searchQuery);
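
To make the +computer +science form concrete, here is a rough sketch of what marking every term as required amounts to at the string level. `RequiredTerms` and `requireAll` are hypothetical names, and the helper assumes plain whitespace-separated words with no operators or special characters in the input. Setting the default operator as shown above is normally the simpler route.

```java
// Hypothetical helper, not part of Lucene: prefixes every whitespace-separated
// word with '+', which marks it as required in the query syntax.
final class RequiredTerms {

    private RequiredTerms() {}

    /** Turns "computer science" into "+computer +science". */
    static String requireAll(String userInput) {
        StringBuilder query = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields an empty query string
            }
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append('+').append(term);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(requireAll("computer science")); // prints +computer +science
    }
}
```

With the default operator set to AND, q.toString() would be expected to show each group as required, roughly +(Title:computer Info:computer) +(Title:science Info:science), instead of the two optional groups in the question.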

As per the docs, prefix required terms with + and use AND (and OR for readability).

Try this:

+(Title:computer OR Info:computer) AND +(Title:science OR Info:science)

You could build this string and use it directly.
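
A query string along these lines can be assembled from the field names and the search words. The sketch below is one possible way to do it. `MultiFieldQueryText` and `requireInAnyField` are hypothetical names, and the code assumes plain whitespace-separated words without special characters. It places the + in front of each parenthesized group, because in the classic query syntax the modifier comes before the field name rather than after the colon.

```java
// Hypothetical helper, not part of Lucene: builds one required group per word,
// where each group matches the word in any of the given fields.
final class MultiFieldQueryText {

    private MultiFieldQueryText() {}

    static String requireInAnyField(String[] fields, String userInput) {
        StringBuilder query = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields an empty query string
            }
            if (query.length() > 0) {
                query.append(" AND ");
            }
            query.append("+(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) {
                    query.append(" OR ");
                }
                query.append(fields[i]).append(':').append(term);
            }
            query.append(')');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"Title", "Info"};
        // prints +(Title:computer OR Info:computer) AND +(Title:science OR Info:science)
        System.out.println(requireInAnyField(fields, "computer science"));
    }
}
```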
