
Lucene search engine isn't accurate, can't figure out why

I am building a search engine for the first time, using the Apache Lucene library. Everything works fine, but when I search for more than one word, for example "computer science", the results aren't accurate: I never get documents that contain both words. It seems to search the documents for each word separately (I get documents that contain either "computer" or "science", but never both).

I've been staring at my code for almost a week now and I can't figure out the problem. The query parsing seems to work perfectly, so I think the problem might be in the search step, but I don't know what I'm doing wrong. I'd be grateful for any help.

    public static wikiPage[] index(String searchQuery) throws SQLException, IOException, ParseException {

        String sql = "select * from Record";
        ResultSet rs = db.runSql(sql);

        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);

        // 1. Indexer
        try (IndexWriter w = new IndexWriter(index, config)) {
            while (rs.next()) {
                String RecordID = rs.getString("RecordID");
                String URL = rs.getString("URL");
                String Title = rs.getString("Title");
                String Info = rs.getString("Info");

                addDoc(w, RecordID, URL, Info, Title);
            }
        } catch (Exception e) {
            System.out.print(e);
            index.close();
        }

        // 2. Query
        MultiFieldQueryParser multipleQueryParser = new MultiFieldQueryParser(new String[]{"Title", "Info"}, new StandardAnalyzer());
        Query q = multipleQueryParser.parse(searchQuery);

        // 3. Search
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs results = searcher.search(q, 10000);
        ScoreDoc[] hits = results.scoreDocs;

        // 4. Display results
        wikiPage[] resultArray = new wikiPage[hits.length];
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            resultArray[i] = new wikiPage(d.get("URL"), d.get("Title"));
            System.out.println((i + 1) + ". " + d.get("Title") + "\t" + d.get("URL"));
        }
        reader.close();
        return resultArray;
    }

    private static void addDoc(IndexWriter w, String RecordID, String URL, String Info, String Title) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("RecordID", RecordID, Field.Store.YES));
        doc.add(new TextField("Title", Title, Field.Store.YES));
        doc.add(new TextField("URL", URL, Field.Store.YES));
        doc.add(new TextField("Info", Info, Field.Store.YES));

        w.addDocument(doc);
    }

This is the output of System.out.println(q.toString()):

  (Title:computer Info:computer) (Title:science Info:science)

If you want to search for it as a phrase (that is, find "computer" and "science" next to each other), surround the query with quotes, so that it looks like "computer science". In your code, you could do something like:

Query q = multipleQueryParser.parse("\"" + searchQuery + "\"");
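One caveat with wrapping raw input in quotes is that a quote or backslash typed by the user can break the phrase. The sketch below is one possible way to handle that, assuming the classic query parser's backslash escaping inside quoted phrases; the class PhraseQueries and the method toPhraseQuery are hypothetical names, not part of Lucene.

```java
public final class PhraseQueries {
    private PhraseQueries() {}

    /**
     * Wraps raw user input in double quotes so it can be parsed as a phrase.
     * Assumption: the parser treats a backslash as the escape character.
     */
    public static String toPhraseQuery(String userInput) {
        String escaped = userInput
                .replace("\\", "\\\\")   // escape backslashes first
                .replace("\"", "\\\"");  // then escape embedded double quotes
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // The returned string is what you would hand to multipleQueryParser.parse(...)
        System.out.println(toPhraseQuery("computer science"));
        System.out.println(toPhraseQuery("the \"C\" language"));
    }
}
```

The result of toPhraseQuery(searchQuery) would then be passed to parse in place of the hand-built string.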

If you just want to find docs that contain both terms somewhere in the document, but not necessarily next to each other, the query should look like +computer +science. Probably the easiest way to do this is to change the default operator of your query parser, which is OR unless you set it otherwise:

multipleQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query q = multipleQueryParser.parse(searchQuery);
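To make the difference between the two default operators concrete, here is a toy model in plain Java. It is not how Lucene works internally (there is no index, no scoring, and the whitespace tokenizer is only a crude stand-in for an analyzer); the class DefaultOperatorDemo and its methods are hypothetical names used purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public final class DefaultOperatorDemo {
    public enum Operator { OR, AND }

    /** Lowercases and splits on whitespace; a crude stand-in for an analyzer. */
    public static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    /** Returns the indices of docs that match any (OR) or all (AND) query terms. */
    public static List<Integer> search(List<String> docs, String query, Operator op) {
        Set<String> terms = tokens(query);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Set<String> docTokens = tokens(docs.get(i));
            boolean match = (op == Operator.AND)
                    ? docTokens.containsAll(terms)                   // every term required
                    : terms.stream().anyMatch(docTokens::contains);  // any term suffices
            if (match) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Computer science basics",
                "Computer hardware",
                "Political science",
                "Cooking recipes");
        System.out.println(search(docs, "computer science", Operator.OR));  // [0, 1, 2]
        System.out.println(search(docs, "computer science", Operator.AND)); // [0]
    }
}
```

With OR, any document containing either word is a hit; with AND, only the document containing both words remains.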

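Per the Lucene query syntax documentation, you can mark required clauses with + or join them with AND (and spell out OR for readability). Note that + goes in front of a whole clause, not after the field name.

Try this:

(Title:computer OR Info:computer) AND (Title:science OR Info:science)

You could build this string yourself and pass it to the parser directly.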
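Building such a string by hand could look roughly like the sketch below. It splits the input on whitespace and produces one OR group per term, joined with AND. QueryStringBuilder and build are hypothetical names, and the sketch assumes the terms contain no query-syntax special characters (nothing is escaped here).

```java
import java.util.ArrayList;
import java.util.List;

public final class QueryStringBuilder {
    private QueryStringBuilder() {}

    /** Builds "(f1:t OR f2:t) AND (f1:u OR f2:u)" from whitespace-separated terms. */
    public static String build(String[] fields, String searchQuery) {
        List<String> groups = new ArrayList<>();
        for (String term : searchQuery.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields no groups
            }
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            groups.add("(" + String.join(" OR ", perField) + ")");
        }
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        // prints (Title:computer OR Info:computer) AND (Title:science OR Info:science)
        System.out.println(build(new String[]{"Title", "Info"}, "computer science"));
    }
}
```

Changing the parser's default operator is usually less code, but an explicit string like this makes the intended boolean structure visible when debugging.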
