Adding weights to documents in Lucene 8

Question

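I am currently building a small search engine for university using Lucene 8. I built it earlier, but without applying any weights to the documents.

I now need to add each document's PageRank as its weight, and I have already computed the PageRank values. How can I add a weight to a Document object (not to the query terms) in Lucene 8? I have looked up many solutions online, but they only work for older versions of Lucene.

Here is my (updated) code, which builds a Document object from a File object: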

public static Document getDocument(File f) throws FileNotFoundException, IOException {
    Document d = new Document();

    //adding a field
    FieldType contentType = new FieldType();
    contentType.setStored(true);
    contentType.setTokenized(true);
    contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    contentType.setStoreTermVectors(true);

    String fileContents = String.join(" ", Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
    d.add(new Field("content", fileContents, contentType));

    //adding other fields, then...

    //the boost coefficient (updated):
    double coef = 1.0 + ranks.get(path);
    d.add(new DoubleDocValuesField("boost", coef));

    return d;

}

The issue with my current approach is that I would need a CustomScoreQuery object to search the documents, but that class is not available in Lucene 8. I also don't want to downgrade to Lucene 7 after all the code I have written against Lucene 8.

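Edit:

After some (lengthy) research, I added a DoubleDocValuesField holding the boost to each document (see the updated code above) and, as suggested by @EricLavault, searched with a FunctionScoreQuery. However, now every document's score is exactly its boost value, regardless of the query! How do I fix this? Here is my search function: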

public static TopDocs search(String query, IndexSearcher searcher, String outputFile) {
    try {
        Query q_temp = buildQuery(query); //the original query, was working fine alone

        Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
        q = q.rewrite(DirectoryReader.open(bm25IndexDir));
        TopDocs results = searcher.search(q, 10);

        ScoreDoc[] filterScoreDosArray = results.scoreDocs;
        for (int i = 0; i < filterScoreDosArray.length; ++i) {
            int docId = filterScoreDosArray[i].doc;
            Document d = searcher.doc(docId);

            //here, when printing, I see that the document's score is the same as its "boost" value. WHY??
            System.out.println((i + 1) + ". " + d.get("path")+" Score: "+ filterScoreDosArray[i].score);
        }

        return results;
    }
    catch(Exception e) {
        e.printStackTrace();
        return null;
    }
}

//function that builds the query, working fine
public static Query buildQuery(String query) {
    try {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        TokenStream tokenStream = new EnglishAnalyzer().tokenStream("content", query);
        tokenStream.reset();

        while (tokenStream.incrementToken()) {
          CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
          builder.add(new Term("content", charTermAttribute.toString()));
        }

        tokenStream.end(); tokenStream.close();
        builder.setSlop(1000);
        PhraseQuery q = builder.build();

        return q;
    }
    catch(Exception e) {
        e.printStackTrace();
        return null;
    }
}

Answer 1

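As of Lucene 6.5.0:

"Index-time boosts are deprecated. As a replacement, index-time scoring factors should be indexed into a doc value field and combined at query time using e.g. FunctionScoreQuery. (Adrien Grand)"

Instead of using index-time boosts, the recommendation is to encode scoring factors (i.e. length normalization factors) into doc values fields. (See LUCENE-6819.)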

Answer 2

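Regarding the problem in my edit (the boost value completely replacing the search score instead of boosting it), here is what the documentation says about FunctionScoreQuery (note the words "replace or modify"):

"A query that wraps another query, and uses a DoubleValuesSource to replace or modify the wrapped query's score."

So when does it replace, and when does it modify?

It turns out that the code I was using completely replaces the score with the boost value: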

Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query

What I needed to do instead was use the boostByValue function, which modifies the search score (by multiplying the score by the boost value):

Query q = FunctionScoreQuery.boostByValue(q_temp, DoubleValuesSource.fromDoubleField("boost"));

Now it works! Thanks to @EricLavault for the help!
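
To make the difference between the two constructions concrete, here is a minimal plain-Java sketch of the arithmetic only. It does not use Lucene: the class BoostDemo, its two methods, and the sample scores and PageRank values are all hypothetical, chosen purely for illustration. The first method models the constructor, which takes the score from the value source, and the second models boostByValue, which multiplies the query score by the value.

```java
class BoostDemo {
    // Models new FunctionScoreQuery(query, source):
    // the value from the source becomes the score, the query score is ignored.
    static double replaceScore(double queryScore, double boost) {
        return boost;
    }

    // Models FunctionScoreQuery.boostByValue(query, source):
    // the query score is multiplied by the value from the source.
    static double multiplyScore(double queryScore, double boost) {
        return queryScore * boost;
    }

    public static void main(String[] args) {
        // Hypothetical data: two documents with a relevance score and a PageRank.
        String[] names = {"docA", "docB"};
        double[] queryScores = {2.0, 0.5};
        double[] pageRanks = {0.25, 0.5};

        for (int i = 0; i < names.length; i++) {
            // Same boost coefficient as in the question: 1.0 + PageRank.
            double boost = 1.0 + pageRanks[i];
            System.out.println(names[i]
                    + " replaced=" + replaceScore(queryScores[i], boost)
                    + " multiplied=" + multiplyScore(queryScores[i], boost));
        }
    }
}
```

With these made-up numbers, replacing ranks docB first because only the boost counts, while multiplying ranks docA first because the query's relevance still contributes. That matches the symptom in the question, where every score equaled the stored boost.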
