Why does "More Like This" in ElasticSearch not respect TF-IDF order for a single term?

Question

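I have been trying to understand the "More Like This" functionality in ElasticSearch. I have read and re-read the documentation, but I cannot understand why the following behavior occurs.

Basically, I inserted three documents and ran a "More Like This" query with max_query_terms=1, expecting the term with the higher TF-IDF to be used, but that does not seem to be what happens.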

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "dog barks"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat fur"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat naps"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

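Expected output:

The "dog barks" document

Actual output:

The "cat naps" and "cat fur" documents (see also the note on determinism below)

Explanation of the expected output:

The documentation says:

Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query. And the reason would be mostly, according to Lucene scoring formula, due to the terms with the highest tf-idf. Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and could be used within a disjunctive query (or OR) to retrieve similar documents. The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.

Since I specified max_query_terms = 1, only the term of the input document with the highest TF-IDF score should be used in the disjunctive query. In this case the input document has two terms. They have the same term frequency in the input document, but cat occurs twice as often in the corpus, so it has the higher document frequency. The TF-IDF score of dog should therefore be higher than that of cat, so I expected the disjunctive query to be just "message":"dog" and the returned result to be the "dog barks" document.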
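To make that reasoning concrete, here is a small sketch of the arithmetic over the whole three-document corpus. It assumes a classic idf of the form 1 + ln((N + 1) / (df + 1)); the exact formula ES uses for term selection is an assumption on my part, and the names classic_idf and tf_idf_scores are made up for illustration. Any idf that decreases with document frequency would give the same ordering.

```python
import math

# The three indexed documents and the two "like" terms from the question.
corpus = ["dog barks", "cat fur", "cat naps"]
like_terms = ["cat", "dog"]

def classic_idf(doc_freq, num_docs):
    # Assumed classic idf: 1 + ln((N + 1) / (df + 1)).
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))

def tf_idf_scores(terms, docs):
    tokenized = [doc.split() for doc in docs]
    scores = {}
    for term in terms:
        tf = 1  # each term appears once in the input text
        df = sum(term in tokens for tokens in tokenized)
        scores[term] = tf * classic_idf(df, len(docs))
    return scores

scores = tf_idf_scores(like_terms, corpus)
top_term = max(scores, key=scores.get)
print(scores, top_term)
```

Under these assumptions dog (df = 1) scores higher than cat (df = 2), which is why I expected the single selected term to be dog.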

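I am trying to understand what is happening here. Any help is much appreciated. :)

Note on determinism

I re-ran this setup several times. When running the 4 ES commands above (3 POSTs + the MLT GET) after a curl -XDELETE 'http://localhost:9200/samples' command, sometimes I get "cat naps" and "cat fur", other times I get "cat naps", "cat fur" and "dog barks", and a few times I get only "dog barks".

Full output

Earlier I hand-waved about what the GET query returns. To be more precise, here is actual output #1 (happens sometimes):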

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":2,"max_score":0.6931472,"hits":
[{"_index":"samples","_type":"_doc","_id":"UHAoI3IBapDWjHWvsQ0_","_score":0.6931472,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"UXAoI3IBapDWjHWvsQ1c","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #2 (happens sometimes):

{"took":2,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":3,"max_score":0.2876821,"hits":
[{"_index":"samples","_type":"_doc","_id":"VHAtI3IBapDWjHWvvA0B","_score":0.2876821,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"U3AtI3IBapDWjHWvuw3l","_score":0.2876821,"_source":{
   "message": "dog barks"
}},{"_index":"samples","_type":"_doc","_id":"VXAtI3IBapDWjHWvvA0V","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #3 (the rarest of the three):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":1,"max_score":0.9808292,"hits":
[{"_index":"samples","_type":"_doc","_id":"WXAzI3IBapDWjHWvbQ3s","_score":0.9808292,"_source":{
   "message": "dog barks"
}}]}}

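Tried spacing out the inserts and the MLT query

Maybe Elasticsearch is in some odd "processing" state and needs time between documents. So I gave ES some time between inserting each document and before running the GET command.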

filename="testEsOutput-10-incremental.txt"
amount=10
echo "Test-10-incremental"
for i in {1..10}
do
    curl -XDELETE 'http://localhost:9200/samples';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "dog barks"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat fur"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat naps"
    }';
    sleep $amount

    curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }' >> $filename
    printf '\n----\n----\n' >> $filename
done
echo "Done!"

However, this did not affect the non-deterministic output in any meaningful way.

Tried `search_type=dfs_query_then_fetch`

Following this SO post about ES nondeterminism, I tried adding the dfs_query_then_fetch option, i.e.

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/?search_type=dfs_query_then_fetch' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }'

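However, the results are still non-deterministic, and they vary among the same three outcomes.

Additional notes

I tried looking at additional debugging information via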

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_validate/query?rewrite=true' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

But this sometimes outputs

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"message:cat"}]}

and other times

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"like:[cat, dog]"}]}

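So even this output is not deterministic (when run back to back).

Note: tested on ElasticSearch 6.8.8, both locally and in an online REPL. I also tested using an actual indexed document as the input, e.g.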

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/72 -d '{
   "message" : "dog cat"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : {
                "_id" : "72"
            }
            ,
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but got the same "cat naps" and "cat fur" documents.

Answer 1

OK, after a lot of debugging, I tried restricting the index to a single shard, i.e.

curl -XPUT --header 'Content-Type: application/json' 'http://localhost:9200/samples' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 0 
        }
    }
}';

When I did this, I got only the "dog barks" document 100% of the time.
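As a sanity check on the shard explanation, the scores in the three outputs in the question can be reproduced from a BM25-style idf computed over the documents of one shard only. This is a sketch under assumptions: it takes the idf to be ln(1 + (N - df + 0.5) / (df + 0.5)), and it assumes the term-frequency factor is 1 because every document has one occurrence of the term and the same length. The function name bm25_idf is made up for illustration.

```python
import math

def bm25_idf(doc_freq, num_docs):
    # Assumed BM25 idf: ln(1 + (N - df + 0.5) / (df + 0.5)),
    # where N and df are counted on a single shard.
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# (df, N) pairs: one matching document on a shard holding 1, 2 or 3 documents.
alone_on_shard = bm25_idf(1, 1)   # compare with the 0.2876821 scores
two_on_shard = bm25_idf(1, 2)     # compare with the 0.6931472 score
three_on_shard = bm25_idf(1, 3)   # compare with the 0.9808292 score

for label, value in [("N=1", alone_on_shard),
                     ("N=2", two_on_shard),
                     ("N=3", three_on_shard)]:
    print(label, round(value, 6))
```

If these assumptions hold, 0.9808292 for "dog barks" corresponds to all three documents sitting on one shard, while the runs where every hit scores 0.2876821 correspond to each document being alone on its own shard.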

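It seems that even with the search_type=dfs_query_then_fetch option (on a multi-shard index), ES still does not do a fully accurate job. I am not sure what other options I could use to force accurate behavior. Maybe someone else can answer in more depth.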
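My working guess at the mechanism, sketched below, is that each shard picks its top term using only its own document frequencies, and that auto-generated document IDs land on different shards on every run. This is a simulation, not ES code. The classic idf formula, the rule that a term unseen on a shard is skipped (mirroring min_doc_freq = 1), the tie-break that keeps the earlier term, and the names select_term and search are all assumptions for illustration.

```python
import math

def classic_idf(doc_freq, num_docs):
    # Assumed classic idf: 1 + ln((N + 1) / (df + 1)).
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))

def select_term(like_terms, shard_docs):
    """Pick the single best like-term using only this shard's documents."""
    tokenized = [doc.split() for doc in shard_docs]
    best_term, best_score = None, 0.0
    for term in like_terms:
        df = sum(term in tokens for tokens in tokenized)
        if df < 1:  # term never seen on this shard, so it is skipped
            continue
        score = classic_idf(df, len(tokenized))  # tf is 1 for every like-term
        if score > best_score:  # ties keep the earlier term (an assumption)
            best_term, best_score = term, score
    return best_term

def search(like_terms, shards):
    """Return the documents matching the term each shard selected for itself."""
    hits = []
    for shard_docs in shards:
        term = select_term(like_terms, shard_docs)
        hits += [doc for doc in shard_docs if term in doc.split()]
    return sorted(hits)

like = ["cat", "dog"]
one_shard = [["dog barks", "cat fur", "cat naps"]]
spread_out = [["dog barks"], ["cat fur"], ["cat naps"], [], []]
mixed = [["dog barks", "cat fur"], ["cat naps"], [], [], []]

print(search(like, one_shard))
print(search(like, spread_out))
print(search(like, mixed))
```

With these three hypothetical layouts the simulation returns only "dog barks", all three documents, and the two cat documents respectively, which are the same three outcomes seen in the question. If this is the real mechanism, dfs_query_then_fetch would not help, since it would only affect the statistics used for scoring and not the term selection done on each shard.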

Answer 1: 1 2020-05-17 20:20:01
