Why does "More Like This" in Elasticsearch not respect TF-IDF order for a single term?

Question

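I have been trying to understand the "More Like This" functionality in Elasticsearch. I have read and re-read the documentation, but I cannot understand why the following behavior occurs.

Basically, I insert three documents and run a "More Like This" query with max_query_terms=1, expecting the term with the highest TF-IDF to be used, but that does not seem to be the case.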

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "dog barks"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat fur"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat naps"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

Expected output:

The "dog barks" document

Actual output:

The "cat naps" and "cat fur" documents (see also the note on determinism below)

Explanation of the expected output:

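The documentation says:

Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query. And the reason would be mostly, according to Lucene scoring formula, due to the terms with the highest tf-idf. Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and could be used within a disjunctive query (or OR) to retrieve similar documents. The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.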

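Since I specify max_query_terms = 1, only the term from the input document with the highest TF-IDF score should be used in the disjunctive query. In this case the input document has two terms. They have the same term frequency in the input document, but cat occurs in twice as many documents in the corpus, so it has the higher document frequency. The TF-IDF score of dog should therefore be higher than that of cat, so I expect the disjunctive query to be just "message":"dog" and the returned result to be the "dog barks" document.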
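The reasoning above can be checked with a small calculation. This is a minimal sketch, not Elasticsearch's actual code: it assumes a textbook idf of ln(N / df) computed over the whole three-document corpus, and whitespace tokenization in place of the real analyzer. Lucene's own formula differs in detail, but it also ranks rarer terms higher. The names `docs` and `tf_idf` are made up for illustration.

```python
import math

# The three indexed messages, split on whitespace (a simplification of the analyzer).
corpus = ["dog barks", "cat fur", "cat naps"]
docs = [set(text.split()) for text in corpus]

def tf_idf(term, input_terms):
    """Textbook tf-idf: term frequency in the input times ln(N / df) over the corpus."""
    tf = input_terms.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

like_terms = ["cat", "dog"]
scores = {t: tf_idf(t, like_terms) for t in like_terms}
best = max(scores, key=scores.get)
print(scores, best)
```

With corpus-wide statistics, dog (document frequency 1) scores higher than cat (document frequency 2), which is why "dog" is the term I expect to be selected.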

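I am trying to understand what is going on here. Any help is much appreciated. :)

A note on determinism

I tried re-running this setup several times. When running the four ES commands above (3 POSTs + the MLT GET) after a curl -XDELETE 'http://localhost:9200/samples' command, sometimes I get "cat naps" and "cat fur", other times I get "cat naps", "cat fur" and "dog barks", and a few times I even get only "dog barks".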

Full output

Earlier I hand-waved and only described what the output of the GET query was. To be more precise, here is actual output #1 (happens some of the time):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":2,"max_score":0.6931472,"hits":
[{"_index":"samples","_type":"_doc","_id":"UHAoI3IBapDWjHWvsQ0_","_score":0.6931472,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"UXAoI3IBapDWjHWvsQ1c","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #2 (happens some of the time):

{"took":2,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":3,"max_score":0.2876821,"hits":
[{"_index":"samples","_type":"_doc","_id":"VHAtI3IBapDWjHWvvA0B","_score":0.2876821,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"U3AtI3IBapDWjHWvuw3l","_score":0.2876821,"_source":{
   "message": "dog barks"
}},{"_index":"samples","_type":"_doc","_id":"VXAtI3IBapDWjHWvvA0V","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #3 (the rarest of the three):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":1,"max_score":0.9808292,"hits":
[{"_index":"samples","_type":"_doc","_id":"WXAzI3IBapDWjHWvbQ3s","_score":0.9808292,"_source":{
   "message": "dog barks"
}}]}}

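Tried spacing out the inserts and the MLT query

Maybe Elasticsearch was in some odd "processing state" and needed some time between documents. So I gave ES some time between inserting the documents and before running the GET command.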

filename="testEsOutput-10-incremental.txt"
amount=10
echo "Test-10-incremental"
for i in {1..10}
do
    curl -XDELETE 'http://localhost:9200/samples';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "dog barks"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat fur"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat naps"
    }';
    sleep $amount

    curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }' >> $filename
    # printf interprets the escapes; bash's plain echo would write them literally
    printf '\n----\n----\n' >> $filename
done
echo "Done!"

However, this did not seem to affect the non-deterministic output in any meaningful way.

Tried `search_type=dfs_query_then_fetch`

Following this SO post about ES nondeterminism, I tried adding the dfs_query_then_fetch option, i.e.

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/?search_type=dfs_query_then_fetch' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }'

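However, the results are still not deterministic, and they vary between the three outcomes.

Additional notes

I tried looking at additional debug information via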

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_validate/query?rewrite=true' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

But this sometimes outputs

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"message:cat"}]}

and other times

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"like:[cat, dog]"}]}

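So even this output is not deterministic (when run back to back).

Note: tested locally and on an online REPL, both with Elasticsearch 6.8.8. I also tested using an actual indexed document, e.g.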

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/72 -d '{
   "message" : "dog cat"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : {
                "_id" : "72"
            }
            ,
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

But I got the same "cat naps" and "cat fur" documents.

Answer 1

OK, after a lot of debugging, I tried restricting the index to a single shard, i.e.

curl -XPUT --header 'Content-Type: application/json' 'http://localhost:9200/samples' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 0 
        }
    }
}';

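When I do this, I get only the "dog barks" document 100% of the time.

It seems that even with the search_type=dfs_query_then_fetch option (on a multi-shard index), ES still does not produce fully accurate results. I am not sure what other options I could use to force accurate behavior. Maybe someone else can give a more in-depth answer.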
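One possible explanation, offered as an assumption rather than confirmed behavior: each shard may select the MLT terms using only its own document frequencies. Documents are routed to shards by a hash of their auto-generated IDs, so the outcome would then depend on where the three documents land on a given run. The sketch below is a toy model of that idea, not Elasticsearch code. `select_term`, `search`, `outcomes`, the textbook idf and the tie-break toward the first listed term are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

DOCS = ["dog barks", "cat fur", "cat naps"]
LIKE_TERMS = ["cat", "dog"]

def select_term(shard_docs):
    """Pick the single like-term with the highest shard-local idf (max_query_terms=1)."""
    best_term, best_idf = None, -1.0
    for term in LIKE_TERMS:
        df = sum(1 for d in shard_docs if term in d.split())
        if df < 1:  # min_doc_freq=1: skip terms this shard has never seen
            continue
        idf = math.log(len(shard_docs) / df)
        if idf > best_idf:  # strict '>' sends ties to the first listed term
            best_term, best_idf = term, idf
    return best_term

def search(placement, num_shards):
    """Union of per-shard hits when every shard selects its own query term."""
    hits = set()
    for shard in range(num_shards):
        shard_docs = [d for d, s in zip(DOCS, placement) if s == shard]
        term = select_term(shard_docs)
        if term is not None:
            hits.update(d for d in shard_docs if term in d.split())
    return frozenset(hits)

def outcomes(num_shards):
    """Count the result sets over every possible placement of the documents."""
    placements = product(range(num_shards), repeat=len(DOCS))
    return Counter(search(p, num_shards) for p in placements)

for shards in (1, 5):
    for hits, count in outcomes(shards).items():
        print(shards, sorted(hits), count)
```

In this toy model a single shard always returns only "dog barks", while five shards produce the same three result sets seen in the question, with "dog barks" alone being the rarest. If the model is right, search_type=dfs_query_then_fetch would not help, because it would change how hits are scored but not which term each shard selects.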

Solution 1
1 2020-05-17 20:20:01
