Why doesn't "More Like This" in ElasticSearch respect TF-IDF order for a single term?

Question

I've been trying to grok the "More Like This" functionality in ElasticSearch. I've read and re-read the documentation, but I'm having trouble understanding why the following behavior occurs.

Basically, I insert three documents and run a "More Like This" query with max_query_terms=1, expecting the term with the highest TF-IDF to be used, but that doesn't seem to be the case.

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "dog barks"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat fur"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat naps"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

Expected output:

"dog barks" document

Actual output:

"cat naps" and "cat fur" documents (also see the note about determinism below)

Explanation for expected output:

The documentation says:

Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query. And the reason would be mostly, according to Lucene scoring formula, due to the terms with the highest tf-idf. Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and could be used within a disjunctive query (or OR) to retrieve similar documents. The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.

Since I specified max_query_terms = 1, only the term from the input document with the highest TF-IDF score should be used in the disjunctive query. In this case, the input document has two terms. They have the same term frequency in the input document, but cat appears in twice as many documents in the corpus, so it has a higher document frequency and therefore a lower inverse document frequency. Therefore, dog should have a higher TF-IDF score than cat, and I'd expect the disjunctive query to be just "message":"dog" and the returned result to be the "dog barks" document.
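As a rough check of this reasoning, the sketch below computes an inverse document frequency for each term over the whole three-document corpus. It assumes the classic Lucene-style form idf = 1 + ln((N + 1) / (n + 1)), where N is the number of documents and n the number of documents containing the term. The function name classic_idf is made up for this example, and the exact formula ElasticSearch uses when picking terms may differ. Since both terms occur once in the input, comparing idf alone is enough.

```shell
# Classic Lucene-style idf, assumed here for illustration only:
#   idf = 1 + ln((N + 1) / (n + 1))
# $1 = number of documents (N), $2 = documents containing the term (n)
classic_idf() {
    LC_ALL=C awk -v N="$1" -v n="$2" \
        'BEGIN { printf "%.6f\n", 1 + log((N + 1) / (n + 1)) }'
}

corpus_size=3   # "dog barks", "cat fur", "cat naps"

echo "cat: $(classic_idf "$corpus_size" 2)"   # cat occurs in 2 documents
echo "dog: $(classic_idf "$corpus_size" 1)"   # dog occurs in 1 document
```

With N = 3 this gives roughly 1.29 for cat and 1.69 for dog, so over the whole corpus dog ranks first, which is the ordering the expectation above relies on.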

I'm trying to understand what's going on here. Any help is greatly appreciated. :)

Note about Determinism

I tried rerunning this setup a few times. When running the 4 ES commands above (3 POST + MLT GET) after a curl -XDELETE 'http://localhost:9200/samples' command, sometimes I'd get "cat naps" and "cat fur", other times I'd get "cat naps", "cat fur", and "dog barks", and a few times I'd even get just "dog barks".

Full output

Earlier I handwaved and just said what the outputs of the GET query were. Let me be more precise.

Actual output #1 (happens some of the time):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":2,"max_score":0.6931472,"hits":
[{"_index":"samples","_type":"_doc","_id":"UHAoI3IBapDWjHWvsQ0_","_score":0.6931472,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"UXAoI3IBapDWjHWvsQ1c","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #2 (happens some of the time):

{"took":2,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":3,"max_score":0.2876821,"hits":
[{"_index":"samples","_type":"_doc","_id":"VHAtI3IBapDWjHWvvA0B","_score":0.2876821,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"U3AtI3IBapDWjHWvuw3l","_score":0.2876821,"_source":{
   "message": "dog barks"
}},{"_index":"samples","_type":"_doc","_id":"VXAtI3IBapDWjHWvvA0V","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #3 (happens most rarely of the three):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":1,"max_score":0.9808292,"hits":
[{"_index":"samples","_type":"_doc","_id":"WXAzI3IBapDWjHWvbQ3s","_score":0.9808292,"_source":{
   "message": "dog barks"
}}]}}
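The three distinct score values above can be reproduced with a short calculation, which hints at where the variation might come from. The sketch assumes the default BM25 similarity, whose idf is ln(1 + (N - n + 0.5) / (n + 0.5)), with N and n counted per shard rather than per index, and a term-frequency factor of 1 (every document has two terms and contains the matched term once). The helper name bm25_idf is made up for this example.

```shell
# BM25-style idf, assumed to be evaluated with shard-local statistics:
#   idf = ln(1 + (N - n + 0.5) / (n + 0.5))
# $1 = documents on the shard (N), $2 = documents on the shard containing the term (n)
bm25_idf() {
    LC_ALL=C awk -v N="$1" -v n="$2" \
        'BEGIN { printf "%.6f\n", log(1 + (N - n + 0.5) / (n + 0.5)) }'
}

echo "1 document on the shard, 1 match:  $(bm25_idf 1 1)"
echo "2 documents on the shard, 1 match: $(bm25_idf 2 1)"
echo "3 documents on the shard, 1 match: $(bm25_idf 3 1)"
```

Under those assumptions, 0.2876821 corresponds to a shard holding a single document, 0.6931472 to a shard holding two documents of which one matches, and 0.9808292 to all three documents sharing one shard. That would suggest the results depend on how the three documents happen to be distributed over the 5 shards on each run.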

Tried spacing out insertions and MLT more

Maybe ElasticSearch is in a weird "processing state" and needs a bit of time between documents. So I gave ES some time between inserting the documents and before running the GET command.

filename="testEsOutput-10-incremental.txt"
amount=10
echo "Test-10-incremental"
for i in {1..10}
do
    curl -XDELETE 'http://localhost:9200/samples';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "dog barks"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat fur"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat naps"
    }';
    sleep $amount

    curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }' >> $filename
    # printf interprets \n portably; plain echo in bash would write the escapes literally
    printf '\n----\n' >> "$filename"
done
echo "Done!"

However, this did not seem to affect the non-deterministic output in any meaningful way.

Tried `search_type=dfs_query_then_fetch`

Following this SO post about ES nondeterminism, I tried adding the dfs_query_then_fetch option, i.e.

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/?search_type=dfs_query_then_fetch' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }'

but still, the results were not deterministic and varied between the three outputs shown above.

Additional Notes

I tried looking at additional debug information via

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_validate/query?rewrite=true' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but this sometimes output

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"message:cat"}]}

and other times

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"like:[cat, dog]"}]}

so the output wasn't even deterministic (running it back to back).
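One possible reason for the back-to-back difference: as far as I know, the validate API runs on a single, randomly selected shard unless all_shards=true is passed, so each call may show the rewrite from a different shard. The variant below assumes that parameter is available on this version (6.8.8) and needs a running cluster with the samples index.

```shell
# Ask every shard of the index for its own rewritten query (requires a running cluster).
# all_shards=true is assumed to be supported by the ElasticSearch version in use.
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_validate/query?rewrite=true&all_shards=true' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}'
```

If the explanations differ from shard to shard (for example message:cat on one and message:dog on another), that would point to the term selection happening separately on each shard.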

Note: Tested on ElasticSearch 6.8.8, both locally and in an online REPL. Also tested using an actual document, e.g.

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/72 -d '{
   "message" : "dog cat"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : {
                "_id" : "72"
            }
            ,
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but got the same "cat naps" and "cat fur" documents.

Answer 1

Okay, after much debugging, I tried limiting the index to just one shard, i.e.

curl -XPUT --header 'Content-Type: application/json' 'http://localhost:9200/samples' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 0 
        }
    }
}';

When I did this, I got only the "dog barks" document, 100% of the time.

It seems that even when using the search_type=dfs_query_then_fetch option (with a multi-shard index), ES still wasn't doing a perfectly accurate job. I'm not sure what other options I could use to force accurate behavior. Maybe someone else can answer with more insight.
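To see whether document placement is what differs between runs on a multi-shard index, one option is to list the per-shard document counts right after indexing. This is only a diagnostic sketch: it assumes the index is still called samples and that the cluster is reachable on localhost:9200.

```shell
# List each shard of the samples index with its document count (requires a running cluster).
curl -XGET 'http://localhost:9200/_cat/shards/samples?v&h=index,shard,prirep,docs'
```

Comparing these counts with the hits of the MLT query would show whether the differing results line up with how the three documents were spread across shards on that run.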
