Elasticsearch native script: evaluating field values of indexed documents

Question

I'm trying to modify the Cosine Similarity Script from imotov on GitHub. In his script, docWeightSum only takes the term frequency (tf) of terms that are in the query, not all the terms in the document itself.

Take the example below. The docWeightSum would be 9 (4 for "I", 4 for "am", 1 for "Sam"). What I want docWeightSum to be is 10 (add 1 for "ham"), because I want to normalize the dot product by the magnitudes of both vectors.

doc: "I am am I ham Sam"

query: "Sam I am"
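
The arithmetic in this example can be checked with a short sketch. It assumes plain whitespace tokenization with no analyzer (so it is case-sensitive), and the helper names `term_frequencies` and `sum_squared_tf` are made up for illustration; they are not part of imotov's script or of Elasticsearch.

```python
from collections import Counter
import math

def term_frequencies(text):
    """Count whitespace-separated tokens (no analysis, case-sensitive)."""
    return Counter(text.split())

def sum_squared_tf(tf, terms=None):
    """Sum of squared term frequencies, optionally restricted to `terms`."""
    if terms is None:
        terms = tf.keys()
    return sum(tf[t] ** 2 for t in terms)

doc_tf = term_frequencies("I am am I ham Sam")
query_tf = term_frequencies("Sam I am")

# Only the terms that also occur in the query: 4 + 4 + 1
query_only = sum_squared_tf(doc_tf, query_tf.keys())
# Every term in the document, including "ham": 4 + 4 + 1 + 1
all_terms = sum_squared_tf(doc_tf)

# Cosine similarity: dot product divided by both vector magnitudes
dot = sum(doc_tf[t] * query_tf[t] for t in query_tf)
cosine = dot / (math.sqrt(all_terms) * math.sqrt(sum_squared_tf(query_tf)))

print(query_only, all_terms, round(cosine, 4))
```

The first number is the value the existing script effectively uses, and the second is the full-document value asked about here.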

So I actually have two questions, given that I index documents into Elasticsearch like this:

POST /termscore/doc
{
   "text": "I am am I ham",
   "docWeightSum": 9
}

Is there an existing API to get the sum of squares of all tf values for each indexed document, or to get the tf of terms in the document that are not in the query? If not, how can I compute this sum of squares?
If I precompute the sum of squared tf values for each document and put it into Elasticsearch along with the document content, as in the example above, how can I access that "docWeightSum" value when computing the score?

I am using Elasticsearch 1.7.

Thanks,

Answer 1

To answer your question: it's possible, but it would be very inefficient to calculate docWeightSum at runtime. So, assuming that you precompute the value and index it in a separate field, you can access it from a native script using the doc lookup mechanism. If your calculations are not very complex, you might be able to get by with field_value_factor in a function_score query and avoid writing your own script altogether.
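
As a rough illustration of the function_score route, the sketch below only builds a request body; it does not send anything to a cluster. It assumes a hypothetical numeric field `docMagnitude` that holds the precomputed square root of docWeightSum (the question only indexes docWeightSum itself), and the function name `normalized_query` is invented for this example. Note that this merely divides whatever score the match query produces by the stored value; with the default similarity that score is not a raw dot product, so the result is not a true cosine.

```python
import json

def normalized_query(text, magnitude_field="docMagnitude"):
    """Build a function_score body that divides the match score by a stored field."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"text": text}},
                "field_value_factor": {
                    "field": magnitude_field,   # hypothetical precomputed field
                    "modifier": "reciprocal",   # use 1 / field value as the factor
                },
                "boost_mode": "multiply",       # query score * factor
            }
        }
    }

body = normalized_query("Sam I am")
print(json.dumps(body, indent=2))
```

The printed JSON could then be posted to the index's `_search` endpoint.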

That said, I suspect you are asking the wrong question. Instead of trying to implement this as a scoring script, I would suggest looking into creating your own custom SimilarityProvider. You will most likely find that most of the constructs you are trying to shoehorn into a score script are already there and are much easier to implement and use.

Answer 1: 2016-01-23 00:44:23
