Elasticsearch: Search over most frequent matches / terms without TF or IDF adjustment
We are working on a text-based search (via the famous "Type your search here" input box) that computes the score over multiple fields and shows the best results. It's basically a bool query with a mixture of "term" and "match" queries over many different fields (using fuzziness, ngrams, edge ngrams and others).
We want the best results (the most "popular" ones) to show up first (and thus get the highest score). However, the default TF-IDF algorithm of Lucene gives us the exact opposite. Imagine you look for a vendor that exists in 30% of all index entries. It will have a very low IDF and be ranked very low. We want the exact opposite of that: give us the most frequent first(!).
Trying our luck with the "cross-field" query did not work out, since we want to combine different query types with "bool".
Now, what we "found out" is that using Okapi BM25 with k1=0 and b=0 almost(?) behaves like a similarity that ignores TF (term frequency) and IDF (inverse document frequency). However, we are unsure whether this really is the way to go.
Can you give us some feedback on that, please?
Is this the way to go, or is there something better for our "problem" waiting to be discovered?
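As a way to sanity-check the BM25 idea, here is a small sketch of the textbook per-term BM25 formula in plain Python. The function names are made up for illustration, and the IDF form is the one commonly attributed to Lucene's BM25 similarity, so treat the exact numbers as an assumption. Under that formula, k1=0 makes the TF part equal to 1 (and b then has no effect), while the IDF factor still remains in the score.

```python
import math


def lucene_style_idf(doc_count: int, doc_freq: int) -> float:
    """IDF in the smoothed form commonly used with BM25 (assumed here)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of a single term in a single document."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    return lucene_style_idf(doc_count, doc_freq) * tf_part


# Same term (present in 3 of 4 docs), very different tf and document length.
short_doc = bm25_term_score(tf=1, doc_len=5, avg_doc_len=20,
                            doc_count=4, doc_freq=3, k1=0, b=0)
long_doc = bm25_term_score(tf=9, doc_len=80, avg_doc_len=20,
                           doc_count=4, doc_freq=3, k1=0, b=0)
# A rarer term (present in 1 of 4 docs) with the same settings.
rare_term = bm25_term_score(tf=1, doc_len=5, avg_doc_len=20,
                            doc_count=4, doc_freq=1, k1=0, b=0)

print(short_doc == long_doc)   # TF and length no longer matter
print(rare_term > short_doc)   # IDF still favours the rarer term
```

If this reading of the formula is right, k1=0 alone would not push the more frequent term to the top; it would only stop TF and field length from influencing the score.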
Let me try to make my question clearer (sorry for any confusion):
Let's say we have an index of cars...
{id: 1, vendor: Opel, model: Astra, engine: 90hp gasoline}
{id: 2, vendor: Opel, model: Astra, engine: 100hp diesel}
{id: 3, vendor: Opel, model: Astra, engine: 120hp gasoline}
{id: 4, vendor: Chevrolet, model: Astro, engine: 120hp gasoline}
We do a "full text search" over the current user input "astr".
All fields (vendor, model and engine) are analyzed using the "edge ngram" analyzer {min:2, max:10} to support prefix search.
the input "astr" would match all entries #1 - #4 (it's the beginning of "Astra" and "Astro", so all entries would contain an edge ngram match)
the IDF of "Astra" is log(4/3) ~= 0.288
the IDF of "Astro" is log(4/1) ~= 1.386
so #4 would be ranked better due to the IDF
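The IDF arithmetic above can be reproduced with a few lines of Python. This uses the simple log(N / df) form with the natural logarithm, as in the example; the function name is hypothetical, and real Lucene similarities use smoothed variants, so the exact values in a live index will differ.

```python
import math


def simple_idf(total_docs: int, docs_with_term: int) -> float:
    """Plain inverse document frequency: ln(N / df)."""
    return math.log(total_docs / docs_with_term)


idf_astra = simple_idf(4, 3)  # "Astra" appears in 3 of 4 documents
idf_astro = simple_idf(4, 1)  # "Astro" appears in 1 of 4 documents

print(f"Astra: {idf_astra:.4f}")
print(f"Astro: {idf_astro:.4f}")
print("rarer term wins:", idf_astro > idf_astra)
```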
However, we want the exact opposite: the "more frequent" (= "more popular") car should be ranked higher.
Note: the "cross fields" query will not be sufficient, since we combine several different queries (fuzzy, edge ngram, raw) into one large bool query.
It sounds like you want to follow this general process:
Solution 1 (most flexible, least performant)
You can get the information for #2 using a terms aggregation on the vendor field.
Then you can re-query with the necessary derived boosts (costing a second round-trip).
OR
Solution 2 (least flexible, most performant)
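A rough sketch of that two-step flow, written as plain Python functions that only build request bodies. Everything named here is an assumption for illustration: the `vendor.raw` keyword sub-field (a terms aggregation needs an un-analyzed field rather than the edge-ngram one), the aggregation name `vendors`, the boost formula, and the hand-written sample response.

```python
def first_pass_body(user_query: dict, vendor_field: str = "vendor.raw") -> dict:
    """Round trip 1: run the query but only ask for vendor counts."""
    return {
        "size": 0,
        "query": user_query,
        "aggs": {"vendors": {"terms": {"field": vendor_field, "size": 10}}},
    }


def boosted_body(user_query: dict, agg_response: dict,
                 vendor_field: str = "vendor.raw") -> dict:
    """Round trip 2: same query, plus should-clauses boosting popular vendors."""
    buckets = agg_response["aggregations"]["vendors"]["buckets"]
    total = sum(b["doc_count"] for b in buckets) or 1
    should = [
        {"term": {vendor_field: {"value": b["key"],
                                 "boost": 1.0 + b["doc_count"] / total}}}
        for b in buckets
    ]
    return {"query": {"bool": {"must": [user_query], "should": should}}}


user_query = {"match": {"model": "astr"}}
# Hand-written stand-in for what the first search might return.
sample_response = {"aggregations": {"vendors": {"buckets": [
    {"key": "Opel", "doc_count": 3},
    {"key": "Chevrolet", "doc_count": 1},
]}}}

second = boosted_body(user_query, sample_response)
boosts = {c["term"]["vendor.raw"]["value"]: c["term"]["vendor.raw"]["boost"]
          for c in second["query"]["bool"]["should"]}
print(boosts)
```

How strongly the boost should scale with the bucket count is a tuning decision; the linear share used here is only one possible choice.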
If you're content to let vendor popularity trump _score, you can do the following:
Terms aggregation on vendor
Top Hits sub-aggregation sorted by _score descending.
Then your [astr] query results within the aggregation result will look like this:
[Opel bucket]
Astra 90hp
Astra 100hp diesel
Astra 120hp
Ascona 144hp (if you had fuzziness 2)
Ascona 230hp (if you had fuzziness 2)
[Chevrolet bucket]
Astro 120hp
Alero 140hp (if you had fuzziness 2)
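A request body for this could look roughly like the sketch below, again built as a Python dict. The `vendor.raw` keyword sub-field, the aggregation names and the sizes are assumptions, not something given in the question. A terms aggregation orders buckets by document count by default, which is what puts the most popular vendor first.

```python
import json


def vendor_buckets_body(user_query: dict, vendor_field: str = "vendor.raw",
                        vendors: int = 10, hits_per_vendor: int = 5) -> dict:
    """Group matches by vendor and keep the best-scoring hits per vendor."""
    return {
        "size": 0,  # the hits are read from the aggregation, not the top level
        "query": user_query,
        "aggs": {
            "by_vendor": {
                "terms": {"field": vendor_field, "size": vendors},
                "aggs": {
                    "best_matches": {
                        "top_hits": {
                            "size": hits_per_vendor,
                            "sort": [{"_score": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = vendor_buckets_body({"match": {"model": "astr"}})
print(json.dumps(body, indent=2))
```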
If you want to use document frequency to boost your results, try rolling your own script_score function inside a function_score clause. You can access the document frequency of a term inside your scoring function via term statistics.
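To illustrate only the arithmetic such a scoring function might perform, here is a plain-Python sketch over the four sample cars from the question. It is not Elasticsearch scripting code: how a script reads term statistics depends on the Elasticsearch version, and the `log(1 + df)` boost and the function name are assumptions chosen for the example.

```python
import math
from collections import Counter

cars = [
    {"id": 1, "vendor": "Opel", "model": "Astra"},
    {"id": 2, "vendor": "Opel", "model": "Astra"},
    {"id": 3, "vendor": "Opel", "model": "Astra"},
    {"id": 4, "vendor": "Chevrolet", "model": "Astro"},
]

# Document frequency of each model value across the sample index.
model_df = Counter(car["model"] for car in cars)


def popularity_score(base_score: float, model: str) -> float:
    """Scale a base relevance score up with the term's document frequency."""
    return base_score * math.log(1 + model_df[model])


# Assume every car matched "astr" equally well (base score 1.0).
ranked = sorted(cars, key=lambda c: popularity_score(1.0, c["model"]), reverse=True)
print([c["id"] for c in ranked])
```

With this kind of boost the frequent "Astra" entries outrank the single "Astro" entry, which is the inverse of what IDF does.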
You may discover that an unintended consequence of this approach is that common/generic terms like Corp, Solutions, Computer, Inc, etc. will have an outsize influence on your score if you don't explicitly scrub them out as stopwords.