Elasticsearch: Search over most frequent matches / terms without TF or IDF adjustment

We are working on a text-based search (via the familiar "Type your search here" input box) that computes a score over multiple fields and shows the best results. It's basically a bool query with a mixture of "term" and "match" clauses over many different fields (using fuzziness, ngrams, edge ngrams and others).

We want the best results (meaning the most "popular") to show up first (and thus get the highest score). However, Lucene's default TF-IDF scoring gives us the exact opposite. Imagine you look for a vendor that appears in 30% of all index entries. Because the term is so common, it gets a very low IDF and is ranked very low. We want the exact opposite: give us the most frequent first(!).

Trying our luck with the "cross_fields" query did not work out, since we want to combine different query types with "bool".

Now, what we "found out" is that using Okapi BM25 with k1=0 and b=0 almost(?) behaves like a similarity that ignores TF (term frequency) and IDF (inverse document frequency). However, we are unsure whether this really is the way to go.
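One way to check what k1=0 and b=0 actually remove is to evaluate the textbook BM25 formula by hand. The sketch below is not Lucene's code: it assumes the commonly cited form with idf = ln(1 + (N - n + 0.5) / (n + 0.5)), and the function names are made up for illustration. Under that assumption, k1=0 collapses the term-frequency factor to 1 regardless of frequency or field length, so TF and length normalization drop out, but the IDF factor is still multiplied in.

```python
import math

def bm25_idf(doc_count, doc_freq):
    # Assumed textbook BM25 IDF; always positive, larger for rarer terms.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(freq, field_len, avg_field_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    # Length normalization; becomes 0 when k1 == 0.
    norm = k1 * (1 - b + b * field_len / avg_field_len)
    # Saturating term-frequency factor; equals 1 when k1 == 0 and freq > 0.
    tf_part = freq * (k1 + 1) / (freq + norm)
    return bm25_idf(doc_count, doc_freq) * tf_part

# Same term, very different frequency and field length: identical score.
short_doc = bm25_term_score(1, 5, 10, doc_count=4, doc_freq=3, k1=0, b=0)
long_doc = bm25_term_score(7, 50, 10, doc_count=4, doc_freq=3, k1=0, b=0)

# Rare term vs. common term: the rare one still scores higher.
common = bm25_term_score(1, 5, 10, doc_count=4, doc_freq=3, k1=0, b=0)
rare = bm25_term_score(1, 5, 10, doc_count=4, doc_freq=1, k1=0, b=0)
```

If this form matches what the index really uses, the setting would flatten TF but would not by itself stop rare terms from outranking common ones.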

Can you give us some feedback on that, please?

Is this the way to go, or is there a better solution to our "problem" waiting to be discovered?


Update

I'll try to make my question clearer (sorry for any confusion):

Let's say we have an index of cars...

{id: 1, vendor: Opel, model: Astra, engine: 90hp gasoline}
{id: 2, vendor: Opel, model: Astra, engine: 100hp diesel}
{id: 3, vendor: Opel, model: Astra, engine: 120hp gasoline}
{id: 4, vendor: Chevrolet, model: Astro, engine: 120hp gasoline}

We do a "full text search" over the current user input "astr".

All fields (vendor, model + engine) are analyzed using the "edge ngram" analyzer {min:2, max:10} to support prefix search.

The input "astr" would match all entries #1 - #4 (it's the beginning of "Astra" and "Astro", so all entries would contain an edge ngram match).

The IDF of "Astra" is log(4/3) ≈ 0.288.

The IDF of "Astro" is log(4/1) ≈ 1.386.

So #4 would be ranked higher due to the IDF.
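The arithmetic can be reproduced with a few lines of Python. This is only a sketch of the simplified formula used in this example, ln(N / df) counted over the whole model value; it is not what Lucene computes internally, and the helper name `simple_idf` is invented for illustration.

```python
import math

cars = [
    {"id": 1, "vendor": "Opel", "model": "Astra", "engine": "90hp gasoline"},
    {"id": 2, "vendor": "Opel", "model": "Astra", "engine": "100hp diesel"},
    {"id": 3, "vendor": "Opel", "model": "Astra", "engine": "120hp gasoline"},
    {"id": 4, "vendor": "Chevrolet", "model": "Astro", "engine": "120hp gasoline"},
]

def simple_idf(docs, field, value):
    # Document frequency: how many entries carry this exact value.
    doc_freq = sum(1 for doc in docs if doc[field] == value)
    # Simplified IDF, ln(N / df); rarer values get larger numbers.
    return math.log(len(docs) / doc_freq)

idf_astra = simple_idf(cars, "model", "Astra")  # ln(4/3)
idf_astro = simple_idf(cars, "model", "Astro")  # ln(4/1)
```

Because "Astro" occurs in only one of the four entries, its IDF is several times larger than that of "Astra", which is why entry #4 ends up on top.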

However, we want the exact opposite: the "more frequent" (= "more popular") car should be ranked higher.

Note: the "cross_fields" query will not be sufficient, since we combine several different queries (fuzzy, edge ngram, raw) into one large bool query.

It sounds like you want to follow this general process:

  1. Run a complex, custom search query.
  2. Examine the results to determine how much each vendor dominates within the result set.
  3. Reorder the results, boosting cars with more dominant vendors.
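Steps 2 and 3 can be illustrated entirely on the client side. The sketch below assumes the hits have already been fetched and flattened into plain dicts with hypothetical `vendor` and `score` keys; it is not tied to any Elasticsearch client.

```python
from collections import Counter

def reorder_by_vendor_dominance(hits):
    # Step 2: count how often each vendor appears in the result set.
    vendor_counts = Counter(hit["vendor"] for hit in hits)
    # Step 3: most dominant vendor first, original score as tie-breaker.
    return sorted(
        hits,
        key=lambda hit: (-vendor_counts[hit["vendor"]], -hit["score"]),
    )

hits = [
    {"id": 4, "vendor": "Chevrolet", "score": 1.386},
    {"id": 1, "vendor": "Opel", "score": 0.288},
    {"id": 2, "vendor": "Opel", "score": 0.288},
    {"id": 3, "vendor": "Opel", "score": 0.288},
]
reordered = reorder_by_vendor_dominance(hits)
```

This only sees the hits that were actually returned, so for large result sets the counts would have to come from the search engine itself.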

Solution 1 (most flexible, least performant)

You can get the information for #2 using a terms aggregation on the vendor field.

Then you can re-query with the necessary derived boosts (costing a second round-trip)

OR
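A rough sketch of the two round-trips, written as Python functions that only build request bodies. Several things here are assumptions: the `vendor.raw` sub-field (a non-analyzed copy of vendor, since aggregating on an edge-ngram field would return ngrams), the aggregation name `vendors`, and the boost formula, which is just one possible choice. The boosts are wrapped in `constant_score` so that they are not scaled by IDF again.

```python
def vendor_counts_request(user_query, max_vendors=10):
    # Round-trip 1: no hits, only vendor counts for the matching documents.
    return {
        "size": 0,
        "query": user_query,
        "aggs": {
            "vendors": {"terms": {"field": "vendor.raw", "size": max_vendors}}
        },
    }

def boosted_request(user_query, buckets):
    # Round-trip 2: same query plus one constant boost per vendor,
    # proportional to that vendor's share of the matches.
    total = sum(bucket["doc_count"] for bucket in buckets) or 1
    boosts = [
        {
            "constant_score": {
                "filter": {"term": {"vendor.raw": bucket["key"]}},
                "boost": 1 + bucket["doc_count"] / total,
            }
        }
        for bucket in buckets
    ]
    return {"query": {"bool": {"must": [user_query], "should": boosts}}}

# Buckets as they would appear under aggregations.vendors.buckets.
example_buckets = [
    {"key": "Opel", "doc_count": 3},
    {"key": "Chevrolet", "doc_count": 1},
]
example_query = {"match": {"model": "astr"}}
second_request = boosted_request(example_query, example_buckets)
```

How strongly the vendor share should weigh against the text score is a tuning question; the `1 + share` formula above is only a placeholder.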

Solution 2 (least flexible, most performant)

If you're content to let vendor popularity trump _score, you can do the following:

  • Run a zero-result query (your current fuzzy match query with size set to 0)
  • ... with a Terms aggregation on vendor
  • ... ... with a Top Hits sub-aggregation sorted by _score descending.
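The three bullets above could translate into a request body along these lines. This is a sketch under assumptions: `vendor.raw` stands for a non-analyzed copy of the vendor field, and the aggregation names `by_vendor` and `best_matches` are arbitrary. The terms aggregation orders buckets by document count by default, which is what puts the most frequent vendor first.

```python
def grouped_by_vendor_request(user_query, max_vendors=10, hits_per_vendor=5):
    return {
        "size": 0,  # no top-level hits; everything comes from the buckets
        "query": user_query,
        "aggs": {
            "by_vendor": {
                "terms": {"field": "vendor.raw", "size": max_vendors},
                "aggs": {
                    "best_matches": {
                        "top_hits": {
                            "size": hits_per_vendor,
                            "sort": [{"_score": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }

def flatten_buckets(response):
    # Walk the buckets in the order returned and collect their hits.
    ordered = []
    for bucket in response["aggregations"]["by_vendor"]["buckets"]:
        ordered.extend(bucket["best_matches"]["hits"]["hits"])
    return ordered

request_body = grouped_by_vendor_request({"match": {"model": "astr"}})
```

`flatten_buckets` turns the bucketed response back into a single list, vendor by vendor, for display.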

Then your [astr] query results within the aggregation result will look like this:

[Opel bucket]
Astra 90hp
Astra 100hp diesel
Astra 120hp
Ascona 144hp (if you had fuzziness 2)
Ascona 230hp (if you had fuzziness 2)

[Chevrolet bucket]
Astro 120hp
Alero 140hp (if you had fuzziness 2)

If you want to use document frequency to boost your results, try rolling your own script_score function inside a function_score clause. You can access the document frequency of a term inside your scoring function via term statistics.

You may discover that an unintended consequence of this approach is that common/generic terms like Corp, Solutions, Computer, Inc, etc. will have an outsize influence on your score if you don't explicitly scrub them out as stopwords.
