
Elasticsearch: Search over most frequent matches / terms without TF or IDF adjustment

We are working on a text-based search (via the famous "Type your search here" input box) that computes the score over multiple fields and shows the best results. It's basically a bool query with a mixture of "term" and "match" clauses over many different fields (using fuzziness, ngrams, edge ngrams and others).

We want the best results (the most "popular" ones) to show up first, i.e. get the highest score. However, Lucene's default TF-IDF similarity gives us the exact opposite. Imagine you look for a vendor that appears in 30% of all index entries. Its term will have a very low IDF, so its matches will be ranked very low. We want the exact opposite of that: give us the most frequent first(!).

Trying our luck with the "cross-field" query did not work out, since we want to combine different query types with "bool".

Now, what we "found out" is that using Okapi BM25 with k1=0 and b=0 almost(?) behaves like a similarity that ignores TF (term frequency) and IDF (inverse document frequency). However, we are unsure whether this really is the way to go.
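To make the "almost(?)" concrete, here is a minimal sketch of the classic BM25 per-term formula in plain Python. It is not Lucene's actual code: the function name `bm25_term_score` and the sample numbers are made up for illustration, and the IDF variant is the commonly cited Lucene-style one. Under those assumptions, k1=0 and b=0 remove the influence of term frequency and document length, but the IDF factor is still there:

```python
import math

def bm25_term_score(tf, doc_freq, doc_count, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Classic BM25 score of a single term in a single document (illustrative)."""
    # Lucene-style IDF: always positive, shrinks as the term gets more common.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Document length normalisation; b=0 switches it off.
    norm = 1 - b + b * doc_len / avg_doc_len
    # TF saturation; with k1=0 this is tf / tf == 1 for any tf > 0.
    tf_part = tf * (k1 + 1) / (tf + k1 * norm)
    return idf * tf_part

# With k1=0 and b=0, term frequency and document length no longer matter...
rare_once = bm25_term_score(tf=1, doc_freq=1, doc_count=4, doc_len=5, avg_doc_len=5, k1=0, b=0)
rare_many = bm25_term_score(tf=9, doc_freq=1, doc_count=4, doc_len=50, avg_doc_len=5, k1=0, b=0)
# ...but a common term still scores lower than a rare one, because IDF remains.
common = bm25_term_score(tf=1, doc_freq=3, doc_count=4, doc_len=5, avg_doc_len=5, k1=0, b=0)

print(rare_once, rare_many, common)
```

If this sketch matches what the similarity really does, k1=0 and b=0 alone would flatten TF but would not stop rare terms from outranking frequent ones.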

Can you give us some feedback on that, please?

Is this the way to go, or is there a better solution to our "problem" waiting to be discovered?


Update

Let me try to make my question clearer (sorry for any confusion):

Let's say we have an index of cars...

{id: 1, vendor: Opel, model: Astra, engine: 90hp gasoline}
{id: 2, vendor: Opel, model: Astra, engine: 100hp diesel}
{id: 3, vendor: Opel, model: Astra, engine: 120hp gasoline}
{id: 4, vendor: Chevrolet, model: Astro, engine: 120hp gasoline}

We do a "full text search" over the current user input "astr".

All fields (vendor, model + engine) are analyzed using the "edge ngram" analyzer {min:2, max:10} to support prefix search.

The input "astr" would match all entries #1 - #4 (it's the beginning of "Astra" and "Astro", so all entries would contain an edge ngram match).

The IDF of "Astra" is log(4/3) ~= 0.288

The IDF of "Astro" is log(4/1) ~= 1.386

So #4 would be ranked better due to the IDF.
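The arithmetic above can be checked with a few lines of Python. This assumes the plain log(N / df) form of IDF with the natural logarithm, as used in the example; the variable names are only illustrative:

```python
import math

total_docs = 4
# How many of the 4 cars contain each model term.
doc_freq = {"astra": 3, "astro": 1}

# Plain IDF as used in the example: natural log of (N / df).
idf = {term: math.log(total_docs / df) for term, df in doc_freq.items()}

for term, value in idf.items():
    print(f"{term}: {value:.3f}")
# astra: 0.288
# astro: 1.386
```

The rarer "astro" gets the larger weight, which is exactly the ordering we do not want.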

However, we want the exact opposite: The "more frequent" (= "more popular") car should be ranked higher.

Note: the "cross fields" query will not be sufficient, since we combine several different queries (fuzzy, edge ngram, raw) into one large bool query.

It sounds like you want to follow this general process:

  1. Run a complex, custom search query.
  2. Examine the results to determine how much each vendor dominates within the result set.
  3. Reorder the results, boosting cars with more dominant vendors.

Solution 1 (most flexible, least performant)

You can get the information for step 2 using a terms aggregation on the vendor field.

Then you can re-query with the necessary derived boosts (costing a second round-trip).
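A rough sketch of those two round-trips, written as plain Python dicts in the Elasticsearch query DSL so it runs without a cluster. The helper names, the `max_boost` scaling and the un-analyzed sub-field `vendor.raw` are assumptions for illustration (the question's `vendor` field is edge-ngram analyzed, so aggregating on it directly would bucket by ngrams):

```python
def vendor_aggregation_request(user_query):
    """First round-trip: no hits, just vendor counts for the matching docs."""
    return {
        "size": 0,
        "query": user_query,
        "aggs": {"vendors": {"terms": {"field": "vendor.raw"}}},
    }

def boosted_request(user_query, buckets, max_boost=2.0):
    """Second round-trip: same query plus a boost per vendor, scaled by its share."""
    total = sum(bucket["doc_count"] for bucket in buckets) or 1
    should = [
        {"term": {"vendor.raw": {
            "value": bucket["key"],
            "boost": max_boost * bucket["doc_count"] / total,
        }}}
        for bucket in buckets
    ]
    return {"query": {"bool": {"must": [user_query], "should": should}}}

# Stand-in for the real bool query from the question.
user_query = {"match": {"model": "astr"}}
# Shape of response["aggregations"]["vendors"]["buckets"] from the first request.
buckets = [{"key": "Opel", "doc_count": 3}, {"key": "Chevrolet", "doc_count": 1}]

first = vendor_aggregation_request(user_query)
second = boosted_request(user_query, buckets)
print(second)
```

How strongly the vendor share should weigh against the text score (here `max_boost`) is a tuning decision, not something this sketch can settle.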

OR

Solution 2 (least flexible, most performant)

If you're content to let vendor popularity trump _score, you can do the following:

  • Run your current fuzzy match query with size 0, so it returns no top-level hits
  • ... with a Terms aggregation on vendor
  • ... ... with a Top Hits sub-aggregation sorted by _score descending.

Then your [astr] query results within the aggregation result will look like this:

[Opel bucket]
Astra 90hp
Astra 100hp diesel
Astra 120hp
Ascona 144hp (if you had fuzziness 2)
Ascona 230hp (if you had fuzziness 2)

[Chevrolet bucket]
Astro 120hp
Alero 140hp (if you had fuzziness 2)
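A sketch of what that request body and the handling of its response might look like, again as plain Python dicts. The aggregation names `by_vendor` and `best`, the helper names and the `vendor.raw` sub-field are assumptions; the terms aggregation orders buckets by document count descending by default, which is what puts the dominant vendor first:

```python
def grouped_request(user_query, vendors=10, hits_per_vendor=5):
    """Zero-hit search whose results live entirely inside the aggregation."""
    return {
        "size": 0,
        "query": user_query,
        "aggs": {
            "by_vendor": {
                # Buckets come back ordered by doc_count, largest first.
                "terms": {"field": "vendor.raw", "size": vendors},
                "aggs": {
                    "best": {
                        "top_hits": {
                            "size": hits_per_vendor,
                            "sort": [{"_score": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }

def flatten(response):
    """Turn the bucketed response into one list: popular vendors first."""
    ordered = []
    for bucket in response["aggregations"]["by_vendor"]["buckets"]:
        for hit in bucket["best"]["hits"]["hits"]:
            ordered.append(hit["_source"])
    return ordered

# Hand-written response fragment in the shape Elasticsearch returns.
fake_response = {"aggregations": {"by_vendor": {"buckets": [
    {"key": "Opel", "doc_count": 3, "best": {"hits": {"hits": [
        {"_score": 1.0, "_source": {"vendor": "Opel", "model": "Astra"}}]}}},
    {"key": "Chevrolet", "doc_count": 1, "best": {"hits": {"hits": [
        {"_score": 1.4, "_source": {"vendor": "Chevrolet", "model": "Astro"}}]}}},
]}}}

request = grouped_request({"match": {"model": "astr"}})
results = flatten(fake_response)
print([car["model"] for car in results])
```

Note that the Chevrolet hit has the higher _score in the fake response but still lands after the Opel hit, which is the "popularity trumps _score" trade-off of this solution.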

If you want to use document frequency to boost your results, try rolling your own script_score function inside a function_score clause. You can access document frequency of a term inside your scoring function via term statistics.

You may discover that an unintended consequence of this approach is that common/generic terms like Corp, Solutions, Computer, Inc, etc. will have an outsize influence on your score if you don't explicitly scrub them out as stopwords.

