
Elasticsearch match exact terms with spaces across different fields

My data in Elasticsearch is set up with different fields: categories, subcategories, instruments and moods. My goal is to return only results that exactly match every keyword passed in. So far this works, until I use a keyword that consists of multiple words separated by a space, like so:

"query": {
    "bool": {
      "must": [
        {
          "match": {
            "categories": "Electronic"
          }
        },
        {
          "match": {
            "categories": "Pop"
          }
        },
        {
          "match": {
            "instruments": "Female Vocal"
          }
        }
      ]
    }
}

My data in ES consists of this type of data:

[name] => Some Data Name
[categories] => Electronic,Pop
[subcategories] => 1970s,Alternative,Experimental,Retro
[instruments] => Electronic Drums,Male Vocal,Synth
[moods] => Fun,Futuristic,Pulsing,Quirky,Rhythmic

So, it's matching the "Vocal" part of the instruments field, but doesn't perform an exact match for "Female Vocal".

Would this be solved by an ES filter perhaps?

EDIT: To account for other characters, I expanded the sample data set a bit:

[categories] => R&B,Dance/House
[instruments] => Electronic Drums,Male Vocal,Synth
[moods] => Fun,Futuristic,Pulsing,Quirky,Rhythmic

So ampersands, slashes and spaces may appear within a keyword; commas separate individual terms.

SOLVED: I ended up looking more into analyzers and realized that I probably needed to create a custom one to account for the boundaries of my keywords.

myesurl/tracks/_settings
{
  "index": {
    "analysis": {
      "tokenizer": {
        "comma": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "tracks_analyzer": {
          "type": "custom",
          "tokenizer": "comma",
          "filter": [
            "trim",
            "lowercase"
          ]
        }
      }
    }
  }
}
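To sanity-check the analyzer, you can run sample input through the _analyze API. A quick sketch, assuming the index is called tracks as above and an older (1.x-era) cluster to match the string mappings used in this thread:

curl -XGET 'myesurl/tracks/_analyze?analyzer=tracks_analyzer' -d 'Electronic Drums,Male Vocal,Synth'

This should return the three tokens electronic drums, male vocal and synth: trimmed and lowercased, but with their internal spaces intact.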

Then I set up a mapping:

{
  "track": {
    "properties": {
      "categories": {
        "type": "string",
        "analyzer": "tracks_analyzer"
      },
      "subcategories": {
        "type": "string",
        "analyzer": "tracks_analyzer"
      },
      "instruments": {
        "type": "string",
        "analyzer": "tracks_analyzer"
      },
      "moods": {
        "type": "string",
        "analyzer": "tracks_analyzer"
      }
    }
  }
}

Then I pushed the content back into Elasticsearch, and it seems to work as intended. It now accounts for any character in a keyword, as long as the keyword matches a token created by splitting on commas.
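For reference, re-pushing a document looks something like this (a sketch using the sample data from the question, with the myesurl host and track type from above):

curl -XPOST 'myesurl/tracks/track' -d '{
  "name": "Some Data Name",
  "categories": "Electronic,Pop",
  "subcategories": "1970s,Alternative,Experimental,Retro",
  "instruments": "Electronic Drums,Male Vocal,Synth",
  "moods": "Fun,Futuristic,Pulsing,Quirky,Rhythmic"
}'

With this in place, the original bool/must match queries behave as intended: "Female Vocal" is analyzed at search time by tracks_analyzer into the single token female vocal, which matches the indexed token female vocal but not male vocal.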

Using match queries means that whatever string you pass in is analyzed by the field's analyzer (the standard analyzer by default), and thus split on whitespace and lowercased. As you can see, you're fine as long as you're matching a single word per field; the fun starts whenever what you're searching for contains spaces.

What happens is that at indexing time, Female Vocal is split into the two tokens female and vocal and indexed into the instruments field. The same goes for Male Vocal, which is indexed as the two tokens male and vocal. Then, when you match on Female Vocal, the search string is split and lowercased as well, into female and vocal, and the term vocal alone matches documents containing either Male Vocal or Female Vocal.
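You can see this tokenization for yourself with the _analyze API (a sketch, again assuming a 1.x-era cluster):

curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'Female Vocal'

which returns the two tokens female and vocal.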

If you want exact matching, you need two things:

1. Declare the string fields you need to match exactly as not_analyzed in your mapping.
2. Use term queries (or term filters), which do not analyze the search terms.

The first point is easily made with such a mapping:

curl -XPUT localhost:9200/my_index -d '{
   "mappings": {
       "my_type": {
           "properties": {
               "categories": {
                   "type": "string",
                   "index": "not_analyzed"
               },
               "subcategories": {
                   "type": "string",
                   "index": "not_analyzed"
               },
               "instruments": {
                   "type": "string",
                   "index": "not_analyzed"
               },
               "moods": {
                   "type": "string",
                   "index": "not_analyzed"
               },
               ...
           }
       }
   }
}'

With such a mapping, Female Vocal will not be analyzed (i.e., not indexed as female and vocal) but will be indexed verbatim as Female Vocal.

Then you can query exact field values with a query like this:

curl -XPOST localhost:9200/my_index/my_type/_search -d '{
    "query": {
        "bool": {
          "must": [
            {
              "term": {
                "categories": "Electronic"
              }
            },
            {
              "term": {
                "categories": "Pop"
              }
            },
            {
              "term": {
                "instruments": "Female Vocal"
              }
            }
          ]
        }
    }
}'
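If you don't need scoring, the term filter variant mentioned above looks like this; a sketch in the pre-2.x filtered query syntax, which matches the string mapping used here:

curl -XPOST localhost:9200/my_index/my_type/_search -d '{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        { "term": { "categories": "Electronic" } },
                        { "term": { "categories": "Pop" } },
                        { "term": { "instruments": "Female Vocal" } }
                    ]
                }
            }
        }
    }
}'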


A nice solution is to use match with minimum_should_match, providing the percentage of the words you want to match. It can be 100%, which returns only results containing all of the given words.

Note that this approach does NOT consider the order of the words.

"query":{
  "bool":{
     "should":[
        {
           "match":{
              "my_text":{
                 "query":"I want to buy a new new car",
                 "minimum_should_match":"90%"
              }
           }
        }
     ]
  }
}
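Applied to the instruments field from the question, that would look something like this (a sketch; note that it matches individual tokens, not whole keywords):

"query": {
  "match": {
    "instruments": {
      "query": "Female Vocal",
      "minimum_should_match": "100%"
    }
  }
}

Be aware that with the standard analyzer this only requires both tokens female and vocal to appear somewhere in the field, so a document tagged Male Vocal,Female Guitar would still match; the custom comma analyzer or not_analyzed approaches above are the ones that treat Female Vocal as a single unit.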
