Elasticsearch custom analyzer breaks words containing special characters

If a user searches for foo(bar), Elasticsearch breaks it into foo and bar.

What I'm trying to achieve is this: when a user types in, say, i want a foo(bar), I want to match exactly an item named foo(bar). The name is fixed, and it will be used by a filter, so the field is set to the keyword type.
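
For context, the exact-match filter I plan to run looks roughly like this (the index and field names here are simplified placeholders):

POST my-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "name": "foo(bar)" } }
      ]
    }
  }
}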

The approximate steps I took (see the sketch below the list):

  1. define a custom analyzer
  2. define a dictionary containing foo(bar)
  3. define a synonym mapping containing abc => foo(bar)
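
A minimal sketch of this setup (the filter name is a placeholder, and the synonym rule is inlined here instead of being loaded from the dictionary file):

PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["abc => foo(bar)"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["my_synonyms"]
        }
      }
    }
  }
}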

Now, when I search for abc, Elasticsearch translates it to foo(bar), but then breaks it into foo and bar.

The question, as you may have guessed, is: how do I preserve special characters in an Elasticsearch analyzer?

I tried using quotes (") in the dictionary file, like "foo(bar)", but it didn't work. Or is there maybe another way to work around this problem?

By the way, I'm using foo(bar) here just for simplicity; the actual case is much more complicated.

Thanks in advance.

You might want to use another tokenizer in your custom analyzer for your index.

For example, the standard tokenizer (the one used by the standard analyzer) splits on all non-word characters (roughly \W+):

POST _analyze
{
  "analyzer": "standard",
  "text": "foo(bar)"
}

==>

{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Compare this to a custom tokenizer that splits on all non-word characters except ( and ) (i.e. the pattern [^\w\(\)]+):

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w\\(\\)]+"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo(bar)"
}

==>

{
  "tokens" : [
    {
      "token" : "foo(bar)",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    }
  ]
}

I used a Pattern Tokenizer as an example to exclude certain symbols (( and ) in your case) from being used in tokenization.
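
To tie it back to your synonym setup: since foo(bar) now survives tokenization as a single token, a synonym rule like abc => foo(bar) should stay intact as well. Here is a sketch of the combined settings plus a field using them (the filter name, field name, and index name are placeholders, and the synonym is inlined rather than loaded from a file):

PUT my-index-000002
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["abc => foo(bar)"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w\\(\\)]+"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": ["my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST my-index-000002/_analyze
{
  "analyzer": "my_analyzer",
  "text": "abc"
}

Analyzing abc with this analyzer should then produce the single token foo(bar) rather than foo and bar.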
