
Search with asciifolding and UTF-8 characters in Elasticsearch

I am indexing all the names on a web page, including names with accented characters such as "José". I want to be able to find this name by searching for either "Jose" or "José".

How should I set up my index mapping and analyzer(s) for a simple index with one field "name"?

I set up an analyzer for the name field like this:

"analyzer": {
  "folding": {
    "tokenizer": "standard",
    "filter": ["lowercase", "asciifolding"]
  }
}

But this folds every accented character into its ASCII equivalent, so the "é" is dropped at index time. I want the "é" to stay in the index, and I want to find "José" by searching for either "José" or "Jose".

You need to preserve the original token with the accent. To do that, define a custom asciifolding token filter with preserve_original set to true, like this:

PUT /my_index
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "my_ascii_folding"]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "folding"
                }
            }
        }
    }
}

After that, both tokens, jose and josé, will be indexed and searchable.
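
One way to check this is the _analyze API, followed by a quick search. The sketch below assumes the my_index index created above. The document ID 1 is an arbitrary choice, and the my_type path segment follows the mapping in this answer; on Elasticsearch 7 and later, where mapping types were removed, the path would be /my_index/_doc/1 instead. The refresh parameter is only there so the document is searchable immediately.

```
# Show the tokens the "folding" analyzer emits for an accented name
GET /my_index/_analyze
{
  "analyzer": "folding",
  "text": "José"
}

# Index a sample document (ID 1 is an arbitrary choice)
PUT /my_index/my_type/1?refresh=true
{
  "name": "José"
}

# Search with the unaccented spelling; "José" should match the same document
GET /my_index/_search
{
  "query": {
    "match": { "name": "Jose" }
  }
}
```

The _analyze response should list both jose and josé at the same position, which is why either spelling matches the document.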

This is what I can think of to resolve the folding problem with diacritical marks:

Analyzer used:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

Below is the mapping to be used:

{
  "properties": {
    "title": {
      "type":     "string",
      "analyzer": "standard",
      "fields": {
        "folded": {
          "type":     "string",
          "analyzer": "folding"
        }
      }
    }
  }
}
  • The title field uses the standard analyzer and will contain the original word with diacritics in place.
  • The title.folded field uses the folding analyzer, which strips the diacritical marks.
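
To see the difference between the two analyzers, the _analyze API can be run without an index, since both chains use only built-in components. This is a sketch assuming Elasticsearch 5 or later, where _analyze accepts a JSON body with tokenizer and filter; the sample text is an arbitrary choice.

```
# Standard analyzer: lowercases but keeps the diacritic
GET /_analyze
{
  "analyzer": "standard",
  "text": "Está loca"
}

# Same chain as the "folding" analyzer: lowercases and strips the diacritic
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Está loca"
}
```

The first request should return the tokens está and loca, the second esta and loca.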

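Below is the search query I will use. It searches both fields with most_fields, so a document that matches the accented form in title as well as the folded form in title.folded scores higher than one that matches only the folded form: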

{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "esta loca",
      "fields": [ "title", "title.folded" ]
    }
  }
}

