Search with asciifolding and UTF-8 characters in Elasticsearch

Question

I am indexing all the names on a web page, including names with accented characters such as "José". I want to be able to find this name by searching for either "Jose" or "José".

How should I set up my index mapping and analyzer(s) for a simple index with one field "name"?

I set up an analyzer for the name field like this:

"analyzer": {
  "folding": {
    "tokenizer": "standard",
    "filter": ["lowercase", "asciifolding"]
   }
 }

But this folds every accented character to its ASCII equivalent, so the "é" is dropped at index time and only "jose" is stored. I want the "é" to be kept in the index, and I want to be able to find "José" by searching for either "José" or "Jose".

Answer 1

You need to preserve the original token with the accent. To do that, define your own asciifolding token filter with preserve_original enabled, like this:

PUT /my_index
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "my_ascii_folding"]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "folding"
                }
            }
        }
    }
}

After that, both tokens, jose and josé, will be indexed and searchable.
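
To check the result, the _analyze API can show which tokens the analyzer emits. A minimal sketch, assuming the index was created as my_index with the settings above:

```json
GET /my_index/_analyze
{
  "analyzer": "folding",
  "text": "José"
}
```

The response should list both jose and josé at the same position. Since the same analyzer is applied to the query text by default, a match query on name for either "Jose" or "José" yields at least one token that exists in the index.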

Answer 2

This is what I can think of to resolve the folding problem with diacritical marks:

Analyzer used:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

Below is the mapping to be used:

{
  "properties": {
    "title": {
      "type":     "string",
      "analyzer": "standard",
      "fields": {
        "folded": {
          "type":     "string",
          "analyzer": "folding"
        }
      }
    }
  }
}

The title field uses the standard analyzer and will contain the original word with diacritics in place.
The title.folded field uses the folding analyzer, which strips the diacritical marks.
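
One way to see the difference between the two fields is to run the same text through each of them with the _analyze API. This is only a sketch: it assumes the settings and mapping above were applied to an index called my_index, a name borrowed from Answer 1 and not given in this answer.

```json
GET /my_index/_analyze
{
  "field": "title",
  "text": "José"
}

GET /my_index/_analyze
{
  "field": "title.folded",
  "text": "José"
}
```

The first request should return the token josé (the standard analyzer lowercases but keeps the accent), and the second should return jose.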

Below is the search query I will use:

{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "esta loca",
      "fields": [ "title", "title.folded" ]
    }
  }
}
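
Applied to the question's case, the same query shape might look like the sketch below. It assumes the mapping was adapted so that the field is called name with a name.folded sub-field, and that the index is called my_index; neither of these is shown in this answer.

```json
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "type":   "most_fields",
      "query":  "Jose",
      "fields": [ "name", "name.folded" ]
    }
  }
}
```

With most_fields, a document containing "José" matches the query "Jose" through name.folded only, while the query "José" matches through both fields and so tends to score higher.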

2 answers

solution1
7 ACCEPTED 2017-07-18 15:04:10

solution2
0 2017-07-22 11:48:17