I am indexing all the names on a web page, some of which contain accented characters, like "José". I want to be able to search for this name with both "Jose" and "José".
How should I set up my index mapping and analyzer(s) for a simple index with one field "name"?
I set up an analyzer for the name field like this:
"analyzer": {
  "folding": {
    "tokenizer": "standard",
    "filter": ["lowercase", "asciifolding"]
  }
}
But it folds every accented character into its ASCII equivalent, so the "é" is dropped at index time. I want the "é" to be kept in the index, and I want to be able to find "José" by searching for either "José" or "Jose".
You need to preserve the original token with the accent. To achieve that, define your own asciifolding token filter with preserve_original enabled, like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_ascii_folding"]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "folding"
        }
      }
    }
  }
}
After that, both tokens, jose and josé, will be indexed and searchable.
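To see why both spellings become searchable, here is a rough Python sketch of what the analyzer chain above does to a name. It is only an approximation: the whitespace split stands in for the standard tokenizer, and the `ascii_fold` and `analyze` helpers are hypothetical names that imitate the asciifolding filter by stripping combining marks. This is not how Elasticsearch implements the filter, and it does not cover every character the real filter folds.

```python
import unicodedata


def ascii_fold(token):
    """Approximate asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, preserve_original):
    """Hypothetical stand-in for the 'folding' analyzer defined above."""
    tokens = []
    for word in text.split():        # crude substitute for the standard tokenizer
        lowered = word.lower()       # lowercase filter
        folded = ascii_fold(lowered)
        tokens.append(folded)
        # With preserve_original, the accented form is kept next to the folded one.
        if preserve_original and folded != lowered:
            tokens.append(lowered)
    return tokens


print(analyze("José", preserve_original=False))
print(analyze("José", preserve_original=True))
```

With preserve_original off, only the folded token is produced for "José"; with it on, the accented token is kept as well, so the index holds both forms.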
This is what I can think of to resolve the folding problem with diacritical marks:
Analyzer used:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
Mapping used (on Elasticsearch 5.0 and later, replace "string" with "text"):
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "standard",
      "fields": {
        "folded": {
          "type": "string",
          "analyzer": "folding"
        }
      }
    }
  }
}
Below is the search query I will use:
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "esta loca",
      "fields": ["title", "title.folded"]
    }
  }
}
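The idea behind querying both fields with most_fields is that a document matching with the accent intact matches in title and in title.folded, while a document that matches only after folding matches in title.folded alone, so the accent-exact hit should tend to rank higher. Below is a toy Python sketch of that effect. The scoring is an assumption made for illustration (one point per matching term per field, not Elasticsearch's real relevance scoring), and the helper names and sample titles are hypothetical.

```python
import unicodedata


def fold(token):
    """Approximate asciifolding by dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def standard_terms(text):
    """Rough stand-in for the 'title' field: lowercased, accents kept."""
    return {word.lower() for word in text.split()}


def folded_terms(text):
    """Rough stand-in for the 'title.folded' field: lowercased and folded."""
    return {fold(word.lower()) for word in text.split()}


def most_fields_score(query, title):
    """Toy score: matching terms in 'title' plus matching terms in 'title.folded'."""
    exact = len(standard_terms(query) & standard_terms(title))
    folded = len(folded_terms(query) & folded_terms(title))
    return exact + folded


# Hypothetical documents, scored against the query from the example above.
for title in ["Esta loca", "Está loca"]:
    print(title, most_fields_score("esta loca", title))
```

In this sketch both titles match the query "esta loca", but "Esta loca" scores higher because it also matches in the unfolded field.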