Elastic: Treat symbol and html encoded symbol the same during search

My goal is to return the same results whether I search by a symbol or by its HTML-encoded version.

Example queries:

# searching with symbol
GET my-test-index/_search
{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "Hello®",
          "analyzer": "english_syn",
          "fields": [
            "AllContent"
          ]
        }
      }
    }
  }
}

# html symbol
GET my-test-index/_search
{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "Hello&reg;",
          "analyzer": "english_syn",
          "fields": [
            "AllContent"
          ]
        }
      }
    }
  }
}

I've tried a couple of different things.

Adding synonyms, but the two queries still produced different results.

#######################################
# Synonyms
# Symbols
#######################################
™, &trade;
®, &reg;

Creating a char_filter that replaces special characters, so that both queries would at least search for "Hello". But that comes with its own set of issues, which are out of scope for what I am trying to achieve.

"char_filter": {
    "specialCharactersFilter": {
        "type": "pattern_replace",
        "pattern": "[^A-Za-z0-9]",
        "replacement": " "
    }
}

I would appreciate any feedback on alternatives for achieving this goal, ideally a solution that covers more than just ® and ™.

What you are looking for is the html_strip char filter, which handles not just these two symbols but a broad range of HTML entities.
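One way to check this without creating an index is the _analyze API with an ad-hoc analyzer. The sketch below is not part of the original answer: it assumes the built-in keyword tokenizer, so that the whole input comes back as a single token, and it reuses the "Hello" text from the question. If html_strip decodes the entity as documented, both requests should return the same token.

```console
# Assumption: ad-hoc analyzer (keyword tokenizer + html_strip), used only for inspection
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "Hello&reg;"
}

# Same request with the literal symbol, for comparison
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "Hello®"
}
```

Swapping "keyword" for "standard" shows the tokens produced with the tokenizer used in the working example.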

Working example

Index mapping with the html_strip char filter:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "char_filter": [
                        "html_strip"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}

Index a sample document whose title is just the ™ symbol:

PUT 71622637/_doc/1
{
   "title" : "™"
}

Search on its HTML-encoded version:

GET 71622637/_search
{
    "query" :{
        "match" : {
            "title" : "&trade"
        }
    }
}

Search result:

"hits": [
    {
        "_index": "71622637",
        "_id": "1",
        "_score": 0.89701396,
        "_source": {
            "title": "™"
        }
    }
]

Similarly, search on the trademark symbol itself:

GET 71622637/_search
{
    "query" :{
        "match" : {
            "title" : "™"
        }
    }
}

Search result:

"hits": [
    {
        "_index": "71622637",
        "_id": "1",
        "_score": 0.89701396,
        "_source": {
            "title": "™"
        }
    }
]
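To apply this to the setup in the question, note that the queries there pass "analyzer": "english_syn", which overrides the field's search analyzer, so html_strip would have to be part of english_syn itself. The definition of english_syn is not shown in the question, so the following is only a sketch under assumptions: the index name my-test-index-v2 is hypothetical, and the filter list is a placeholder for whatever synonym and stemming filters the real analyzer already uses.

```console
# Assumptions: hypothetical index name; "filter" stands in for the existing
# english_syn token filters (synonyms, stemming), which are not shown in the question
PUT my-test-index-v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_syn": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "AllContent": {
        "type": "text",
        "analyzer": "english_syn"
      }
    }
  }
}
```

Because analysis settings only affect text indexed after they are in place, existing documents would need to be reindexed into the new index before the two queries from the question can be compared.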
