uax_url_email tokenizer in elasticsearch generates multiple tokens for emails with special characters

I use the uax_url_email tokenizer for email fields in our index. It works perfectly and generates a single token for normal emails like johndoe@yahoo.com. However, it generates multiple tokens when the email contains foreign or special characters. Is there a solution for this? I don't want multiple tokens generated.

PUT email-test-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "email_analyzer": {
            "filter": ["lowercase"],
            "tokenizer": "email_tokenizer"
          }
        },
        "tokenizer": {
          "email_tokenizer": {
            "type": "uax_url_email"
          }
        }
      }
    }
  },
  "mappings": {
    "date_detection": false,
    "numeric_detection": false,
    "properties": {
      "EMAIL": {
        "type": "text",
        "store": true,
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "email_analyzer"
      }
    }
  }
}

When it works:

GET email-test-index/_analyze
{
  "field": "EMAIL",
  "text": "johndoe@yahoo.com"
}

{
  "tokens" : [
    {
      "token" : "johndoe@yahoo.com",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "<EMAIL>",
      "position" : 0
    }
  ]
}

When it does not work:

GET email-test-index/_analyze
{
  "field": "EMAIL",
  "text": "johndoeó8@yahoo.com"
}

{
  "tokens" : [
    {
      "token" : "johndoeó8",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "yahoo.com",
      "start_offset" : 10,
      "end_offset" : 19,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}

TL;DR

You cannot, without getting rid of the special character. I might be wrong, but I don't think such characters are even allowed by the email standard.

Solution

You could use the mapping char filter to catch all non-ASCII characters and map them to their ASCII equivalents.

POST _analyze
{
  "tokenizer": "uax_url_email",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "ó => o"
      ]
    }
  ],
  "text": "Email me at johndóe8@yahoo.com"
}

{
  "tokens": [
    {
      "token": "Email",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "me",
      "start_offset": 6,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "at",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "johndoe8@yahoo.com",
      "start_offset": 12,
      "end_offset": 30,
      "type": "<EMAIL>",
      "position": 3
    }
  ]
}

Notice that the ó has been replaced by 'o'.
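
To use this fix at index time rather than only in an ad-hoc _analyze call, the char filter can be wired into the analyzer from the question. A minimal sketch, assuming the index is recreated (PUT fails on an existing index; the mappings block from the question is unchanged and omitted for brevity). The filter name ascii_mapper is our own choice, and note that the mapping char filter has no catch-all rule, so every character you want folded must be listed explicitly:

PUT email-test-index
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "ascii_mapper": {
            // hypothetical name; list every mapping you need
            "type": "mapping",
            "mappings": [
              "ó => o"
            ]
          }
        },
        "analyzer": {
          "email_analyzer": {
            "char_filter": ["ascii_mapper"],
            "filter": ["lowercase"],
            "tokenizer": "email_tokenizer"
          }
        },
        "tokenizer": {
          "email_tokenizer": {
            "type": "uax_url_email"
          }
        }
      }
    }
  }
}

With this in place, the failing example from the question should produce a single <EMAIL> token (johndoeo8@yahoo.com, after lowercasing). A token filter such as asciifolding would not help here: token filters run after the tokenizer, so the email would already be split before any folding happens, which is why the replacement has to be done in a char filter.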
