uax_url_email tokenizer in elasticsearch generates multiple tokens for emails with special characters

I use the uax_url_email tokenizer for email fields in our index. It works perfectly and generates a single token for normal emails like johndoe@yahoo.com. However, it generates multiple tokens when the email contains foreign or special characters. Is there a solution for this? I don't want multiple tokens generated.

PUT email-test-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "email_analyzer": {
            "filter": ["lowercase"],
            "tokenizer": "email_tokenizer"
          }
        },
        "tokenizer": {
          "email_tokenizer": {
            "type": "uax_url_email"
          }
        }
      }
    }
  },
  "mappings": {
    "date_detection": false,
    "numeric_detection": false,
    "properties": {
      "EMAIL": {
        "type": "text",
        "store": true,
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "email_analyzer"
      }
    }
  }
}

When it works:

GET email-test-index/_analyze
{
  "field": "EMAIL",
  "text": "johndoe@yahoo.com"
}

{
  "tokens" : [
    {
      "token" : "johndoe@yahoo.com",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "<EMAIL>",
      "position" : 0
    }
  ]
}

When it does not work:

GET email-test-index/_analyze
{
  "field": "EMAIL",
  "text": "johndoeó8@yahoo.com"
}

{
  "tokens" : [
    {
      "token" : "johndoeó8",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "yahoo.com",
      "start_offset" : 10,
      "end_offset" : 19,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}

TL;DR

You cannot, without getting rid of the special character. I might be wrong, but I don't think such characters are even allowed by the email standard.

Solution

You could use the mapping char filter to catch all non-ASCII characters and map them to their ASCII equivalents.

POST _analyze
{
  "tokenizer": "uax_url_email",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "ó => o"
      ]
    }
  ],
  "text": "Email me at johndóe8@yahoo.com"
}

{
  "tokens": [
    {
      "token": "Email",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "me",
      "start_offset": 6,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "at",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "johndoe8@yahoo.com",
      "start_offset": 12,
      "end_offset": 30,
      "type": "<EMAIL>",
      "position": 3
    }
  ]
}

Notice that the ó has been replaced by 'o'.
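
To use this fix at index time rather than only in an ad-hoc _analyze call, the char filter can be wired into the analyzer from the question. A minimal sketch, assuming the index is recreated (PUT fails on an existing index; the mappings block from the question is unchanged and omitted for brevity). The filter name ascii_mapper is our own choice, and note that the mapping char filter has no catch-all rule, so every character you want folded must be listed explicitly:

PUT email-test-index
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "ascii_mapper": {
            // hypothetical name; list every mapping you need
            "type": "mapping",
            "mappings": [
              "ó => o"
            ]
          }
        },
        "analyzer": {
          "email_analyzer": {
            "char_filter": ["ascii_mapper"],
            "filter": ["lowercase"],
            "tokenizer": "email_tokenizer"
          }
        },
        "tokenizer": {
          "email_tokenizer": {
            "type": "uax_url_email"
          }
        }
      }
    }
  }
}

With this in place, the failing example from the question should produce a single <EMAIL> token (johndoeo8@yahoo.com, after lowercasing). A token filter such as asciifolding would not help here: token filters run after the tokenizer, so the email would already be split before any folding happens, which is why the replacement has to be done in a char filter.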
