
Mapping analyser for splitting string in Elastic search

Is it possible to create a mapping analyser that splits a string into smaller parts based on character counts?

For example, let's say I have the string "ABCD1E2F34". It is a token assembled from several smaller codes, and I want to break it back down into those codes.

If I know for sure that:

- the first code is always 4 characters ("ABCD")
- the second is 3 characters ("1E2")
- the third is 1 character ("F")
- the fourth is 2 characters ("34")

Can I create a mapping analyser for a field that maps the string like this? If I set the field "bigCode" to the value "ABCD1E2F34", I would be able to access it like this:

bigCode.full ("ABCD1E2F34")
bigCode.first ("ABCD")
bigCode.second ("1E2")
... 

Thanks a lot!
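Since the part lengths are fixed, the desired decomposition is just offset arithmetic. As a minimal sketch of the intended behaviour outside Elasticsearch (the lengths are taken from the question; `split_code` is a hypothetical helper, not an Elasticsearch API):

```python
def split_code(code, lengths=(4, 3, 1, 2)):
    """Split a fixed-format code into parts by character counts."""
    parts, pos = [], 0
    for n in lengths:
        parts.append(code[pos:pos + n])
        pos += n
    return parts

print(split_code("ABCD1E2F34"))  # ['ABCD', '1E2', 'F', '34']
```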

What do you think about the pattern tokenizer? I created a regex that splits the string into tokens: (?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2})). After that I created an analyzer like this:

PUT /myindex
{
    "settings": {
        "analysis": {
          "analyzer": {
            "codeanalyzer": {
              "type": "pattern",
              "pattern":"(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))"
            }
          }
        }
    }
}

POST /myindex/_analyze
{
    "analyzer": "codeanalyzer",
    "text": "ABCD1E2F34"
}

And the result is the tokenized data (note that the pattern analyzer lowercases tokens by default, which is why "ABCD" comes back as "abcd"):

{
  "tokens": [
    {
      "token": "abcd",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "1e2",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "f",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "34",
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 3
    }
  ]
}
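The split can be sanity-checked outside Elasticsearch as well. Python's `re` accepts these fixed-width lookbehinds; in the sketch below the capture groups are dropped (they would otherwise leak into `re.split`'s output) and each alternative is collapsed to a single overall width:

```python
import re

# Zero-width split points after 4, 7 and 8 characters, mirroring the
# lookbehind alternatives in the Elasticsearch pattern above.
pattern = r"(?<=^\w{4})|(?<=^\w{7})|(?<=^\w{8})"

print(re.split(pattern, "ABCD1E2F34"))  # ['ABCD', '1E2', 'F', '34']
```

Splitting on zero-width matches requires Python 3.7 or later.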

You can also check the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
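One detail left implicit above is attaching the analyzer to the field itself; the settings only define it. A hedged sketch of what that wiring might look like (typeless mapping syntax from Elasticsearch 7.x+; older versions nest the properties under a type name):

```
PUT /myindex
{
    "settings": {
        "analysis": {
          "analyzer": {
            "codeanalyzer": {
              "type": "pattern",
              "pattern":"(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "bigCode": {
                "type": "text",
                "analyzer": "codeanalyzer"
            }
        }
    }
}
```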
