Search for exact phrase with Elasticsearch

I am just starting out with Elasticsearch. I've indexed a few EDIFACT messages (a prehistoric data format ;-)). The content looks something like this:

UNB+UNOA:2+SENDER+RECEIVER+170509:0050+152538'
UNH+66304+CODECO:D:95B:UN:ITG12'
BGM+34+INGATE OF UCN ABCD+9'

When I search for the phrase UNH+66304+CODECO:D:95B it should return only one hit, but it seems to return every document that contains any of these words (and UNH is in every single one of the documents). My query is this:

curl -XGET --netrc-file ~/curl_user  'localhost:9200/edi/message/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query":{
        "match":{"MESSAGE":"UNH+66304+CODECO:D:95B"}
    }
}'

I've tried to add the "and" operator like this:

"match":{
              "MESSAGE":{
                "query":"UNH+66304+CODECO",
                "operator": "and"

              }
            }

But then no results are returned. I've read the suggestion in "Searching for exact phrase" that I need to use double quotes. I've tried both "query":"'UNH+66304+CODECO'" and "query":"\"UNH+66304+CODECO\"" but it doesn't make a difference.

I have also tried match_phrase:

"match_phrase":{
              "MESSAGE":{
                "query":"UNH+66304+CODECO"

              }
            }

does not return a result, while

"match_phrase":{
              "MESSAGE":{
                "query":"UNH+66304"

              }
            }

does. 做。 With normal text it seems to work but somehow Elasticsearch doesn't like it with the +: etc in the search string (that is unfortunately part of EDIFACT). 对于普通文本,它似乎可以工作,但是以某种方式,Elasticsearch不喜欢在搜索字符串中使用+:等(不幸的是,它是EDIFACT的一部分)。

Another question, "How to make query_string search exact phrase in ElasticSearch", talks about using a different analyser if you want exact matches?

Update: abhishek mishra confirmed that the analyser is probably the way to go. I am using Elasticsearch 5.4 and there are a lot of analysers to choose from: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

The Keyword Analyser would probably map to what abhishek suggested as 'not analysed', as it is a no-op analyser. However, I am a bit worried about using it, as the messages can be quite long. What is the performance impact on search? If I use the Keyword Analyser, will I still be able to search for parts of the whole message?

I am wondering whether the Pattern Analyser would be a good fit. EDIFACT messages consist of segments that start with 3 upper-case characters and are terminated by ' (but you can escape ' by prefixing it with ?):

FTX+AAA++It?'s a strange data format'
FTX+AAA++Yes it is'

So the example above would be two segments. If I used a pattern that splits the message into these segments, would that be a good match?
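To make the segment rule concrete, here is a rough Python sketch (not Elasticsearch configuration) that splits on every ' that is not preceded by the ? release character. The function name split_segments is made up for this illustration, and the regex is a simplification: it does not handle an escaped question mark (??) directly before a terminator.

```python
import re

def split_segments(message: str) -> list[str]:
    """Split an EDIFACT message into segments on unescaped apostrophes."""
    # (?<!\?)' matches an apostrophe that is NOT preceded by '?'.
    # Simplification: "??'" (escaped '?' followed by a real terminator)
    # is not treated as a segment end by this pattern.
    parts = re.split(r"(?<!\?)'", message)
    # Drop surrounding whitespace/newlines and the empty tail after the
    # final terminator.
    return [part.strip() for part in parts if part.strip()]

message = "FTX+AAA++It?'s a strange data format'\nFTX+AAA++Yes it is'"
for segment in split_segments(message):
    print(segment)
```

The same lookbehind expression could in principle be tried as the pattern of a custom pattern analyser, but that is an untested assumption here.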

The only problem is that currently the MESSAGE field can contain both EDIFACT messages and XML messages. I guess the same Pattern Analyser would not work for both, so I would have to create two different types depending on the content of the MESSAGE field (everything else is the same).

2nd Update: I have followed the advice to look into analysers. I thought the keyword analyser was probably not a good idea, as the text can be quite long. I've found that the pattern analyser (without any custom pattern) works quite nicely. It splits everything on : and +. Searches like

{
    "query":{
        "match_phrase":{"MESSAGE":"RFF+ABT:ATB150538080520172452"}
    }
}

work now. The problem before was that e.g. RFF+ABT:ATB150538080520172452 was split up into [rff, abt:atb150538080520172452].

You were on the right track about the analyzer. If you look at your type mapping, the property MESSAGE is probably marked as analyzed. This is why the special characters are stripped when the field is indexed. You need to mark it as not_analyzed.

If you let us know what your type mapping looks like, I can help you with the correct setting.

One example:

If your ES version is < 5.0 and your type mapping looks similar to this:

{
  "MESSAGE": {
    "type": "string",
    "index": "analyzed"
  }
}

change it to

{
  "MESSAGE": {
    "type": "string",
    "index": "not_analyzed"
  }
}

The solution was to use the pattern analyser. Without any further configuration (no custom pattern specified, so the default \W+ applies) it breaks up the EDIFACT message along non-word characters.

The problem with the standard analyser was that it behaved oddly with ':'. For example, RFF+ATB:AB12345 was broken up into [rff, atb:ab12345], so a search for ab12345 did not return anything.
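As a rough local illustration of why the pattern analyser helps, the Python sketch below imitates its default behaviour (lowercase, then split on \W+) and then checks a phrase the way match_phrase conceptually does, by looking for the query tokens as a contiguous run in the document tokens. Both function names are invented for this example, and it is only an approximation: Elasticsearch runs Java regular expressions and tracks token positions internally.

```python
import re

def pattern_analyze(text: str) -> list[str]:
    # Approximation of the default pattern analyser: lowercase the
    # input and split on runs of non-word characters (\W+).
    return [token for token in re.split(r"\W+", text.lower()) if token]

def phrase_matches(document: str, phrase: str) -> bool:
    # A phrase matches when its tokens occur next to each other, in
    # order, somewhere in the document's token list.
    doc_tokens = pattern_analyze(document)
    query_tokens = pattern_analyze(phrase)
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

print(pattern_analyze("RFF+ATB:AB12345"))
print(phrase_matches("UNH+66304+CODECO:D:95B:UN:ITG12'", "UNH+66304+CODECO:D:95B"))
```

With this splitting, ab12345 becomes a token of its own, which is why a search for it can match.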

You can test how an analyser or tokenizer works by using

curl -XPOST --netrc-file ~/curl_user 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text":      "UNB+UNOA:2+SENDER+RECEIVER+170513:0452+129910165"
}'

You can replace 'analyzer' with 'tokenizer' if you just want to test a tokenizer on its own.
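If you want to compare several analysers and tokenizers, the request body can be generated instead of edited by hand. The helper below is a hypothetical convenience (the name analyze_body is not part of any Elasticsearch client); it only builds the JSON that would be passed to curl with -d, using either the "analyzer" or the "tokenizer" key.

```python
import json

def analyze_body(text, analyzer=None, tokenizer=None):
    """Build a JSON body for the _analyze endpoint (hypothetical helper)."""
    # Exactly one of the two must be given, mirroring the swap of
    # "analyzer" for "tokenizer" described above.
    if (analyzer is None) == (tokenizer is None):
        raise ValueError("pass exactly one of analyzer or tokenizer")
    if analyzer is not None:
        body = {"analyzer": analyzer, "text": text}
    else:
        body = {"tokenizer": tokenizer, "text": text}
    return json.dumps(body)

sample = "UNB+UNOA:2+SENDER+RECEIVER+170513:0452+129910165"
print(analyze_body(sample, analyzer="standard"))
print(analyze_body(sample, tokenizer="pattern"))
```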

I think you have "query" and "match_phrase" inverted:

Can you try it like this:

{
    "query": {
        "match_phrase": {
            "MESSAGE": "UNH+66304"
        }
    }
}
