
Search for exact phrase with Elasticsearch

I am currently starting out with Elasticsearch. I've indexed a few EDIFACT messages (a prehistoric data format ;-)). The content looks something like this:

UNB+UNOA:2+SENDER+RECEIVER+170509:0050+152538'
UNH+66304+CODECO:D:95B:UN:ITG12'
BGM+34+INGATE OF UCN ABCD+9'

When I do a search for the phrase UNH+66304+CODECO:D:95B it should only return one hit, but it seems to return all files that contain any of these words (and UNH is in every single one of the documents). My query is this:

curl -XGET --netrc-file ~/curl_user  'localhost:9200/edi/message/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query":{
        "match":{"MESSAGE":"UNH+66304+CODECO:D:95B"}
    }
}'

I've tried to add the "and" operator like this:

"match":{
              "MESSAGE":{
                "query":"UNH+66304+CODECO",
                "operator": "and"

              }
            }

But then no results are returned. I've read the suggestion in "Searching for exact phrase" that I need to use double quotes. I've tried both "query":"'UNH+66304+CODECO'" and "query":"\\"UNH+66304+CODECO\\"" but it doesn't make a difference.
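(From what I understand, quotes only get phrase semantics in a query_string query; match just treats them as plain text to analyse. A sketch of the quoted variant with query_string, which is still run through the field's analyser and so hits the same tokenisation problem, would be something like this:)

curl -XGET --netrc-file ~/curl_user 'localhost:9200/edi/message/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query":{
        "query_string":{
            "default_field":"MESSAGE",
            "query":"\"UNH+66304+CODECO\""
        }
    }
}'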

I have also tried match_phrase

"match_phrase":{
              "MESSAGE":{
                "query":"UNH+66304+CODECO"

              }
            }

does not return a result while

"match_phrase":{
              "MESSAGE":{
                "query":"UNH+66304"

              }
            }

does. With normal text it seems to work, but somehow Elasticsearch doesn't like the +, : etc. in the search string (which are unfortunately part of EDIFACT).

The question "How to make query_string search exact phrase in ElasticSearch" talks about using a different analyser if you want exact matches.

Update: abhishek mishra confirmed that the Analyser is probably the way to go. I am using Elasticsearch 5.4 and there are a lot of Analysers to choose from: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

The Keyword Analyser would probably map to what abhishek suggested as 'not analysed', as it is a no-op Analyser. However, I am a bit worried about using this as the messages can be quite long. What are the performance impacts for the search? If I use the Keyword Analyser, will I still be able to search for parts of the whole message?

I am wondering whether the Pattern Analyser would be a good fit. EDIFACT messages consist of segments that start with 3 upper-case characters and are terminated by ' (but you can escape ' by prefixing it with ?):

FTX+AAA++It?'s a strange data format'
FTX+AAA++Yes it is'

So the example above would be two segments. If I used a pattern that splits these segments, would that be a good match?

The only problem is that currently the MESSAGE field can contain EDIFACT messages and XML messages. Using the same Pattern Analyser would not work, I guess, so I would have to create two different types depending on the content of the MESSAGE field (all the rest is the same).
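For reference, a custom pattern analyser splitting on the segment terminator could look roughly like this in 5.x, sent as the body of a PUT when creating an index (the index would be created fresh; the analyser name edifact_segments and the regex, which splits on a ' that is not preceded by the ? escape character, are just illustrative assumptions):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "edifact_segments": {
          "type": "pattern",
          "pattern": "(?<!\\?)'"
        }
      }
    }
  },
  "mappings": {
    "message": {
      "properties": {
        "MESSAGE": { "type": "text", "analyzer": "edifact_segments" }
      }
    }
  }
}

With that pattern each whole segment would become a single (lowercased) token, so searching for parts of a segment would then need wildcard-style queries; the unconfigured pattern analyser keeps the smaller tokens instead.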

2nd Update: I have followed the advice to look into analysers. I thought the keyword analyser was probably not a good idea as the text can be quite long. I've found that the pattern analyser (without any custom pattern) works quite nicely. It splits everything up on : and +. Searches like

{
    "query":{
        "match_phrase":{"MESSAGE":"RFF+ABT:ATB150538080520172452"}
    }
}


work now. The problem before was that e.g. RFF+ABT:ATB150538080520172452 was split up into [rff, abt:atb150538080520172452].
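For completeness, wiring the built-in pattern analyser onto the field looks roughly like this in a 5.x index creation body (the type name message is taken from the curl examples above; the exact mapping is an assumption):

{
  "mappings": {
    "message": {
      "properties": {
        "MESSAGE": { "type": "text", "analyzer": "pattern" }
      }
    }
  }
}

Note that the analyser of an existing field cannot be changed, so this has to be set when the index is (re)created and the documents reindexed.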

You were on the right track about the analyzer. If you look into your type mapping, the property MESSAGE is probably marked as analyzed. This is why it gets rid of the special characters when indexing. You need to mark it as not_analyzed.

If you let us know what your type mapping looks like I can help you with the correct setting.

One of the examples -

If your ES version is < 5.0 and your type mapping looks similar to this -

{
  "MESSAGE": {
    "type": "string",
    "index": "analyzed"
  }
}

change it to

{
  "MESSAGE": {
    "type": "string",
    "index": "not_analyzed"
  }
}
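Since the question mentions Elasticsearch 5.4: in 5.x the string type was split into text and keyword, so the not_analyzed equivalent there would be roughly:

{
  "MESSAGE": {
    "type": "keyword"
  }
}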

The solution was to use the pattern analyser. Without having to configure it further (no custom pattern specified) it breaks up the EDIFACT message along non-word/number characters.

The problem with the standard analyser was that it behaved oddly with ':'. So if you e.g. had RFF+ATB:AB12345, it broke it up into [rff, atb:ab12345], so a search for ab12345 did not return anything.

You can test how an analyser or tokenizer works by using:

curl -XPOST --netrc-file ~/curl_user 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text":      "UNB+UNOA:2+SENDER+RECEIVER+170513:0452+129910165"
}'

You can replace 'analyzer' with 'tokenizer' if you just want to test the tokenizer used.
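For comparison, the same call with the pattern analyser (which, as described above, splits on non-word/number characters) would be, as a sketch:

# same text as above, but analysed with the built-in pattern analyser
curl -XPOST --netrc-file ~/curl_user 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text":      "UNB+UNOA:2+SENDER+RECEIVER+170513:0452+129910165"
}'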

I think you have "query" and "match_phrase" inverted.

Can you try it like this:

{
    "query": {
        "match_phrase": {
            "MESSAGE": "UNH+66304"
        }
    }
}
