Preserving order of terms in ElasticSearch query

Is it possible in ElasticSearch to form a query that preserves the order of the terms?

A simple example would be having these documents indexed using the standard analyzer:

  1. You know for search
  2. You know search
  3. Know search for you

I could query for +you +search and this would return all of the documents, including the third one.

What if I wanted to retrieve only the documents that have the terms in this specific order? Can I form a query that would do that for me?

Since this is possible for phrases by simply quoting the text, e.g. "you know" (which retrieves the 1st and 2nd docs), it feels like there should be a way to preserve the order of multiple terms that aren't adjacent.

In the simple example above I could use proximity searches, but that doesn't cover more complex cases.

You could use a span_near query; it has an in_order parameter.

{
    "query": {
        "span_near": {
            "clauses": [
                {
                    "span_term": {
                        "field": "you"
                    }
                },
                {
                    "span_term": {
                        "field": "search"
                    }
                }
            ],
            "slop": 2,
            "in_order": true
        }
    }
}
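For more than two terms, the same request body can be generated, and the ordering rule can be checked offline. The following is only a rough Python sketch: `span_near_query` and `ordered_near` are hypothetical helper names, the whitespace tokenizer is a stand-in for the standard analyzer, and the slop rule (unmatched positions between the first and last term) is a simplified model of how an ordered span_near behaves, not the actual Lucene implementation.

```python
from itertools import product


def span_near_query(field, terms, slop):
    """Build a span_near request body with one span_term clause per term."""
    return {
        "query": {
            "span_near": {
                "clauses": [{"span_term": {field: term}} for term in terms],
                "slop": slop,
                "in_order": True,
            }
        }
    }


def ordered_near(text, terms, slop):
    """Simplified model: the terms must appear in the given order, with at
    most `slop` unmatched positions between the first and the last term."""
    tokens = text.lower().split()  # crude stand-in for the standard analyzer
    position_lists = [
        [i for i, token in enumerate(tokens) if token == term] for term in terms
    ]
    for combo in product(*position_lists):
        in_order = all(a < b for a, b in zip(combo, combo[1:]))
        if in_order and (combo[-1] - combo[0] + 1) - len(combo) <= slop:
            return True
    return False


docs = ["You know for search", "You know search", "Know search for you"]
for doc in docs:
    print(doc, "->", ordered_near(doc, ["you", "search"], slop=2))
```

Under this model the first two documents from the question match and the third does not, because "search" comes before "you" in it.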

Phrase matching doesn't ensure order ;-). If you specify enough slop (2, for example), "hello world" will match "world hello". This is not necessarily a bad thing, because searches are usually more relevant if two terms are close to each other, whatever their order. And I don't think the authors of this feature had in mind matching words that are 1000 positions apart.

There is a solution I could find that keeps the order, though it is not simple: using scripts. Here's one example:
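Why a slop of 2 is enough to swap two words can be seen with a small model. This is a rough Python sketch, assuming the usual description of sloppy phrase matching: each term's position in the document is compared with its offset in the query, and the spread of those differences must not exceed the slop. The function name `sloppy_phrase_match` is hypothetical and the whitespace tokenizer only approximates the standard analyzer.

```python
from itertools import product


def sloppy_phrase_match(text, phrase, slop):
    """Simplified model of a phrase query with slop (not Lucene's real code)."""
    tokens = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False
    position_lists = [
        [i for i, token in enumerate(tokens) if token == term] for term in terms
    ]
    for combo in product(*position_lists):
        if len(set(combo)) != len(combo):
            continue  # a repeated query term needs its own position
        # How far each term sits from where an exact phrase would place it.
        shifts = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


print(sloppy_phrase_match("hello world", "hello world", 0))  # True
print(sloppy_phrase_match("world hello", "hello world", 1))  # False
print(sloppy_phrase_match("world hello", "hello world", 2))  # True
```

In the reversed document "hello" is one position too late and "world" one position too early, so the total spread is 2.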

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "hello world" }
{ "index": { "_id": 2 }}
{ "title": "world hello" }
{ "index": { "_id": 3 }}
{ "title": "hello term1 term2 term3 term4 world" }

POST my_index/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": {
            "query": "hello world",
            "slop": 5,
            "type": "phrase"
          }
        }
      },
      "filter": {
        "script": {
          "script": "term1Pos=0;term2Pos=0;term1Info = _index['title'].get('hello',_POSITIONS);term2Info = _index['title'].get('world',_POSITIONS); for(pos in term1Info){term1Pos=pos.position;}; for(pos in term2Info){term2Pos=pos.position;}; return term1Pos<term2Pos;",
          "params": {}
        }
      }
    }
  }
}

To make the script itself more readable, I am rewriting it here with indentation:

term1Pos = 0;
term2Pos = 0;
term1Info = _index['title'].get('hello',_POSITIONS);
term2Info = _index['title'].get('world',_POSITIONS);
for(pos in term1Info) {
  term1Pos = pos.position;
}; 
for(pos in term2Info) {
  term2Pos = pos.position;
}; 
return term1Pos < term2Pos;

Above is a query that searches for "hello world" with a slop of 5, which matches all of the docs above. The scripted filter, however, ensures that the position of the word "hello" in the document is lower than the position of the word "world". This way, no matter how much slop we set in the query, the check on the positions ensures the order.

This is the section in the documentation that sheds some light on the things used in the script above.
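The position check done by the script can be reproduced outside Elasticsearch to see which of the three titles pass. This is only a rough Python sketch of the same logic: `last_position` and `hello_before_world` are hypothetical names, and splitting on whitespace stands in for the analyzer that produced the indexed positions.

```python
def last_position(tokens, term):
    """Mirror the script's loop: start at 0 and keep overwriting, so the
    value left at the end is the position of the last occurrence."""
    position = 0
    for index, token in enumerate(tokens):
        if token == term:
            position = index
    return position


def hello_before_world(title):
    """True when the script filter would keep the document."""
    tokens = title.lower().split()
    return last_position(tokens, "hello") < last_position(tokens, "world")


titles = [
    "hello world",
    "world hello",
    "hello term1 term2 term3 term4 world",
]
for title in titles:
    print(title, "->", hello_before_world(title))
```

Only the second title is rejected. Because each loop keeps the last position it sees, a title such as "hello world hello" would also be rejected under this logic, which may or may not be what is wanted when a term repeats.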

This is exactly what a match_phrase query does.

It checks the positions of the terms, on top of their presence.

For example, these documents:

POST test/values
{
  "test": "Hello World"
}

POST test/values
{
  "test": "Hello nice World"
}

POST test/values
{
  "test": "World, I don't say hello"
}

will all be found with the basic match query:

POST test/_search
{
  "query": {
    "match": {
      "test": "Hello World"
    }
  }
}

But using match_phrase, only the first document will be returned:

POST test/_search
{
  "query": {
    "match_phrase": {
      "test": "Hello World"
    }
  }
}

{
   ...
   "hits": {
      "total": 1,
      "max_score": 2.3953633,
      "hits": [
         {
            "_index": "test",
            "_type": "values",
            "_id": "qFZAKYOTQh2AuqplLQdHcA",
            "_score": 2.3953633,
            "_source": {
               "test": "Hello World"
            }
         }
      ]
   }
}
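The difference between the two queries can be illustrated with a small model over the three test documents. This is a rough Python sketch: the regular expression is a crude stand-in for the standard analyzer, `match` models the default behaviour of accepting a document that contains any of the query terms, and `match_phrase` models an exact phrase with no slop. The function names simply echo the query names and are not an Elasticsearch client API.

```python
import re


def analyze(text):
    """Lowercase and split on non-word characters, keeping apostrophes."""
    return re.findall(r"[\w']+", text.lower())


def match(doc, query):
    """Presence only: any query term found anywhere in the document."""
    doc_terms = set(analyze(doc))
    return any(term in doc_terms for term in analyze(query))


def match_phrase(doc, query):
    """Presence and position: the terms must be adjacent and in order."""
    tokens, terms = analyze(doc), analyze(query)
    n = len(terms)
    return n > 0 and any(
        tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)
    )


docs = ["Hello World", "Hello nice World", "World, I don't say hello"]
print([match(d, "Hello World") for d in docs])         # [True, True, True]
print([match_phrase(d, "Hello World") for d in docs])  # [True, False, False]
```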

In your case, you want to allow some distance between your terms. This can be achieved with the slop parameter, which indicates how far apart you allow your terms to be:

POST test/_search
{
  "query": {
    "match": {
      "test": {
        "query": "Hello world",
        "slop": 1,
        "type": "phrase"
      }
    }
  }
}

With this last request, you find the second document too:

{
   ...
   "hits": {
      "total": 2,
      "max_score": 0.38356602,
      "hits": [
         {
            "_index": "test",
            "_type": "values",
            "_id": "7mhBJgm5QaO2_aXOrTB_BA",
            "_score": 0.38356602,
            "_source": {
               "test": "Hello World"
            }
         },
         {
            "_index": "test",
            "_type": "values",
            "_id": "VKdUJSZFQNCFrxKk_hWz4A",
            "_score": 0.2169777,
            "_source": {
               "test": "Hello nice World"
            }
         }
      ]
   }
}
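The request above uses the match query with "type": "phrase", which is the syntax of older Elasticsearch releases. As far as I know, later releases dropped that form, and the equivalent is to pass slop directly to match_phrase. A minimal Python sketch that builds such a body is shown below; `phrase_query` is a hypothetical helper name, and the field and text are the ones from the example.

```python
import json


def phrase_query(field, text, slop=0):
    """Build a match_phrase request body that tolerates `slop` positions."""
    return {
        "query": {
            "match_phrase": {
                field: {"query": text, "slop": slop}
            }
        }
    }


# Body to send to POST test/_search
print(json.dumps(phrase_query("test", "Hello world", slop=1), indent=2))
```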

You can find a whole chapter about this use case in the Definitive Guide.
