
Elasticsearch - query_string with wildcards

I have some text in Elasticsearch containing URLs in various formats (http://www, www.), and what I want to do is search for all texts containing, e.g., google.com.

For the current search I use something like this query:

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "cdate": {
                            "gt": dfrom,
                            "lte": dto
                        }
                    }
                },
                {
                    "query_string": {
                        "default_operator": "AND",
                        "default_field": "text",
                        "analyze_wildcard": "true",
                        "query": searchString
                    }
                }
            ]
        }
    }
}

But a query like google.com never returns any results, while searching for, e.g., the term "test" (without the quotes) works fine. I do want to use query_string because I'd like to use boolean operators, but I really need to be able to search for substrings, not only whole words.

Thank you!

It is true indeed that http://www.google.com will be tokenized by the standard analyzer into http and www.google.com, and thus google.com will not be found.
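You can check this with the _analyze API (assuming a node listening on localhost:9200); the call below returns only the two tokens http and www.google.com:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'http://www.google.com'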

So the standard analyzer alone will not help here; we need a token filter that will correctly transform URL tokens. Another way, if your text field only contained URLs, would have been to use the UAX URL email tokenizer, but since the field can contain any other text (i.e. user comments), it won't work.

Fortunately, there's a new plugin around called analysis-url which provides a URL token filter, and this is exactly what we need (after a small modification I begged for, thanks @jlinn ;-) )

First, you need to install the plugin:

bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.2.0/elasticsearch-analysis-url-2.2.0.zip
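The plugin is only picked up after restarting the node; once restarted, you can verify that it was installed:

bin/plugin list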

Then, we can start playing. We need to create the proper analyzer for your text field:

curl -XPUT localhost:9200/urls -d '{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true,
          "passthrough": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": [
            "url_host"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "url_host"
        }
      }
    }
  }
}'

With this analyzer and mapping, we can properly index the host you want to be able to search for. For instance, let's analyze the string blabla bla http://www.google.com blabla using our new analyzer:

curl -XGET 'localhost:9200/urls/_analyze?analyzer=url_host&pretty' -d 'blabla bla http://www.google.com blabla'

We'll get the following tokens:

{
  "tokens" : [ {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "www.google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 5
  } ]
}

As you can see, the http://www.google.com part will be tokenized into:

  • www.google.com
  • google.com, i.e. what you expected
  • com

So now, if your searchString is google.com, you'll be able to find all the documents which have a text field containing google.com (or www.google.com).
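To verify end to end, here is a minimal sketch (the document ID and the sample text are arbitrary):

# Index a sample document and make it visible to search:
curl -XPUT 'localhost:9200/urls/url/1' -d '{"text": "blabla bla http://www.google.com blabla"}'
curl -XPOST 'localhost:9200/urls/_refresh'

# The query_string search from the question (date range omitted) now matches:
curl -XGET 'localhost:9200/urls/_search?pretty' -d '{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "default_field": "text",
      "query": "google.com"
    }
  }
}'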

Full-text search is always about exact matches in the inverted index, unless you perform a wildcard search, which forces a traversal of the inverted index. Using a wildcard at the beginning of your queryString will lead to a full traversal of your index and is not recommended.
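For illustration, this is the kind of query that warning is about; it can match substrings, but only by scanning the terms of the whole index:

# Discouraged: a leading wildcard forces a full traversal of the term dictionary
curl -XGET 'localhost:9200/urls/_search?pretty' -d '{
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "*google.com*"
    }
  }
}'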

Consider indexing not just the URL but also the domain (by stripping off the protocol, the subdomain, and any information following the domain), applying the Keyword Tokenizer. Then you can search for domains against this field.
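A minimal sketch of that idea, assuming the domain is stripped out client-side before indexing (the index name domains and the field names are illustrative):

# The built-in keyword analyzer indexes the whole value as a single token:
curl -XPUT localhost:9200/domains -d '{
  "mappings": {
    "url": {
      "properties": {
        "text": { "type": "string" },
        "domain": { "type": "string", "analyzer": "keyword" }
      }
    }
  }
}'

# Store the pre-stripped domain alongside the original text:
curl -XPUT 'localhost:9200/domains/url/1' -d '{"text": "blabla bla http://www.google.com blabla", "domain": "google.com"}'

# Exact domain lookups then become simple term queries:
curl -XGET 'localhost:9200/domains/_search?pretty' -d '{
  "query": { "term": { "domain": "google.com" } }
}'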
