Elasticsearch - query_string with wildcards
I have some text in Elasticsearch containing URLs in various formats (http://www, www.). What I want to do is search for all texts containing, e.g., google.com.
For the current search I use something like this query:
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "cdate": {
                            "gt": dfrom,
                            "lte": dto
                        }
                    }
                },
                {
                    "query_string": {
                        "default_operator": "AND",
                        "default_field": "text",
                        "analyze_wildcard": "true",
                        "query": searchString
                    }
                }
            ]
        }
    }
}
But a query like google.com never returns any results, whereas searching for, e.g., the term "test" works fine (without the quotes). I do want to use query_string because I'd like to use boolean operators, but I really need to be able to search for substrings, not only for whole words.
Thank you!
It is indeed true that http://www.google.com will be tokenized by the standard analyzer into http and www.google.com, and thus google.com will not be found.
So the standard analyzer alone will not help here; we need a token filter that will correctly transform URL tokens. Another way, if your text field only contained URLs, would have been to use the UAX Email URL tokenizer, but since the field can contain any other text (such as user comments), it won't work.
Fortunately, there's a new plugin called analysis-url which provides a URL token filter, and this is exactly what we need (after a small modification I begged for, thanks @jlinn ;-) ).
First, you need to install the plugin:
bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.2.0/elasticsearch-analysis-url-2.2.0.zip
Then, we can start playing. We need to create the proper analyzer for your text field:
curl -XPUT localhost:9200/test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true,
          "passthrough": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": [
            "url_host"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "url_host"
        }
      }
    }
  }
}'
With this analyzer and mapping, we can properly index the host you want to be able to search for. For instance, let's analyze the string blabla bla http://www.google.com blabla using our new analyzer.
curl -XGET 'localhost:9200/test/_analyze?analyzer=url_host&pretty' -d 'blabla bla http://www.google.com blabla'
We'll get the following tokens:
{
"tokens" : [ {
"token" : "blabla",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
}, {
"token" : "bla",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
}, {
"token" : "www.google.com",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 2
}, {
"token" : "google.com",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 3
}, {
"token" : "com",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 4
}, {
"token" : "blabla",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 5
} ]
}
As you can see, the http://www.google.com part will be tokenized into:
www.google.com
google.com (i.e. what you expected)
com
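To make the expansion above easier to picture, here is a minimal Python sketch that reproduces the same three tokens from a URL. It is only an illustration of the idea, not the plugin's implementation; the helper name host_suffix_tokens is hypothetical.

```python
from urllib.parse import urlsplit


def host_suffix_tokens(url):
    """Return the host of `url` followed by each shorter dot-separated suffix."""
    host = urlsplit(url).hostname  # lower-cased host, or None if no host is present
    if not host:
        return []
    labels = host.split(".")
    # Drop one leading label at a time: www.google.com -> google.com -> com
    return [".".join(labels[i:]) for i in range(len(labels))]


print(host_suffix_tokens("http://www.google.com"))
# ['www.google.com', 'google.com', 'com']
```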
So now if your searchString is google.com, you'll be able to find all the documents which have a text field containing google.com (or www.google.com).
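For reference, the request body from the question can be assembled with a small helper like the one below. This is a sketch only: build_query is a hypothetical name, the date strings are made-up sample values, and it merely builds the JSON body without sending it to Elasticsearch.

```python
import json


def build_query(search_string, dfrom, dto):
    """Build the bool query from the question: a date range plus a query_string."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {"cdate": {"gt": dfrom, "lte": dto}}},
                    {
                        "query_string": {
                            "default_operator": "AND",
                            "default_field": "text",
                            "analyze_wildcard": True,
                            "query": search_string,
                        }
                    },
                ]
            }
        }
    }


# Sample values, assumed for illustration only.
body = build_query("google.com", "2016-01-01", "2016-02-01")
print(json.dumps(body, indent=2))
```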
Full-text search is always about exact matches in the inverted index, unless you perform a wildcard search, which forces a traversal of the inverted index. Using a wildcard at the beginning of your queryString will lead to a full traversal of your index and is not recommended.
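A toy model in Python may help show the difference. It assumes simple whitespace tokenization and an in-memory dictionary standing in for the inverted index, so it is far simpler than what Lucene does, but the contrast between one exact lookup and a scan over every term is the same.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def build_index(docs):
    """Map each whitespace-separated token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def exact_lookup(index, term):
    # One dictionary lookup: the term must equal an indexed token.
    return set(index.get(term, set()))


def wildcard_lookup(index, pattern):
    # A leading wildcard gives no prefix to narrow on, so every term is tested.
    hits = set()
    for term, doc_ids in index.items():
        if fnmatchcase(term, pattern):
            hits |= doc_ids
    return hits


docs = {1: "see www.google.com for details", 2: "nothing to see here"}
index = build_index(docs)
print(exact_lookup(index, "google.com"))      # set()
print(wildcard_lookup(index, "*google.com"))  # {1}
```

The exact lookup misses document 1 because the only indexed token is www.google.com, while the wildcard lookup finds it at the cost of visiting every term.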
Consider indexing not just the URL but also the domain (by stripping off the protocol, subdomain and any information following the domain), applying the Keyword Tokenizer. Then you can search for domains against this field.
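One way to derive such a domain value before indexing is sketched below in Python. It assumes the domain is simply the last two labels of the host, which is wrong for suffixes like co.uk (a public suffix list would be needed there); naive_domain is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def naive_domain(url):
    """Strip protocol, subdomains and path; keep the last two host labels."""
    # urlsplit only recognises a host after "//", so add it for bare "www." URLs.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


print(naive_domain("http://www.google.com/search?q=es"))  # google.com
print(naive_domain("www.google.com"))                     # google.com
```

The resulting string could then be stored in a separate field analyzed with the Keyword Tokenizer, so that a search for google.com is an exact match.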