Elasticsearch wildcard search on not_analyzed field

I have an index with the following settings and mapping:

{
  "settings":{
     "index":{
        "analysis":{
           "analyzer":{
              "analyzer_keyword":{
                 "tokenizer":"keyword",
                 "filter":"lowercase"
              }
           }
        }
     }
  },
  "mappings":{
     "product":{
        "properties":{
           "name":{
              "analyzer":"analyzer_keyword",
              "type":"string",
              "index": "not_analyzed"
           }
        }
     }
  }
}

I am struggling to implement wildcard search on the name field. My example data looks like this:

[
{"name": "SVF-123"},
{"name": "SVF-234"}
]

When I run the following query:

curl -XPOST "http://localhost:9200/my_index/product/_search" -d '
{
    "query": {
        "filtered" : {
            "query" : {
                "query_string" : {
                    "query": "*SVF-1*"
                }
            }
        }

    }
}'

It returns both SVF-123 and SVF-234. I think it is still tokenizing the data. It should return only SVF-123.

Could you please help with this?

Thanks in advance.

There are a couple of things going wrong here.

First, you are saying that you don't want terms analyzed at index time. Then, there's an analyzer configured (used at search time) that generates incompatible terms (they are lowercased).

By default, all terms end up in the _all field, which uses the standard analyzer. That is where you end up searching. Since it tokenizes on "-", you end up with an OR of "*SVF" and "1*".
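
To make the OR behaviour concrete, here is a rough Python sketch. It does not call Elasticsearch; `standard_tokens` is only an approximation of the standard analyzer, and splitting the query on "-" is an assumption taken from the description above.

```python
import re
from fnmatch import fnmatchcase

def standard_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_any(query, value):
    """OR semantics: the value matches if any query part matches any indexed token."""
    # Assumption: the query is split on "-" and lowercased, wildcards kept.
    patterns = [part.lower() for part in query.split("-")]
    tokens = standard_tokens(value)
    return any(fnmatchcase(tok, pat) for pat in patterns for tok in tokens)

docs = ["SVF-123", "SVF-234"]
hits = [d for d in docs if matches_any("*SVF-1*", d)]
print(hits)  # both documents: "*svf" matches the token "svf" in each
```

The "1*" part only matches SVF-123, but with OR that is not required, so SVF-234 comes back as well.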

Try a terms facet on _all and on name to see what's going on.

Here's a runnable Play and gist: https://www.found.no/play/gist/3e5fcb1b4c41cfc20226 (https://gist.github.com/alexbrasetvik/3e5fcb1b4c41cfc20226)

You need to make sure the terms you index are compatible with what you search for. You probably want to disable _all, since it can muddy what's going on.

#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes

curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
    "settings": {
        "analysis": {
            "text": [
                "SVF-123",
                "SVF-234"
            ],
            "analyzer": {
                "analyzer_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "type": {
            "properties": {
                "name": {
                    "type": "string",
                    "index": "not_analyzed",
                    "analyzer": "analyzer_keyword"
                }
            }
        }
    }
}'


# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"name":"SVF-123"}
{"index":{"_index":"play","_type":"type"}}
{"name":"SVF-234"}
'

# Do searches

# See all the generated terms.
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "facets": {
        "name": {
            "terms": {
                "field": "name"
            }
        },
        "_all": {
            "terms": {
                "field": "_all"
            }
        }
    }
}
'

# Analyzed, so no match
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "match": {
            "name": {
                "query": "SVF-123"
            }
        }
    }
}
'

# Not analyzed according to `analyzer_keyword`, so matches. (Note: term, not match)
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "term": {
            "name": {
                "value": "SVF-123"
            }
        }
    }
}
'


# _all goes through the standard analyzer, so the lowercased term "svf" matches.
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "query": {
        "term": {
            "_all": {
                "value": "svf"
            }
        }
    }
}
'

My solution adventure

I started out as you can see in my question. Whenever I changed one part of my settings, one thing started to work but another stopped working. Here is the history of my solution:

1.) I indexed my data with the defaults. This means my data was analyzed by default, which caused a problem on my side. For example:

When a user searches for a keyword like SVF-1, the system runs this query:

{
    "query": {
        "filtered" : {
            "query" : {
                "query_string" : {
                    "analyze_wildcard": true,
                    "query": "*SVF-1*"
                }
            }
        }

    }
}

and the results are:

SVF-123
SVF-234

This is expected, because the name field of my documents is analyzed. The query is split into the tokens SVF and 1; SVF matches both of my documents, although 1 does not. I abandoned this approach and created a mapping that makes my fields not_analyzed:

{
  "mappings":{
     "product":{
        "properties":{
           "name":{
              "type":"string",
              "index": "not_analyzed"
           },
           "site":{
              "type":"string",
              "index": "not_analyzed"
           } 
        }
     }
  }
}

but my problem continued.

2.) After a lot of research I wanted to try another way, and decided to use a wildcard query. My query is:

{
    "query": {
        "wildcard": {
            "name": {
                "value": "*SVF-1*"
            }
        }
    },
    "filter": {
        "term": {"site": "pro_en_GB"}
    }
}

This query worked, but there is one problem. My fields are not_analyzed now, and I am running a wildcard query, so case sensitivity becomes an issue. If I search for svf-1, it returns nothing, and users may well type the query in lowercase.
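
The case-sensitivity problem can be illustrated without Elasticsearch. The sketch below uses Python's `fnmatchcase` as a stand-in for wildcard matching against terms that were indexed verbatim; the `wildcard` helper is hypothetical and only mimics the idea.

```python
from fnmatch import fnmatchcase

# not_analyzed: each value is indexed exactly as given, as a single term.
indexed_terms = ["SVF-123", "SVF-234"]

def wildcard(pattern, terms):
    """Case-sensitive wildcard match against whole terms (hypothetical helper)."""
    return [t for t in terms if fnmatchcase(t, pattern)]

upper_hits = wildcard("*SVF-1*", indexed_terms)  # pattern has the same case as the terms
lower_hits = wildcard("*svf-1*", indexed_terms)  # lowercase user input

print(upper_hits)
print(lower_hits)
```

The uppercase pattern finds SVF-123 only, while the lowercase pattern finds nothing, which is the behaviour described above.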

3.) I changed my document structure to:

{
  "mappings":{
     "product":{
        "properties":{
           "name":{
              "type":"string",
              "index": "not_analyzed"
           },
           "nameLowerCase":{
              "type":"string",
              "index": "not_analyzed"
           },
           "site":{
              "type":"string",
              "index": "not_analyzed"
           } 
        }
     }
  }
}

I added one more field for name, called nameLowerCase. When indexing a document, I set it up like this:

{
    "name": "SVF-123",
    "nameLowerCase": "svf-123",
    "site": "pro_en_GB"
}

Here, I convert the query keyword to lowercase and search on the new nameLowerCase field, while displaying the name field.

The final version of my query is:

{
    "query": {
        "wildcard": {
            "nameLowerCase": {
                "value": "*svf-1*"
            }
        }
    },
    "filter": {
        "term": {"site": "pro_en_GB"}
    }
}

Now it works. Another way to solve this problem is to use multi_field. My query contains a dash (-), and I ran into some problems with it.
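
For the application side of step 3, something like the following Python sketch could prepare the document and the search body. The helper names `to_document` and `build_search_body` are hypothetical; only the field names and the body layout come from the mapping and query above.

```python
def to_document(name, site):
    """Build the document to index, with a lowercased copy of name (hypothetical helper)."""
    return {"name": name, "nameLowerCase": name.lower(), "site": site}

def build_search_body(keyword, site):
    """Lowercase the user's keyword and wrap it in wildcards (hypothetical helper)."""
    return {
        "query": {
            "wildcard": {"nameLowerCase": {"value": "*" + keyword.lower() + "*"}}
        },
        "filter": {"term": {"site": site}},
    }

doc = to_document("SVF-123", "pro_en_GB")
body = build_search_body("SVF-1", "pro_en_GB")
print(doc)
print(body)
```

Note that any `*` or `?` typed by the user would still act as a wildcard in this sketch, since the keyword is not escaped.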

Many thanks to @Alex Brasetvik for his detailed explanation and effort.

Adding to Hüseyin's answer, we can use AND as the default operator. SVF and 1* will then be joined with the AND operator, which gives us the correct results.

"query": {
    "filtered" : {
        "query" : {
            "query_string" : {
                "default_operator": "AND",
                "analyze_wildcard": true,
                "query": "*SVF-1*"
            }
        }
    }
}

@Viduranga Wijesooriya, as you stated, "default_operator": "AND" checks for the presence of both SVF and 1, but an exact match is still not possible with it alone. It does filter the results more appropriately, leaving every combination of SVF and 1 and sorting the results by relevance, which promotes SVF-1 up the order.
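
A small Python sketch of why AND alone is not an exact match. It assumes the query breaks into the two parts "*svf" and "1*" as described earlier in the thread, approximates the standard analyzer with a regex, and adds a hypothetical third document "123-SVF" that is not in the original data.

```python
import re
from fnmatch import fnmatchcase

def standard_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all(patterns, value):
    """AND semantics: every query part must match at least one indexed token."""
    tokens = standard_tokens(value)
    return all(any(fnmatchcase(t, p) for t in tokens) for p in patterns)

patterns = ["*svf", "1*"]                 # assumed split of "*SVF-1*"
docs = ["SVF-123", "SVF-234", "123-SVF"]  # "123-SVF" is hypothetical
and_hits = [d for d in docs if matches_all(patterns, d)]
print(and_hits)
```

SVF-234 is now excluded, but the reversed "123-SVF" still matches, because AND only requires both parts to be present in any order.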

To pull out the exact result:

"settings": {
        "analysis": {
            "analyzer": {
                "analyzer_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "type": {
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "analyzer_keyword"
                }
            }
        }
    }

and the query is:

{
    "query": {
        "bool": {
            "must": [
               {
                    "query_string" : {
                        "fields": ["name"],
                        "query" : "*svf-1*",
                        "analyze_wildcard": true
                    }
               }
            ]
        }
    }
}

Result:

{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "play",
            "_type": "type",
            "_id": "AVfXzn3oIKphDu1OoMtF",
            "_score": 1,
            "_source": {
               "name": "SVF-123"
            }
         }
      ]
   }
}
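
To see why this setup returns a single hit, here is a simplified Python model of the analyzer. The function `analyzer_keyword` only imitates the configured keyword tokenizer plus lowercase filter; it is a sketch, not the Elasticsearch implementation.

```python
from fnmatch import fnmatchcase

def analyzer_keyword(text):
    """Keyword tokenizer keeps the whole input as one token; the filter lowercases it."""
    return [text.lower()]

docs = ["SVF-123", "SVF-234"]
pattern = "*svf-1*"

# Each document has exactly one term, so the dash stays inside the term.
hits = [d for d in docs if any(fnmatchcase(tok, pattern) for tok in analyzer_keyword(d))]
print(hits)
```

Because the value is never split on the dash, the wildcard is matched against "svf-123" and "svf-234" as whole terms, and only the first contains "svf-1".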
