
How to apply search in Elasticsearch in python with and without whitespaces

I am building a search system using ElasticSearch in Python. I loaded a CSV and created an index for search.

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch()

with open('/Users/anubhav/Office/elasticsearch-5.6.0/all_products.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='product-index', doc_type='product-index')

es.indices.create(
    index='product-index',
    body={
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 10,
              "token_chars": [
                "letter",
                "digit",
                "whitespace"
              ]
            }
          }
        }
      }
    },
    # Will ignore 400 errors, remove to ensure you're prompted
    ignore=400
)
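Note that the settings above only define `my_analyzer`; they do not apply it to any field. For the custom tokenizer to take effect, the analyzer would also need to be referenced in the index mapping. A hedged sketch of what that body could look like on Elasticsearch 5.6 (assuming the field is named `product`, as in the queries in this question, and the `product-index` doc type from the code above):

```json
{
  "mappings": {
    "product-index": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
```

The mapping would also need to be in place before the documents are bulk-loaded, since analyzers run at index time.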

response = es.search(
index='product-index',
body={
    "query": {
        "match": {
            "product": "PD5MP2 price"
        }
    },
    "aggs": {
        "top_10_states": {
            "terms": {
                "field": "state",
                "size": 10
            }
        }
    }
}
)

print(response)

The CSV looks something like: [CSV sample image]

When I search using:

res = es.search(index="product-index", doc_type="product-index", body={"query": {"match": {"product": "DD 350"}}})

this works fine because the exact product is there in the CSV. But when I change the query to

res = es.search(index="product-index", doc_type="product-index", body={"query": {"match": {"product": "DD350"}}})

it doesn't work. Can someone please help me with this?

By default, Elasticsearch uses the Standard Tokenizer, which divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. That means DD 350 will be tokenized as DD, 350. When you search, the query text is tokenized with the same tokenizer. So when you search for DD 350, Elasticsearch looks for both DD and 350, and both are in your index. But when you search for DD350, Elasticsearch won't find that single term in the index. You should check the tokenizers and their usage.
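The behavior described above can be illustrated with a rough Python approximation of the standard analyzer (this is only a sketch for intuition; the real analyzer implements full Unicode Text Segmentation, not a simple regex):

```python
import re

def standard_like_tokenize(text):
    # Rough approximation of Elasticsearch's standard analyzer:
    # split on runs of letters/digits and lowercase each term.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

# The indexed value "DD 350" produces two terms:
print(standard_like_tokenize("DD 350"))  # ['dd', '350']

# The query "DD350" produces a single term that matches neither of them:
print(standard_like_tokenize("DD350"))   # ['dd350']
```

Since a `match` query only matches documents that contain at least one of the query's terms, `dd350` finds nothing even though `dd` and `350` are both indexed.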
