How to apply search in Elasticsearch in Python with and without whitespace
I am building a search system using Elasticsearch in Python. I loaded a CSV and created an index for search.
from elasticsearch import helpers, Elasticsearch
import csv
es = Elasticsearch()
with open('/Users/anubhav/Office/elasticsearch-5.6.0/all_products.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='product-index', doc_type='product-index')
es.indices.create(
    index='product-index',
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "my_tokenizer"
                    }
                },
                "tokenizer": {
                    "my_tokenizer": {
                        "type": "edge_ngram",
                        "min_gram": 2,
                        "max_gram": 10,
                        "token_chars": [
                            "letter",
                            "digit",
                            "whitespace"
                        ]
                    }
                }
            }
        }
    },
    # Will ignore 400 errors, remove to ensure you're prompted
    ignore=400
)
response = es.search(
    index='product-index',
    body={
        "query": {
            "match": {
                "product": "PD5MP2 price"
            }
        },
        "aggs": {
            "top_10_states": {
                "terms": {
                    "field": "state",
                    "size": 10
                }
            }
        }
    }
)
print(response)
The CSV looks something like: [sample not preserved]
When I search using:
res = es.search(index="product-index", doc_type="product-index", body={"query": {"match": {"product": "DD 350"}}})
this works fine, because the exact product is in the CSV. But when I change the query to
res = es.search(index="product-index", doc_type="product-index", body={"query": {"match": {"product": "DD350"}}})
it doesn't work. Can someone please help me with this?
By default Elasticsearch uses the Standard Tokenizer, which divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. That means DD 350 will be tokenized as DD, 350. When you search, the search keywords are tokenized by the same tokenizer. So when you search for DD 350, Elasticsearch looks for both DD and 350, and both are in your index. But when you search for DD350, Elasticsearch looks for the single term DD350, which is not in the index, so nothing matches. You should check the available tokenizers and their usage.
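To make the mismatch concrete, here is a minimal sketch that imitates the tokenization locally, without an Elasticsearch server. The helpers `approx_standard_analyze` and `match_any` are hypothetical names introduced for illustration, and the regex is only a rough approximation of the standard analyzer (split on non-alphanumeric characters, then lowercase), not the real Unicode Text Segmentation rules.

```python
import re


def approx_standard_analyze(text):
    """Rough stand-in for the standard analyzer (an assumption, not the
    real implementation): split on non-alphanumerics and lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def match_any(indexed_text, query_text):
    """Imitate a match query: a hit if any query term is among the
    terms produced for the indexed text."""
    indexed_terms = set(approx_standard_analyze(indexed_text))
    return any(t in indexed_terms for t in approx_standard_analyze(query_text))


print(approx_standard_analyze("DD 350"))  # ['dd', '350']
print(approx_standard_analyze("DD350"))   # ['dd350']
print(match_any("DD 350", "DD 350"))      # True
print(match_any("DD 350", "DD350"))       # False
```

On a running cluster, the `_analyze` API can be used to inspect the terms that a given analyzer actually produces for a piece of text.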
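One possible direction, offered as a sketch rather than a definitive fix, is a custom analyzer that removes whitespace before tokenizing, so that "DD 350" and "DD350" produce the same term. The names `no_space_analyzer`, `strip_whitespace`, `simulate_no_space_analyzer`, and `create_index` are hypothetical; the mapping layout assumes the Elasticsearch 5.x style with a document type, as used in the question. Because the keyword tokenizer keeps the whole value as one term, this trades away partial-word matching.

```python
import re

# Index body with a whitespace-stripping analyzer applied to "product".
# Analyzer and char filter names are made up for this example.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "char_filter": {
                "strip_whitespace": {
                    "type": "pattern_replace",
                    "pattern": "\\s+",
                    "replacement": "",
                }
            },
            "analyzer": {
                "no_space_analyzer": {
                    "type": "custom",
                    "char_filter": ["strip_whitespace"],
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "product-index": {
            "properties": {
                "product": {"type": "text", "analyzer": "no_space_analyzer"}
            }
        }
    },
}


def simulate_no_space_analyzer(text):
    """Local approximation of the analyzer above: drop whitespace,
    keep the rest as a single lowercase term."""
    term = re.sub(r"\s+", "", text).lower()
    return [term] if term else []


def create_index(es, index_name="product-index"):
    """Create the index from INDEX_BODY; `es` is an Elasticsearch client.
    Call this before bulk-loading any documents."""
    return es.indices.create(index=index_name, body=INDEX_BODY)


print(simulate_no_space_analyzer("DD 350"))  # ['dd350']
print(simulate_no_space_analyzer("DD350"))   # ['dd350']
```

Analysis settings only take effect if the index is created with them before documents are loaded. In the question's code the bulk load runs first, which auto-creates the index with defaults, and the later `create` call fails with a 400 error that `ignore=400` hides. A custom analyzer also has to be referenced from a field mapping to be used for that field.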