I have millions of documents to index. Each document has the fields doc_id, doc_title, and several doc_content fields.
import requests
index = 'test'
JSON = {
    "mappings": {
        "properties": {
            "doc_id": {"type": "keyword"},
            "doc_title": {"type": "text"},
            "doc_content": {"type": "text"}
        }
    }
}
r = requests.put(f'http://127.0.0.1:9200/{index}', json=JSON)
To minimize the size of the index, I keep doc_title and doc_content in separate documents.
docs = [
    {"doc_id": 1, "doc_title": "good"},
    {"doc_id": 1, "doc_content": "a"},
    {"doc_id": 1, "doc_content": "b"},
    {"doc_id": 2, "doc_title": "good"},
    {"doc_id": 2, "doc_content": "c"},
    {"doc_id": 2, "doc_content": "d"},
    {"doc_id": 3, "doc_title": "bad"},
    {"doc_id": 3, "doc_content": "a"},
    {"doc_id": 3, "doc_content": "e"}
]
for doc in docs:
    r = requests.post(f'http://127.0.0.1:9200/{index}/_doc', json=doc)
query_1:
JSON = {
    "query": {
        "match": {
            "doc_title": "good"
        }
    }
}
r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)
[x['_source'] for x in r.json()['hits']['hits']]
[{'doc_id': 1, 'doc_title': 'good'}, {'doc_id': 2, 'doc_title': 'good'}]
query_2:
JSON = {
    "query": {
        "match": {
            "doc_content": "a"
        }
    }
}
r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)
[x['_source'] for x in r.json()['hits']['hits']]
[{'doc_id': 1, 'doc_content': 'a'}, {'doc_id': 3, 'doc_content': 'a'}]
How to combine query_1 and query_2?
I need something like this:
JSON = {
    "query": {
        "bool": {
            "must": [
                {"match": {"doc_title": "good"}},
                {"match": {"doc_content": "a"}}
            ]
        }
    }
}
r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)
[x['_source'] for x in r.json()['hits']['hits']]
[]
Desired result:
[{'doc_id': 1, 'doc_title': 'good', 'doc_content': 'a'}]
It's bad practice to separate doc_title and doc_content; you're not really minimizing anything.
Go with this:
docs = [
    {"doc_id": 1, "doc_title": "good", "doc_content": ["a", "b"]},
    {"doc_id": 2, "doc_title": "good", "doc_content": ["c", "d"]},
    {"doc_id": 3, "doc_title": "bad", "doc_content": ["a", "e"]}
]
for doc in docs:
    r = requests.post(f'http://127.0.0.1:9200/{index}/_doc', json=doc)
and your bool query will work just as expected, because each document now holds both the title and its contents. a and b are supposed to be shared by doc_id=1 anyway, aren't they?
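If the data already exists in the split form from the question, it can be regrouped before indexing. The following is only a minimal sketch, assuming every record carries a doc_id plus either a doc_title or a doc_content; merge_by_doc_id is a hypothetical helper name, not part of Elasticsearch or of requests.

```python
def merge_by_doc_id(records):
    """Combine split title/content records into one document per doc_id."""
    merged = {}  # plain dicts keep insertion order (Python 3.7+)
    for rec in records:
        # Create the combined document the first time a doc_id is seen.
        doc = merged.setdefault(
            rec["doc_id"], {"doc_id": rec["doc_id"], "doc_content": []}
        )
        if "doc_title" in rec:
            doc["doc_title"] = rec["doc_title"]
        if "doc_content" in rec:
            doc["doc_content"].append(rec["doc_content"])
    return list(merged.values())


# The split records from the question.
split_docs = [
    {"doc_id": 1, "doc_title": "good"},
    {"doc_id": 1, "doc_content": "a"},
    {"doc_id": 1, "doc_content": "b"},
    {"doc_id": 2, "doc_title": "good"},
    {"doc_id": 2, "doc_content": "c"},
    {"doc_id": 2, "doc_content": "d"},
    {"doc_id": 3, "doc_title": "bad"},
    {"doc_id": 3, "doc_content": "a"},
    {"doc_id": 3, "doc_content": "e"},
]

merged_docs = merge_by_doc_id(split_docs)
for doc in merged_docs:
    print(doc)
```

Each element of merged_docs has the combined shape shown above and can be posted to the index in the same loop.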
UPDATE: make the contents syntactically nested
PUT test
{
  "mappings": {
    "properties": {
      "contents": {
        "type": "nested",
        "properties": {
          "doc_content": {
            "type": "text"
          }
        }
      },
      "doc_id": {
        "type": "keyword"
      },
      "doc_title": {
        "type": "text"
      }
    }
  }
}
POST test/_doc
{
  "doc_id": 1,
  "doc_title": "good",
  "contents": [
    {"doc_content": "a"},
    {"doc_content": "b"}
  ]
}
GET test/_search
{
  "_source": ["doc_title", "inner_hits"],
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "doc_title": "good"
          }
        },
        {
          "nested": {
            "path": "contents",
            "query": {
              "match": {
                "contents.doc_content": "a"
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}
yielding
[
  {
    "_index":"test",
    "_type":"_doc",
    "_id":"sySOoXEBdiyDG0RsIq21",
    "_score":0.98082924,
    "_source":{
      "doc_title":"good"          <------
    },
    "inner_hits":{
      "contents":{
        "hits":{
          "total":1,
          "max_score":0.6931472,
          "hits":[
            {
              "_index":"test",
              "_type":"_doc",
              "_id":"sySOoXEBdiyDG0RsIq21",
              "_nested":{
                "field":"contents",
                "offset":0
              },
              "_score":0.6931472,
              "_source":{
                "doc_content":"a"          <-----
              }
            }
          ]
        }
      }
    }
  }
]
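Since the question uses Python, the nested search and the handling of its response could look roughly like this. It is a sketch under assumptions: build_nested_query and summarize_hits are hypothetical helper names, sample_hits is a trimmed copy of the response above rather than live output, and only doc_title is listed under _source on the assumption that inner hits come back next to _source, as the response above suggests.

```python
def build_nested_query(title, content, path="contents"):
    """Build the search body: a title match plus a nested content match."""
    return {
        "_source": ["doc_title"],
        "query": {
            "bool": {
                "must": [
                    {"match": {"doc_title": title}},
                    {
                        "nested": {
                            "path": path,
                            "query": {"match": {f"{path}.doc_content": content}},
                            "inner_hits": {},
                        }
                    },
                ]
            }
        },
    }


def summarize_hits(hits, path="contents"):
    """Pair each hit's title with the nested contents that matched."""
    summary = []
    for hit in hits:
        # Walk down to the list of inner hits, tolerating missing keys.
        inner = hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
        summary.append(
            {
                "doc_title": hit["_source"].get("doc_title"),
                "doc_content": [h["_source"]["doc_content"] for h in inner],
            }
        )
    return summary


# Trimmed copy of the response shown above (only the keys the helper reads).
sample_hits = [
    {
        "_source": {"doc_title": "good"},
        "inner_hits": {
            "contents": {"hits": {"hits": [{"_source": {"doc_content": "a"}}]}}
        },
    }
]

body = build_nested_query("good", "a")
print(summarize_hits(sample_hits))  # [{'doc_title': 'good', 'doc_content': ['a']}]
```

The body can be passed as json=body to requests.get, as in the question, and the list to summarize is then r.json()['hits']['hits'].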