How to combine multiple queries?

I have millions of documents to index. Each document has a doc_id field, a doc_title field, and several doc_content fields.

import requests

index = 'test'

JSON = {
    "mappings": {
        "properties": {
            "doc_id":      {"type": "keyword"},
            "doc_title":   {"type": "text"   },
            "doc_content": {"type": "text"   }
        }
    }
}

r = requests.put(f'http://127.0.0.1:9200/{index}', json=JSON)

To minimize the size of the index, I index doc_title and each doc_content value as separate records that share the same doc_id.

docs = [
    {"doc_id": 1, "doc_title": "good"},
    {"doc_id": 1, "doc_content": "a"},
    {"doc_id": 1, "doc_content": "b"},

    {"doc_id": 2, "doc_title": "good"},
    {"doc_id": 2, "doc_content": "c"},
    {"doc_id": 2, "doc_content": "d"},

    {"doc_id": 3, "doc_title": "bad"},
    {"doc_id": 3, "doc_content": "a"},
    {"doc_id": 3, "doc_content": "e"}
]

for doc in docs:
    r = requests.post(f'http://127.0.0.1:9200/{index}/_doc', json=doc)

query_1:

JSON = {
    "query": {
        "match": {
            "doc_title": "good"
        }
    }
}

r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)

[x['_source'] for x in r.json()['hits']['hits']]

[{'doc_id': 1, 'doc_title': 'good'}, {'doc_id': 2, 'doc_title': 'good'}]

query_2:

JSON = {
    "query": {
        "match": {
            "doc_content": "a"
        }
    }
}

r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)

[x['_source'] for x in r.json()['hits']['hits']]

[{'doc_id': 1, 'doc_content': 'a'}, {'doc_id': 3, 'doc_content': 'a'}]

How to combine query_1 and query_2?

I need something like this:

JSON = {
    "query": {
        "bool": {
            "must": [
                {"match": {"doc_title": "good"}},
                {"match": {"doc_content": "a"}}
            ]
        }
    }
}

r = requests.get(f'http://127.0.0.1:9200/{index}/_search', json=JSON)

[x['_source'] for x in r.json()['hits']['hits']]

[]

Desired result:

[{'doc_id': 1, 'doc_title': 'good', 'doc_content': 'a'}]

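Separating doc_title and doc_content is bad practice, and it does not really minimize anything. Each match clause is evaluated against a single indexed document, and no document in your index holds both fields, so the bool query returns nothing.

Index one document per doc_id instead: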

docs = [
    {"doc_id": 1, "doc_title": "good", "doc_content": ["a", "b"]},
    {"doc_id": 2, "doc_title": "good", "doc_content": ["c", "d"]},
    {"doc_id": 3, "doc_title": "bad", "doc_content": ["a", "e"]}
]

for doc in docs:
    r = requests.post(f'http://127.0.0.1:9200/{index}/_doc', json=doc)

With this layout, your bool query works just as expected. The values a and b both belong to doc_id=1 anyway, don't they?
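If the split records already exist in that form, they can be folded into one document per doc_id before indexing. The following is only a minimal sketch under that assumption; `merge_by_doc_id` is a hypothetical helper name, not part of requests or Elasticsearch.

```python
def merge_by_doc_id(records):
    """Fold split title/content records into one document per doc_id."""
    merged = {}
    for rec in records:
        # Create the combined document the first time a doc_id is seen.
        doc = merged.setdefault(
            rec["doc_id"], {"doc_id": rec["doc_id"], "doc_content": []}
        )
        if "doc_title" in rec:
            doc["doc_title"] = rec["doc_title"]
        if "doc_content" in rec:
            doc["doc_content"].append(rec["doc_content"])
    # Dicts preserve insertion order, so documents come out in first-seen order.
    return list(merged.values())


split_docs = [
    {"doc_id": 1, "doc_title": "good"},
    {"doc_id": 1, "doc_content": "a"},
    {"doc_id": 1, "doc_content": "b"},
    {"doc_id": 2, "doc_title": "good"},
    {"doc_id": 2, "doc_content": "c"},
    {"doc_id": 2, "doc_content": "d"},
    {"doc_id": 3, "doc_title": "bad"},
    {"doc_id": 3, "doc_content": "a"},
    {"doc_id": 3, "doc_content": "e"},
]

combined = merge_by_doc_id(split_docs)
```

Each element of `combined` can then be posted to `/{index}/_doc` with the same loop as in the question.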


UPDATE: map the contents as a nested field and use inner_hits to return only the matching content

PUT test
{
  "mappings": {
    "properties": {
      "contents": {
        "type": "nested",
        "properties": {
          "doc_content": {
            "type": "text"
          }
        }
      },
      "doc_id": {
        "type": "keyword"
      },
      "doc_title": {
        "type": "text"
      }
    }
  }
}

POST test/_doc
{
  "doc_id": 1,
  "doc_title": "good",
  "contents": [
    {"doc_content": "a"},
    {"doc_content": "b"}
  ]
}

GET test/_search
{
  "_source": ["doc_title"],
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "doc_title": "good"
          }
        },
        {
          "nested": {
            "path": "contents",
            "query": {
              "match": {
                "contents.doc_content": "a"
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}
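For anyone sending this from Python as in the question, the same request body can be built as a plain dict. This is a sketch only; `build_nested_query` is a hypothetical helper name, and the field names assume the nested mapping above.

```python
def build_nested_query(title, content):
    """Build a bool query: match the title and at least one nested content."""
    return {
        # Only the title comes from the parent _source; the matching
        # content is returned separately through inner_hits.
        "_source": ["doc_title"],
        "query": {
            "bool": {
                "must": [
                    {"match": {"doc_title": title}},
                    {
                        "nested": {
                            "path": "contents",
                            "query": {
                                "match": {"contents.doc_content": content}
                            },
                            # An empty object asks for inner hits with defaults.
                            "inner_hits": {},
                        }
                    },
                ]
            }
        },
    }


body = build_nested_query("good", "a")
```

The resulting `body` can be passed as `json=body` to the same `requests.get(f'http://127.0.0.1:9200/{index}/_search', ...)` call used in the question.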

yielding

[
  {
    "_index":"test",
    "_type":"_doc",
    "_id":"sySOoXEBdiyDG0RsIq21",
    "_score":0.98082924,
    "_source":{
      "doc_title":"good"               <------
    },
    "inner_hits":{
      "contents":{
        "hits":{
          "total":1,
          "max_score":0.6931472,
          "hits":[
            {
              "_index":"test",
              "_type":"_doc",
              "_id":"sySOoXEBdiyDG0RsIq21",
              "_nested":{
                "field":"contents",
                "offset":0
              },
              "_score":0.6931472,
              "_source":{
                "doc_content":"a"          <-----
              }
            }
          ]
        }
      }
    }
  }
]
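To get from this response to the flat shape asked for in the question, the parent `_source` can be merged with each matching inner hit. This is a rough sketch, assuming the list above is what `r.json()['hits']['hits']` returns; `extract_matches` is a hypothetical helper name, and the sample data is a trimmed copy of the response shown above.

```python
def extract_matches(hits, path="contents"):
    """Flatten each parent hit and its matching nested hits into flat dicts."""
    results = []
    for hit in hits:
        inner = hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
        for nested_hit in inner:
            # Parent fields first, then the fields of the matching nested object.
            results.append({**hit["_source"], **nested_hit["_source"]})
    return results


# Trimmed copy of the response above (only the keys the helper reads).
sample_hits = [
    {
        "_source": {"doc_title": "good"},
        "inner_hits": {
            "contents": {
                "hits": {
                    "hits": [
                        {
                            "_nested": {"field": "contents", "offset": 0},
                            "_source": {"doc_content": "a"},
                        }
                    ]
                }
            }
        },
    }
]

flat = extract_matches(sample_hits)
print(flat)  # [{'doc_title': 'good', 'doc_content': 'a'}]
```

Adding "doc_id" to the `_source` filter of the search request would carry that field into each flattened result as well.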
