
Question about Elasticsearch schema and query

I am setting up an Elasticsearch cluster to search vectors associated with an ID.

For example,

Given this data:

Parent id / Object id / vectors
P1 / BD / 123, 125, 235 ... 10304, 50305 
P1 / DF / 125, 235, 240 ... 10305, 10306
P1 / ED / 123, 235, 350 ... 10010, 10344
... 
P2 / AB / 125, 535, 740 ... 9315, 10306
P2 / VC / 133, 435, 350 ... 3010, 20344
P2 / RF / 113, 353, 390 ... 10110, 30344
...
There are millions of parents,
hundreds of objects per parent,
and 1000 vectors per object.

So basically I want to

  1. index all of the vectors
  2. given an input parent P999, find the most similar parents in the cluster, ranked by their number of similar objects (similar objects: at least 50 vector matches)

Here's a sample result I expect

Input:
P999 / HH / xxx, xxx ...
P999 / YH / xxx, xxx ...
P999 / GJ / xxx, xxx ...
...
Output:
[result sorted desc] 
P20 has 60 similar objects
P4 has 45 similar objects
P501 has 41 similar objects
...

similar objects: at least 50 vector matches

To achieve this, I need

  1. Good schema
  2. A query that stores vectors
  3. A query that searches a list of similar objects in desc order

I need some help with all three.

  1. Schema
curl -XPUT url/vectors -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "object_id": {"type": "text"},
      "parent_id": {"type": "text"},
      "vectors": {"type": "text"}
    }
  }
}'
  2. Insert query
curl -XPOST url/vectors/_doc -H 'Content-Type: application/json' -d '{
  "parent_id": "P1",
  "object_id": "BD",
  "vectors": "123, 125, 235 ... 10304, 50305"
}'
  3. Search query
curl -XGET url/vectors/_search -H 'Content-Type: application/json' -d '{
  "size": 10000,
  "query": {
    "function_score": {
      "functions": [
        {
          ???
        }
      ],
      "query": {
        "bool": {
          "should": [
            { "terms": {"vectors": ["111"]} },
            { "terms": {"vectors": ["222"]} },
            ...
            { "terms": {"vectors": ["333"]} },
            { "terms": {"vectors": ["444"]} }
          ]
        }
      },
      "minimum_should_match": "50"
    }
  },
  "from": 0,
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}'

My questions are:

  1. In my schema mapping, is this the right way to store vectors?
  2. In my search query, I need help with the [???] part to get the expected results. I am not even sure I am on the right track. Could you correct my query if it is wrong?

Thanks

I doubt you can get the desired output using a pure Elasticsearch query.

What I would do is write a Python script that programmatically fills in the vectors being searched for. Depending on how large the response is going to be, you might need the scan/scroll API to return all the matches. Your final query would look something like this:

"query" : {
    "bool" : {
        "should" : [
            // THIS IS THE PART THAT YOU FILL IN PROGRAMMATICALLY USING THE VECTORS FROM THE PARENT YOU SPECIFIED
            {"match" : {"vectors" : "111"}},
            {"match" : {"vectors" : "222"}},
            {"match" : {"vectors" : "333"}},
            ...
            {"match" : {"vectors" : "444"}}
        ],
        "minimum_should_match": 50
    }
}
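
The `should` list above can be generated rather than typed by hand. Here is a minimal sketch, assuming the vectors live in a field named `vectors` (as in the question's mapping). `build_should_query` is a hypothetical helper name, not part of any Elasticsearch client; it only builds the request body and does not send it.

```python
def build_should_query(vector_ids, min_match=50, field="vectors"):
    """Build a bool/should query body with one match clause per vector id.

    Hypothetical helper: `field` and `min_match` are assumptions taken
    from the question (field "vectors", at least 50 matching vectors).
    """
    # Deduplicate while keeping order, so a repeated id yields one clause
    unique_ids = list(dict.fromkeys(str(v) for v in vector_ids))
    should = [{"match": {field: vid}} for vid in unique_ids]
    return {
        "query": {
            "bool": {
                "should": should,
                "minimum_should_match": min_match,
            }
        }
    }
```

You would call it once per object of the input parent, for example `build_should_query(["111", "222", "333"], min_match=2)`. Keep in mind that an object with 1000 vectors produces 1000 clauses, which may run into the cluster's limit on boolean clauses depending on your version and settings.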

Then, using Python, you would determine the number of matching vectors between P999 and each of the matches.
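
That counting step could be sketched as follows. This is only an illustration under several assumptions: `hits` is a list of dicts shaped like Elasticsearch search hits (each with a `_source` holding `parent_id`, `object_id`, and a comma-separated `vectors` string, as in the question), a stored object counts as "similar" if it shares at least `min_match` vectors with any one input object, and `parse_vectors` and `rank_parents` are hypothetical helper names.

```python
from collections import Counter


def parse_vectors(text):
    """Split a comma-separated string such as "123, 125, 235" into a set."""
    return {tok.strip() for tok in text.split(",") if tok.strip()}


def rank_parents(input_objects, hits, min_match=50):
    """Return (parent_id, similar_object_count) pairs, highest count first.

    input_objects: dict mapping object id -> vector string for the input parent.
    hits: list of Elasticsearch-style hits (assumed shape, see lead-in).
    """
    input_sets = [parse_vectors(v) for v in input_objects.values()]
    similar = Counter()
    seen = set()  # avoid counting the same stored object twice
    for hit in hits:
        src = hit["_source"]
        key = (src["parent_id"], src["object_id"])
        if key in seen:
            continue
        seen.add(key)
        stored = parse_vectors(src["vectors"])
        # Similar if it shares at least min_match vectors with any input object
        if any(len(stored & s) >= min_match for s in input_sets):
            similar[src["parent_id"]] += 1
    return similar.most_common()
```

The result is a list sorted in descending order, which corresponds to the "P20 has 60 similar objects" style of output the question asks for.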

Is there a reason you aren't using a graph database? These kinds of relationships would be a lot easier and faster to find with one.

If you have to use function score, I would add this to the query above.

What it should do is add a weight of 1 for every vector that matches, so that with "score_mode" set to "sum" and "boost_mode" set to "replace" the score is the number of matching vectors. However, I'm fairly certain that the query itself will already do a pretty good job of scoring the documents.

        "function_score": {
          "query": { "match_all": {} },
          "functions": [
              {
                  "filter": { "match": { "vectors": "111" } },
                  "weight": 1
              },
              {
                  "filter": { "match": { "vectors": "222" } },
                  "weight": 1
              },
              ...
          ],
          "score_mode": "sum",
          "boost_mode": "replace",
          "min_score": 0
        }
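
As with the `should` clauses, the `functions` list can be generated from the input vectors. Here is a rough sketch, not a tested production query. `build_function_score` is a hypothetical helper name, the field name `vectors` is taken from the question's mapping, and this version deliberately uses `"score_mode": "sum"` with `"boost_mode": "replace"` so that each document's score equals the number of matching vectors, with `min_score` dropping documents below the threshold.

```python
def build_function_score(vector_ids, min_match=50, field="vectors"):
    """Build a function_score body whose score is the count of matching vectors.

    Hypothetical helper; the parameter names are assumptions for illustration.
    """
    # One weight-1 function per distinct vector id
    unique_ids = list(dict.fromkeys(str(v) for v in vector_ids))
    functions = [
        {"filter": {"match": {field: vid}}, "weight": 1}
        for vid in unique_ids
    ]
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": functions,
                "score_mode": "sum",      # add up the weights of matching functions
                "boost_mode": "replace",  # ignore the inner query's own score
                "min_score": min_match,   # drop documents with too few matches
            }
        }
    }
```

Because the inner query is `match_all`, every document is scored against every function, so on an index of this size it is probably worth replacing `match_all` with the `bool`/`should` query to narrow the candidates first.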
