
Elasticsearch Aggregation with hamming distance of a phash

I'm trying to group similar documents together by matching keyword field values and the phashes of their related images. At the moment I have the following, which works well for exactly matching phashes:

    'duplicate_docs':
        A('terms',
          script={
              "lang": "painless",
              "inline": "def term = doc['make'] + '' + doc['model'] + doc['province'] + doc['mileage']; return term + '' + doc['image_hash'];"
          }),
}, {'dup_docs': A('top_hits', size=20)}):

However, some of the images are slightly different, and the whole point of a phash is that you can use the Hamming distance to figure out how different two images are.
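For reference, this is the comparison I mean, as a minimal sketch outside Elasticsearch. It assumes the phashes are stored as equal-length hex strings (the 64-bit values below are made up for illustration); `hamming_distance` is just an illustrative helper name.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two hex-encoded hashes of equal length."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Two made-up 64-bit phashes that differ only in the lowest bit.
a = "ffd8e0c0a0908880"
b = "ffd8e0c0a0908881"
print(hamming_distance(a, b))  # 1
```

Two images would then count as duplicates when the distance is at or below some small threshold, rather than only when the hashes are identical.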

I realise this probably makes the calculation vastly more expensive, as it essentially needs to compare every image against every other image, which seems excessive, but I'm unsure how else I could go about this. Thanks.

You may want to try this out:

Mu, C., Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019, October. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In International Conference on Similarity Search and Applications (pp. 49-56). Springer, Cham.

The FENSHSES method proposed in the above paper can efficiently find all r-neighbors in Hamming space without scanning all documents.
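To give a rough idea of why a full scan can be avoided, here is a small in-memory sketch of sub-code filtering, one of the techniques the paper builds on. This is not the paper's implementation and does not touch Elasticsearch; the 64-bit code size, the split into 4 sub-codes, and all function names are assumptions made for illustration. The idea is the pigeonhole principle: if two codes differ in at most r bits and each is split into 4 sub-codes, at least one pair of sub-codes differs in at most r // 4 bits, so an index lookup on sub-codes yields a small candidate set that is then checked exactly.

```python
from collections import defaultdict
from itertools import combinations

BITS = 64              # assumed phash size
PARTS = 4              # assumed number of sub-codes per hash
WIDTH = BITS // PARTS  # bits per sub-code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")


def sub_codes(code: int) -> list:
    """Split a BITS-bit code into PARTS sub-codes, lowest bits first."""
    mask = (1 << WIDTH) - 1
    return [(code >> (i * WIDTH)) & mask for i in range(PARTS)]


def build_index(codes):
    """Map (position, sub-code value) to the set of full codes containing it."""
    index = defaultdict(set)
    for code in codes:
        for pos, sub in enumerate(sub_codes(code)):
            index[(pos, sub)].add(code)
    return index


def variants(sub: int, max_flips: int):
    """Yield every WIDTH-bit value within max_flips bit flips of sub."""
    for k in range(max_flips + 1):
        for positions in combinations(range(WIDTH), k):
            flipped = sub
            for bit in positions:
                flipped ^= 1 << bit
            yield flipped


def r_neighbors(query: int, index, r: int) -> list:
    """Return all indexed codes within Hamming distance r of query."""
    per_part = r // PARTS  # pigeonhole bound for at least one sub-code
    candidates = set()
    for pos, sub in enumerate(sub_codes(query)):
        for value in variants(sub, per_part):
            candidates |= index.get((pos, value), set())
    # Exact check on the (usually small) candidate set only.
    return sorted(c for c in candidates if hamming(query, c) <= r)


# Made-up hashes: the first two are near-duplicates of the query.
codes = [0xFFD8E0C0A0908880, 0xFFD8E0C0A0908883, 0x0123456789ABCDEF]
index = build_index(codes)
print([hex(c) for c in r_neighbors(0xFFD8E0C0A0908881, index, r=3)])
```

In a search engine, the sub-code lookups would correspond to term matches on indexed fields and the exact check to a bit-count script on the candidates; how to set that up for your mapping is described in the paper.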
