
Find most relevant Vector from set of vectors

I have a set of vectors of shape (1000,) (they are vector representations of images). I need to find out which of them is most closely related to the others, i.e. the most relevant image to represent that entity. I have looked at many algorithms such as kNN, but I don't have any training data to compare these vectors with; I only have the vectors themselves. Can anyone tell me which algorithm(s) I should use to achieve this?

This depends completely on the kind of embedding, i.e. how those representations are computed. No algorithm will work without making an assumption about that!

You need some kind of metric that can rate the similarity of two vectors!
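To make "some kind of metric" concrete, here is a minimal sketch of two common choices, written so that a higher score always means "more similar". Which one (if either) is appropriate depends on how the embedding was trained. The function names are illustrative, not taken from the original answer, and the cosine version assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; higher means more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Assumes neither vector has zero norm (otherwise this divides by zero).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def negative_euclidean(a, b):
    """Negated Euclidean distance, so that higher still means more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return -float(np.linalg.norm(a - b))
```

Cosine similarity ignores vector length and only compares direction, while the Euclidean version is sensitive to both.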

Once you have this metric, the naive approach is a loop that compares against all candidates:

# linear search: find the vector most similar to my_vector
max_similarity = float("-inf")
max_vector = None
for vector in all_vectors:
    score = similarity(my_vector, vector)  # similarity() is your chosen metric
    if score > max_similarity:
        max_similarity = score
        max_vector = vector
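The loop above finds the best match for a single query vector. For the question as asked (which vector is closest to all the others), one possible reading is to run that comparison for every pair and keep the vector with the highest average similarity to the rest, sometimes called a medoid. The following is only a sketch under the assumption that cosine similarity suits the embedding; `most_representative` is a hypothetical helper name, not a library function.

```python
import numpy as np

def most_representative(vectors):
    """Return (index, mean_similarity) of the vector that is, on average,
    most similar to all the other vectors.

    Assumes cosine similarity fits the embedding, that there are at least
    two vectors, and that no vector is all zeros.
    """
    X = np.asarray(vectors, dtype=float)               # shape (n, d)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalise each row
    sims = X @ X.T                                     # (n, n) pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)                        # ignore self-similarity
    mean_sims = sims.sum(axis=1) / (len(X) - 1)        # average over the other n - 1 vectors
    best = int(np.argmax(mean_sims))
    return best, float(mean_sims[best])

# Tiny example: the middle vector points "between" the other two.
vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
index, score = most_representative(vectors)
```

This does O(n²) comparisons and builds an n × n matrix, which is fine for a modest number of images but would need batching for very large sets.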

For some metrics, the above can be sped up with metric trees and similar approaches (basically the internals of kNN algorithms), which try to prune candidates (not looking at all of them) using assumptions about the underlying metric, resulting in a potential speedup. These algorithms get slow in very high dimensions, but I'm not sure whether 1000 is already too many!

An example, if your assumption / similarity is based on the Euclidean metric (using sklearn's KDTree):

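import numpy as np
from sklearn.neighbors import KDTree

X = np.vstack(my_vectors)  # shape (n_vectors, 1000)
tree = KDTree(X)
# query() expects a 2D array of shape (n_queries, n_features)
# if my_vector is itself a row of X, use k=2 and skip the first hit (itself)
dist, ind = tree.query(my_vector.reshape(1, -1), k=1)  # get nearest neighbor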

One example where this would be a good approach is OpenFace, which is built entirely on the idea of mapping faces to Euclidean space (similar faces have a low Euclidean distance)! The underlying paper is FaceNet.

There is also BallTree, which supports more metrics! (The lists below come from the scikit-learn version used at the time of writing and vary between versions.)

from sklearn.neighbors import KDTree, BallTree

KDTree.valid_metrics
    ['cityblock', 'p', 'l2', 'chebyshev', 'l1', 'euclidean', 'minkowski', 
    'infinity', 'manhattan']

BallTree.valid_metrics
    ['braycurtis', 'cityblock', 'p', 'hamming', 'dice', 'l2', 'rogerstanimoto',
     'wminkowski', 'chebyshev', 'russellrao', 'sokalmichener', 'matching', 'l1',
     'haversine', 'pyfunc', 'kulsinski', 'seuclidean', 'mahalanobis', 'euclidean',
     'minkowski', 'sokalsneath', 'infinity', 'manhattan', 'jaccard', 'canberra']
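As a rough illustration of switching metrics, the sketch below builds a BallTree with the Manhattan metric instead of the Euclidean one. The random data is only a stand-in for real image vectors, and the choice of "manhattan" is arbitrary; whether it is meaningful depends on the embedding.

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))        # stand-in for 200 image vectors

tree = BallTree(X, metric="manhattan")  # one of the metrics listed above

# k=2 because each query point is itself in the tree: the first hit is the
# point itself (distance 0), the second is its nearest *other* vector.
dist, ind = tree.query(X[:5], k=2)
nearest_other = ind[:, 1]
```

Some of the listed metrics (for example mahalanobis or seuclidean) need extra parameters, so not every name can be dropped in unchanged.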

Again: the first sentence is the most important one here!
