Is there a way to find the n most distant vectors in an array?

I have an array of thousands of doc2vec vectors with 90 dimensions. For my current purposes I would like to find a way to "sample" the different regions of this vector space, to get a sense of the diversity of the corpus. For example, I would like to partition my space into n regions, and get the most relevant word vectors for each of these regions.

I've tried clustering with hdbscan (after reducing the dimensionality with UMAP) to carve the vector space at its natural joints, but it really doesn't work well.

So now I'm wondering whether there is a way to sample the "far out regions" of the space (n vectors that are most distant from each other).

  1. Would that be a good strategy?
  2. How could I do this?

Many thanks in advance!

Wouldn't a random sample from all the vectors naturally land in each of the various 'regions' of the set?

If the documents do have "natural joints" and clusters, some clustering algorithm should be able to find N clusters. The much smaller set of NxN distances between the cluster centroids could then identify the "furthest out" clusters.
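As a rough illustration of the centroid idea, here is a minimal sketch that assumes you already have one cluster label per vector from whatever clustering algorithm you choose. The function names (centroid_distances, rank_far_out_clusters) are made up for this example, and it uses plain NumPy with Euclidean distance, which may or may not be the right metric for your vectors. If your labels come from hdbscan, you would probably want to drop the noise points (label -1) first.

```python
import numpy as np

def centroid_distances(vectors, labels):
    """Return (cluster_ids, centroids, NxN matrix of centroid-to-centroid distances)."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    # One centroid per cluster: the mean of that cluster's member vectors.
    centroids = np.stack([vectors[labels == c].mean(axis=0) for c in cluster_ids])
    # Pairwise Euclidean distances between centroids via broadcasting.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return cluster_ids, centroids, dists

def rank_far_out_clusters(vectors, labels):
    """Rank clusters by mean distance from their centroid to all other centroids."""
    cluster_ids, _, dists = centroid_distances(vectors, labels)
    n = len(cluster_ids)
    # The diagonal is zero, so divide the row sums by (n - 1) other clusters.
    mean_dist = dists.sum(axis=1) / max(n - 1, 1)
    order = np.argsort(-mean_dist)  # most distant cluster first
    return [(cluster_ids[i].item(), float(mean_dist[i])) for i in order]

# Toy data: two clusters close together, one far away.
toy_vectors = [[0, 0], [0, 2], [2, 0], [2, 2], [20, 20], [20, 22]]
toy_labels = [0, 0, 1, 1, 2, 2]
ranking = rank_far_out_clusters(toy_vectors, toy_labels)
```

The first entry of ranking is the cluster whose centroid is, on average, farthest from the others. Sampling a few documents from the top-ranked clusters would be one way to look at the "far out" regions.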

Note that for any vector, you can call the Doc2Vec doc-vectors' most_similar() with a false-ish topn value (0 or None, depending on the gensim version) to get the unsorted similarities to all other doc-vectors in the model. You could then find the least-similar vectors in that set. If your dataset is small enough to do this for all doc-vectors (or a large sample of them), then the docs that appear in the "bottom N" least-similar lists of the largest number of other vectors would perhaps be the most "far out".
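The "bottom N" counting idea could be sketched roughly as follows. To keep the example self-contained, it computes cosine similarities directly with NumPy as a stand-in for the similarities you would get back from most_similar(). The function name far_out_counts and the bottom_n parameter are hypothetical. It assumes no all-zero vectors and that the full similarity matrix fits in memory, which should be fine for a few thousand documents.

```python
import numpy as np

def far_out_counts(vectors, bottom_n=10):
    """Count how often each doc is among the bottom_n least-similar docs of another doc."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalise rows so the dot product equals cosine similarity
    # (assumes no zero-length vectors).
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # A doc should never count as "least similar" to itself.
    np.fill_diagonal(sims, np.inf)
    # For each row, indices of the bottom_n smallest similarities (unordered).
    bottom = np.argpartition(sims, bottom_n - 1, axis=1)[:, :bottom_n]
    # Tally appearances across all rows.
    return np.bincount(bottom.ravel(), minlength=len(vectors))

# Toy data: four vectors pointing roughly the same way, one pointing the opposite way.
toy = [[1, 0], [1, 0.1], [0.9, 0.1], [1, -0.1], [-1, 0]]
counts = far_out_counts(toy, bottom_n=1)
most_far_out = int(np.argmax(counts))
```

Docs with the highest counts are the ones most often found at the bottom of other docs' similarity rankings. With a real model, the per-document similarities would come from the doc-vectors themselves rather than from this NumPy stand-in.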

Whether this idea of "far out" is actually shown in the data, or useful, isn't clear. (In high-dimensional spaces, everything can be quite "far" from everything else in ways that don't match our 2d/3d intuitions, and slight differences in some vectors being a little "further" might not correspond to useful distinctions.)
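One rough way to check whether "far" means much in your data is to look at how spread out the pairwise distances are relative to their average. The sketch below is only an illustration on random Gaussian data, not on doc2vec vectors, and the function name distance_contrast is invented for this example. A small value suggests that most pairs are at nearly the same distance, so "most distant" picks may not be meaningfully different from the rest.

```python
import numpy as np

def distance_contrast(vectors):
    """Return std/mean of all pairwise Euclidean distances (lower = less contrast)."""
    vectors = np.asarray(vectors, dtype=float)
    sq = (vectors ** 2).sum(axis=1)
    # Squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2 * vectors @ vectors.T
    # Clip tiny negative values caused by floating-point error before the sqrt.
    d = np.sqrt(np.clip(d2, 0, None))
    # Keep each unordered pair once (upper triangle, excluding the diagonal).
    pair = d[np.triu_indices(len(vectors), k=1)]
    return float(pair.std() / pair.mean())

rng = np.random.default_rng(0)
contrast_2d = distance_contrast(rng.normal(size=(500, 2)))
contrast_90d = distance_contrast(rng.normal(size=(500, 90)))
```

On this synthetic data, the 90-dimensional points show much less relative spread in their distances than the 2-dimensional ones. Running the same function on your own doc-vectors would give a sense of how much contrast there actually is to exploit.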
