Efficient nearest neighbour search for sparse matrices
I have a large corpus of data (text) that I have converted to a sparse term-document matrix (I am using scipy.sparse.csr_matrix to store it). I want to find, for every document, the top n nearest-neighbour matches. I was hoping that the NearestNeighbors routine in the Python scikit-learn library (sklearn.neighbors.NearestNeighbors, to be precise) would solve my problem, but the efficient algorithms that use space-partitioning data structures such as KD trees or ball trees do not work with sparse matrices. Only the brute-force algorithm works with sparse matrices, which is infeasible in my case as I am dealing with a large corpus.
Is there any efficient implementation of nearest-neighbour search for sparse matrices (in Python or in any other language)?

Thanks.
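For reference, a minimal sketch of the brute-force setup described above (the toy documents and parameters are made up, just to show that only algorithm='brute' accepts sparse input):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus; a real corpus would have many more documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stock markets fell sharply today",
]

# TfidfVectorizer returns a scipy.sparse CSR matrix.
X = TfidfVectorizer().fit_transform(docs)

# Only the brute-force algorithm accepts sparse input;
# cosine distance is a common choice for text.
nn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine").fit(X)
distances, indices = nn.kneighbors(X)

# indices[i] lists the 2 nearest documents for document i
# (the nearest match is the document itself, at distance 0).
print(indices)
```

This runs in O(n^2) document comparisons, which is exactly what becomes infeasible on a large corpus.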
Late answer: have a look at Locality-Sensitive Hashing.

Support in scikit-learn has been proposed here and here .
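A minimal sketch of one LSH variant, random-hyperplane hashing for cosine similarity, which works directly on sparse matrices (the data, the number of hyperplanes, and the bucketing scheme are illustrative assumptions, not a library API):

```python
from collections import defaultdict

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Toy sparse data: 100 documents in a 1000-dimensional term space.
X = sp.random(100, 1000, density=0.05, format="csr", random_state=0)

# Each of 16 random hyperplanes contributes one bit of the hash:
# the sign of the document's projection onto that hyperplane.
n_planes = 16
planes = rng.standard_normal((X.shape[1], n_planes))
bits = (X @ planes) > 0                     # (100, 16) boolean array
keys = bits.dot(1 << np.arange(n_planes))   # pack the bits into integer keys

# Bucket documents by hash key: documents with similar direction tend to
# land in the same bucket, so a query only compares against its bucket
# instead of the full corpus.
buckets = defaultdict(list)
for i, k in enumerate(keys):
    buckets[int(k)].append(i)

# Candidate neighbours for document 0 are the other members of its bucket;
# rank them exactly (e.g. by cosine distance) in a second pass.
candidates = buckets[int(keys[0])]
```

In practice you would use several independent hash tables (and fewer bits per table) to trade precision against recall; a single table, as here, can miss true neighbours that fall on the other side of one hyperplane.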
You can try using TruncatedSVD to convert your high-dimensional sparse data into low-dimensional dense data, and then run a ball tree on the result.
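A sketch of that pipeline with scikit-learn (the matrix dimensions and the choice of 50 components are made-up example values):

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Toy sparse term-document matrix standing in for the real corpus.
X = sp.random(200, 5000, density=0.01, format="csr", random_state=0)

# TruncatedSVD accepts sparse input and returns a dense low-dimensional
# embedding (this is latent semantic analysis when X holds tf-idf counts).
X_dense = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# The dense embedding can now use the fast ball-tree algorithm.
nn = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X_dense)
distances, indices = nn.kneighbors(X_dense)
print(indices.shape)  # (200, 5)
```

The trade-off is that neighbours are found in the reduced space, so the results are approximate with respect to distances in the original term space.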