
Efficient nearest neighbour search for sparse matrices

I have a large corpus of text data that I have converted to a sparse term-document matrix (I am using scipy.sparse.csr.csr_matrix to store it). For every document, I want to find the top n nearest-neighbour matches. I was hoping that the NearestNeighbors routine in the Python scikit-learn library (sklearn.neighbors.NearestNeighbors, to be precise) would solve my problem, but the efficient algorithms that use space-partitioning data structures such as KD trees or ball trees do not work with sparse matrices. Only the brute-force algorithm works with sparse matrices, and that is infeasible in my case because I am dealing with a large corpus.

Is there an efficient implementation of nearest-neighbour search for sparse matrices (in Python or in any other language)?

Thanks.

Late answer: have a look at locality-sensitive hashing (LSH).

Support in scikit-learn has been proposed here and here.
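To make the LSH suggestion concrete, here is a minimal sketch of one common variant, random-hyperplane hashing for cosine similarity, written directly against a SciPy CSR matrix. It is not the scikit-learn proposal mentioned above. The helper names (`l2_normalize_rows`, `build_lsh_tables`, `query_lsh`), the table and bit counts, and the random demo data are all assumptions made for illustration. The result is approximate: a true neighbour that never shares a bucket with the query is missed.

```python
from collections import defaultdict

import numpy as np
from scipy import sparse


def l2_normalize_rows(X):
    """Scale each sparse row to unit length (all-zero rows stay zero)."""
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    return sparse.diags(1.0 / norms).dot(X).tocsr()


def build_lsh_tables(X, n_tables=8, n_bits=12, seed=0):
    """Hash every row with n_tables independent sets of random hyperplanes."""
    rng = np.random.default_rng(seed)
    weights = 1 << np.arange(n_bits, dtype=np.int64)
    tables = []
    for _ in range(n_tables):
        planes = rng.standard_normal((X.shape[1], n_bits))
        # Sparse @ dense gives a dense (n_rows, n_bits) array of projections.
        bits = np.asarray(X @ planes) > 0
        # Pack the sign bits of each row into one integer bucket key.
        keys = bits.astype(np.int64) @ weights
        buckets = defaultdict(list)
        for row, key in enumerate(keys):
            buckets[int(key)].append(row)
        tables.append((keys, buckets))
    return tables


def query_lsh(X, tables, row, n_neighbors=5):
    """Approximate top neighbours of an indexed row (X must be row-normalised)."""
    candidates = set()
    for keys, buckets in tables:
        candidates.update(buckets[int(keys[row])])
    candidates.discard(row)
    cand = np.array(sorted(candidates), dtype=np.int64)
    if cand.size == 0:
        return cand, np.empty(0)
    # Exact cosine similarity, computed only against the candidate rows.
    sims = (X[cand] @ X[row].T).toarray().ravel()
    order = np.argsort(-sims)[:n_neighbors]
    return cand[order], sims[order]


# Demo on random data; row 200 is appended as an exact copy of row 0.
X = sparse.random(200, 1000, density=0.01, format="csr", random_state=0)
X = sparse.vstack([X, X[0]]).tocsr()
Xn = l2_normalize_rows(X)
tables = build_lsh_tables(Xn)
idx, sims = query_lsh(Xn, tables, 0, n_neighbors=3)
```

More tables raise recall at the cost of memory, and more bits per key shrink the buckets, so both values would need tuning on the real corpus.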

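You can try using TruncatedSVD to transform the high-dimensional sparse data into low-dimensional dense data, and then run a ball tree on the result.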
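A minimal sketch of that approach with scikit-learn, assuming random demo data in place of the real term-document matrix and an arbitrary choice of 50 components. The rows are L2-normalised after the projection, which is an addition here so that Euclidean distance in the ball tree ranks neighbours the same way cosine similarity would.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

# Stand-in for the real sparse term-document matrix (documents as rows).
X = sparse.random(500, 2000, density=0.01, format="csr", random_state=0)

# TruncatedSVD accepts sparse input directly and returns a dense array.
svd = TruncatedSVD(n_components=50, random_state=0)
X_dense = normalize(svd.fit_transform(X))

# Ask for n + 1 neighbours: each document's nearest match is itself.
nn = NearestNeighbors(n_neighbors=6, algorithm="ball_tree").fit(X_dense)
distances, indices = nn.kneighbors(X_dense)
```

The neighbours are exact in the reduced space but only approximate with respect to the original matrix, and tree-based search tends to lose its advantage over brute force as the number of components grows, so `n_components` is worth tuning.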
