Efficient way to compute distances between combinations of pandas frame columns

Question

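Task

I have a pandas data frame where:

the columns are document names
the rows are words in those documents
the numbers inside the frame cells are a measure of word relevance (word count, if you want to keep it simple)

I need to calculate a new doc-by-doc similarity matrix where:

the rows and columns are document names
the cells inside the frame are a similarity measure (1 - cosine distance) between the two documents

The cosine distance is conveniently provided by scipy.spatial.distance.cosine.

I'm currently doing this:

use itertools to create a list of all 2-combinations of the document names (the dataframe column names)
loop over these and build up a dictionary of {doc1: {doc2: similarity}}
after the loop, create a new frame using pandas.DataFrame(dict)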

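Problem

But this takes a very long time. The following shows the current speed on a MacBook Pro 13 with 16GB RAM and a 2.9GHz i5 CPU running the latest anaconda python 3.5 ... plotting the time taken against the number of document combinations.

You can see that 100,000 combinations take 1200 seconds. Extrapolating that to my corpus of 7944 documents, which creates 31,549,596 combinations, it would take about 5 days to calculate this similarity matrix!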
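The extrapolation can be checked with a few lines of arithmetic. This is only a sketch that assumes the measured rate (100,000 combinations in 1200 seconds) stays constant over the whole corpus; the variable names are illustrative.

```python
import math

n_docs = 7944
n_pairs = math.comb(n_docs, 2)  # number of unordered document pairs

# Measured rate from the timing run: 100,000 combinations in 1200 seconds
seconds_per_pair = 1200 / 100_000
est_days = n_pairs * seconds_per_pair / 86_400  # 86,400 seconds per day

print(n_pairs)             # 31549596
print(round(est_days, 1))  # 4.4
```

That is roughly 4.4 days, in line with the "about 5 days" estimate.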

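Any ideas?

I was previously building the data frame dynamically with df.ix[doc1, doc2] = similarity .. which was very, very much slower.
I've considered numba @jit, but it fails with pandas data structures.
I can't find a built-in function that will do all the work internally (in C?).
What I'm having to do tactically is randomly sample the documents to create a much smaller set to work with ... currently a fraction of 0.02 leads to about 20 minutes of calculation!

Here's the code (github)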

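import collections
import itertools

import pandas
import scipy.spatial.distance

# relevance_index (words x docs DataFrame) and docs_sample are defined earlier
# assumption: the dict of dicts is initialised before the loop, e.g. like this
doc_similarity_dict = collections.defaultdict(dict)
docs_combinations = itertools.combinations(docs_sample, 2)
for doc1, doc2 in docs_combinations:
    # scipy's cosine function normalises the vectors but returns a distance, so subtract it from 1.0
    doc_similarity_dict[doc2].update({doc1: 1.0 - scipy.spatial.distance.cosine(relevance_index[doc1], relevance_index[doc2])})

# convert dict to pandas dataframe
doc_similarity_matrix = pandas.DataFrame(doc_similarity_dict)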

Simple example

@MaxU asked for an illustrative example.

Relevance matrix (word counts here, just to keep it simple):

...     doc1 doc2 doc3
wheel   2.   3.   0.
seat    2.   2.   0.
lights  0.   1.   1.
cake    0.   0.   5.

Calculated similarity matrix based on the 2-combinations (doc1, doc2), (doc2, doc3), (doc1, doc3):

...     doc2 doc3
doc1    0.9449  0.
doc2    -       0.052

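Take the top-left value, 0.9449 .. this is the dot product (2*3 + 2*2 + 0 + 0) = 10, normalised by the lengths of the vectors ... so dividing by sqrt(8) and sqrt(14) gives 0.9449. You can see that there is no similarity between doc1 and doc3: the dot product is zero.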
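The three values in the small example can be reproduced directly from the definition. This is a minimal sketch in plain Python; `cosine_similarity` is an illustrative helper, not a library function, and the vectors are the columns of the relevance matrix above (wheel, seat, lights, cake).

```python
import math

# Columns of the example relevance matrix, in the order wheel, seat, lights, cake
docs = {
    "doc1": [2.0, 2.0, 0.0, 0.0],
    "doc2": [3.0, 2.0, 1.0, 0.0],
    "doc3": [0.0, 0.0, 1.0, 5.0],
}

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

print(round(cosine_similarity(docs["doc1"], docs["doc2"]), 4))  # 0.9449
print(round(cosine_similarity(docs["doc1"], docs["doc3"]), 4))  # 0.0
print(round(cosine_similarity(docs["doc2"], docs["doc3"]), 4))  # 0.0524
```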

Scale this up from 3 documents with 4 words ... to 7944 documents, which creates 31,549,596 combinations ...

Answer 1

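This is about as efficient as I can make the algorithm without moving into multiprocessing (bleh). The function uses numpy arrays for all of the calculations.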

import numpy as np

def cos_sim(data_frame):
    # create a numpy array from the data frame
    a = data_frame.values
    # get the number of documents
    n = a.shape[-1]
    # create an array of size docs x docs to populate
    out = np.ravel(np.zeros(shape=(n, n)))

    for i in range(n):
        # roll the array one step at a time, calculating the cosine similarity each time
        r = np.roll(a, -i, axis=1)
        cs = np.sum(a[:,:n-i]*r[:,:n-i], axis=0) / (
                np.sqrt(np.sum(a[:,:n-i]*a[:,:n-i], axis=0))
                *np.sqrt(np.sum(r[:,:n-i]*r[:,:n-i], axis=0)))

        # push the cosine similarity to the output array's i-th off-diagonal
        out[i:n*n-i*n:n+1] = cs

    return out.reshape((n,n))
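The same quantities can also be obtained without the Python loop, by scaling each column to unit length and taking a single matrix product. This is a different formulation from the roll-based function above, shown only as a minimal sketch: `cos_sim_matmul` is a hypothetical name, and it assumes that no document column is all zeros (otherwise the division produces NaN).

```python
import numpy as np

def cos_sim_matmul(a):
    """Cosine similarity between all pairs of columns of a 2D array."""
    # Euclidean length of each column (one value per document)
    norms = np.sqrt(np.sum(a * a, axis=0))
    # Scale every column to unit length; assumes no all-zero columns
    unit = a / norms
    # Entry (i, j) is the dot product of unit columns i and j
    return unit.T @ unit

# The small example from the question: rows are words, columns are documents
a = np.array([[2., 3., 0.],
              [2., 2., 0.],
              [0., 1., 1.],
              [0., 0., 5.]])
sim = cos_sim_matmul(a)
print(np.round(sim, 4))
```

Unlike the function above, this fills both triangles of the result. The output is a dense docs x docs array, which for 7944 documents is about 0.5 GB of float64 values.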

Answer 2

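Numba would be a good solution for this. As I think you know, it does not support Pandas DataFrames, but it is built around NumPy arrays. This is not a problem: you can easily and quickly convert your DataFrame to a 2D array and pass that to a Numba function (which will be pretty much the code you already have, just with an @njit decorator at the top).

Also note that instead of a dict-of-dicts for the results, you can use one triangle of a square matrix to store them: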

     doc1 doc2 doc3
doc1  NAN  NAN  NAN
doc2  ...  NAN  NAN
doc3  ...  ...  NAN
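The two suggestions above can be sketched together: a plain nested loop over a 2D float array, compiled with @njit, that writes each similarity into the lower triangle and leaves everything else as NaN. This is an illustration under stated assumptions, not the asker's actual code: `similarity_lower` is a hypothetical name, the input is assumed to be a float64 array such as `df.values`, and no column is assumed to be all zeros. The try/except only lets the sketch run where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (slowly) without Numba
    def njit(func):
        return func

@njit
def similarity_lower(m):
    # m has one row per word and one column per document
    n_words, n_docs = m.shape
    out = np.full((n_docs, n_docs), np.nan)
    for i in range(n_docs):
        for j in range(i):  # only pairs below the diagonal
            dot = 0.0
            norm_i = 0.0
            norm_j = 0.0
            for k in range(n_words):
                dot += m[k, i] * m[k, j]
                norm_i += m[k, i] * m[k, i]
                norm_j += m[k, j] * m[k, j]
            out[i, j] = dot / np.sqrt(norm_i * norm_j)
    return out

# The small example from the question as a float64 array
m = np.array([[2., 3., 0.],
              [2., 2., 0.],
              [0., 1., 1.],
              [0., 0., 5.]])
result = similarity_lower(m)
print(result)
```

The result can be wrapped back into a frame with `pandas.DataFrame(result, index=df.columns, columns=df.columns)` if labelled rows and columns are needed.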

Edit: You've now implemented it using Numba, but got only a 2.5x speedup. I ran some experiments and found a big win:

In [66]: x = np.random.random((1000,1000))

In [67]: y = np.array(x, order='F')

In [68]: %timeit similarity_jit(x)
1 loop, best of 3: 13.7 s per loop

In [69]: %timeit similarity_jit(y)
1 loop, best of 3: 433 ms per loop

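That is, your algorithm will be much, much faster if it operates on contiguous chunks of data, due to caching. Since the kernel of your algorithm is numpy.dot(m[:,i], m[:,j]), and m[:,i] takes one column, you are better off orienting your data in "Fortran order" (column-major order) first, so that m[:,i] gives one contiguous array (because the array is laid out "transposed" in memory).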
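The layout difference can be seen from the array flags. A small sketch with an arbitrary 4 x 3 array: a column slice of a C-ordered array is strided, while the same slice of a Fortran-ordered copy is contiguous.

```python
import numpy as np

x = np.random.random((4, 3))   # default C (row-major) order
y = np.array(x, order='F')     # same values, column-major layout

print(x[:, 0].flags['C_CONTIGUOUS'])  # False
print(y[:, 0].flags['C_CONTIGUOUS'])  # True
print(np.array_equal(x, y))           # True
```

Only the memory layout changes; indexing and results stay the same, so the conversion can be done once before calling the compiled function.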

2 answers

Answer 1
2 2016-11-16 14:04:11

Answer 2
1 (accepted) 2016-11-16 13:45:05
