Apply a custom function to every combination of columns in a pandas DataFrame

Question

I'm trying to calculate the cosine similarity between every pair of columns in my pandas DataFrame. I've written a custom function to calculate cosine similarity and now need to apply it to every pair of columns. Each column contains a 0 if a user has not interacted with it and a 1 if the user has. Each row therefore contains the total viewing behaviour of one user.

I'm currently using a for loop, but it's too slow for larger samples of data; e.g. my current sample is 3408 columns x 28000 rows.

My guess is that a lambda function is the way to go, but I'm unsure how to apply it properly.

Initial dataframe:

sm_views = pd.read_sql(postgreSQL_select_Query, connection).groupby().size().unstack(fill_value=0)

Cos rating function:

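import numpy as np

def cos_rating_calculator(x, y):
    # x and y are 0/1 columns, so sum(x) equals sum(x**2) and
    # np.sqrt(sum(x)) is the Euclidean norm of x.
    dot_product = np.dot(x, y)
    distance1 = np.sqrt(sum(x))
    distance2 = np.sqrt(sum(y))
    cos_rating = dot_product / (distance1 * distance2)
    return cos_rating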
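
The function above divides by `sqrt(sum(x))` rather than the Euclidean norm `sqrt(sum(x**2))`. For 0/1 data the two coincide, so the result is the usual cosine similarity, but for non-binary values they differ. A small check of that, using made-up vectors and a reference helper (`cosine_similarity`) that is not part of the original post:

```python
import numpy as np

def cos_rating_calculator(x, y):
    # Same logic as the function in the question.
    dot_product = np.dot(x, y)
    distance1 = np.sqrt(sum(x))
    distance2 = np.sqrt(sum(y))
    return dot_product / (distance1 * distance2)

def cosine_similarity(x, y):
    # Textbook definition, valid for any real-valued vectors.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Binary vectors (illustrative values): both functions agree.
x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0])
binary_custom = cos_rating_calculator(x, y)
binary_reference = cosine_similarity(x, y)

# Count vectors (illustrative values): the shortcut no longer matches.
counts_x = np.array([0, 3, 1])
counts_y = np.array([0, 4, 1])
counts_custom = cos_rating_calculator(counts_x, counts_y)
counts_reference = cosine_similarity(counts_x, counts_y)
```

With count data the shortcut can even exceed 1, so it should only be relied on when every entry is 0 or 1.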

Code to calculate association combinations:

import itertools

combinations = list(itertools.combinations(sm_views.columns, 2))

results = []

# Score each unordered pair once and store it in both orientations.
for a, b in combinations:
    association_metric = cos_rating_calculator(sm_views[a], sm_views[b])
    results.append((a, b, association_metric))
    results.append((b, a, association_metric))

to_matrix = pd.DataFrame(results, columns=['a', 'b', 'association'])
association_matrix = to_matrix.pivot(index='a', columns='b', values='association')

This works fine for smaller datasets, but the current dataset is too large for this method to be feasible. My desired output is a column x column matrix whose values are the degree of association between each pair of columns.
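
To make the desired output concrete, here is a minimal sketch of such a column x column matrix computed with a single matrix product instead of a Python loop. The toy frame and its column names (`item_a`, `item_b`, `item_c`) are hypothetical stand-ins for `sm_views`, and the sketch assumes no column is entirely zero (such a column has no defined cosine similarity).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are items, 1 = interacted.
sm_views = pd.DataFrame(
    {"item_a": [1, 0, 1, 1], "item_b": [1, 1, 0, 1], "item_c": [0, 0, 1, 0]}
)

X = sm_views.to_numpy(dtype=float)

# Entry (i, j) is the dot product of column i with column j.
dot_products = X.T @ X

# The diagonal holds each column's squared Euclidean norm.
norms = np.sqrt(np.diag(dot_products))

# Divide every dot product by the product of the two column norms.
similarity = dot_products / np.outer(norms, norms)

association_matrix = pd.DataFrame(
    similarity, index=sm_views.columns, columns=sm_views.columns
)
```

Unlike the pivot built from `itertools.combinations`, this matrix also has its diagonal filled in (each column has similarity 1 with itself).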

Answer 1

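import itertools
import scipy.spatial.distance
# pdist compares rows, so transpose to compare columns.
# Note: 'cosine' returns cosine distance, i.e. 1 - cosine similarity.
result = pd.DataFrame(list(itertools.combinations(sm_views.columns, 2)), columns=['a','b'])
result['association'] = scipy.spatial.distance.pdist(sm_views.T, 'cosine')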

With this example sm_views:

   col1  col2  col3
0     0     0     0
1     3     4     2
2     1     1     5

we get

      a     b  association
0  col1  col2     0.002946
1  col1  col3     0.354058
2  col2  col3     0.414509
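
Note that `pdist` with `'cosine'` returns cosine distance (1 minus cosine similarity) in condensed form, which is why nearly parallel columns such as col1 and col2 score close to 0. If a square column x column similarity matrix is wanted instead, one possible sketch, assuming the same example values and no all-zero columns, expands the condensed vector with `scipy.spatial.distance.squareform`:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same values as the example frame above.
sm_views = pd.DataFrame(
    {"col1": [0, 3, 1], "col2": [0, 4, 1], "col3": [0, 2, 5]}
)

# pdist compares rows, so transpose to compare columns.
condensed_distance = pdist(sm_views.T.to_numpy(dtype=float), metric="cosine")

# Expand the condensed vector into a symmetric matrix with a zero diagonal.
distance_matrix = squareform(condensed_distance)

# Convert cosine distance back to cosine similarity.
similarity = 1.0 - distance_matrix

association_matrix = pd.DataFrame(
    similarity, index=sm_views.columns, columns=sm_views.columns
)
```

The diagonal comes out as 1 here, whereas the pivot in the question leaves it empty because a column is never paired with itself.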

Answer 1
0 Accepted 2019-07-24 15:21:30
