
Compute pairwise cosine similarity given a row id of a sparse matrix

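I want to compute the pairwise cosine similarity between one row of a sparse matrix and all the remaining rows. (Why? Because each row is a vectorized product_title, and I want to retrieve similar products given an id value.)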

I already have df_cleaned as a <504x41732 sparse matrix> (each row is one product title, and the columns are the tokens produced by vectorizing the titles).

I defined:

def pairw_cos(prod_idx):
    prod = df_cleaned[prod_idx]
    foll_idx = prod_idx + 1  # a trick to select the remaining rows on the following line
    candidates_matrix = scipy.sparse.vstack([df_cleaned[:prod_idx, :], df_cleaned[foll_idx:, :]])
    simil_cosine = {}

    for candidates_idx, single_candidate in candidates_matrix.iterrows():
        single_simil = cosine_similarity(prod,single_candidate)
        simil_cosine[candidates_idx] = single_simil
    return pd.Series(simil_cosine)

But this does not work (because sparse matrices have no iterrows method). Then I tried:

for row in candidates_matrix:
    for candidates_idx, single_candidate in row:
        single_simil = cosine_similarity(prod,single_candidate)
        simil_cosine[candidates_idx] = single_simil

And when calling the function, I got:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-4c45754152cc> in <module>()
----> 1 pairw_cos2(2)

<ipython-input-52-12d55d3c35e5> in pairw_cos2(prod_idx)
      7 
      8     for row in candidates_matrix:
----> 9         for candidates_idx, single_candidate in row:
     10             single_simil = cosine_similarity(prod,single_candidate)
     11             simil_cosine[candidates_idx] = single_simil

ValueError: not enough values to unpack (expected 2, got 1)
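
A plausible reading of this traceback: iterating over a sparse matrix yields one row at a time with no index attached, and iterating over a single 1 x n row yields just one item (the row itself), so there is nothing to unpack into two names. The sketch below illustrates this on a small made-up matrix; `toy` is a hypothetical stand-in for df_cleaned, and the behavior shown assumes a `scipy.sparse.csr_matrix`.

```python
import numpy as np
import scipy.sparse

# Hypothetical stand-in for df_cleaned: 3 rows (titles) x 4 columns (tokens)
toy = scipy.sparse.csr_matrix(np.array([
    [1, 0, 2, 0],
    [0, 3, 0, 0],
    [4, 0, 0, 5],
]))

# Iterating over the matrix yields one sparse 1 x 4 row per step, without an index
rows = [row for row in toy]
row_shapes = [row.shape for row in rows]

# Iterating over a single row yields only one item, so a two-name unpack cannot succeed
items_in_first_row = [item for item in rows[0]]

# enumerate supplies the (index, row) pairs that the failed loop expected
indexed_shapes = {idx: row.shape for idx, row in enumerate(toy)}
```

Wrapping the outer loop in `enumerate` is one way to get an index alongside each row without a nested loop.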

In case anyone has the same question, I finally solved it:

import pandas as pd
import scipy.sparse
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_cosine(prod_idx):
    prod = df_cleaned[prod_idx]
    foll_idx = prod_idx + 1
    # Stack the rows before and after prod_idx, i.e. every row except the product itself
    candidates_matrix = scipy.sparse.vstack([df_cleaned[:prod_idx, :], df_cleaned[foll_idx:, :]])
    simil_cosine = {}

    # Iterating over the sparse matrix yields one row at a time.
    # Note: index is the position in candidates_matrix, not the original row id.
    for index, row in enumerate(candidates_matrix):
        simil_cosine[index] = cosine_similarity(row, prod)
    return pd.Series(simil_cosine)
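
As a possible alternative to the row-by-row loop, `cosine_similarity` from scikit-learn accepts a whole sparse matrix as its second argument, so one call can score the chosen row against every row at once. The following is only a sketch under assumptions: `pairwise_cosine_vectorized` and `toy` are hypothetical names, the matrix is passed in as a parameter instead of being read from the global df_cleaned, and the result is labeled with the original row ids rather than positions in a stacked candidates matrix.

```python
import numpy as np
import pandas as pd
import scipy.sparse
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_cosine_vectorized(matrix, prod_idx):
    # One call compares the chosen row against every row; the result has shape (1, n_rows)
    sims = cosine_similarity(matrix[prod_idx], matrix).ravel()
    # The default integer index equals the original row id; drop the product itself
    return pd.Series(sims).drop(prod_idx)

# Hypothetical stand-in for df_cleaned: 4 rows (titles) x 3 columns (tokens)
toy = scipy.sparse.csr_matrix(np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 0.0],
]))

scores = pairwise_cosine_vectorized(toy, 1)
most_similar_id = scores.idxmax()  # original row id of the closest candidate
```

Because the Series keeps the original row ids, `scores.sort_values(ascending=False)` could be used to look up the most similar products directly by id.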

