Apply column-wise custom function to 2 dataframes, creating a new dataframe

I have 2 dataframes:

import pandas as pd

df_up = pd.DataFrame({"u1": [2, -3, 5, 0],
                      "u2": [1, 0, 5, -2]},
                     index=["ta", "tb", "tc", "td"])

df_tt = pd.DataFrame({"q1": [1, 0, 1, 0],
                      "q2": [1, 0, 1, 1],
                      "q3": [0, 1, 0, 0]},
                     index=["ta", "tb", "tc", "td"])

I want to create a new dataframe that contains the cosine similarity between every column of df_up and every column of df_tt. Both dataframes have the same number of rows. Ideally, the solution would work with a custom function, such as:

from scipy import spatial
def cosine_similarity(array_1, array_2):
    return 1 - spatial.distance.cosine(array_1,array_2)
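
For reference, the quantity this function returns is the dot product of the two vectors divided by the product of their Euclidean norms. The following is a minimal sketch that spells this out with NumPy and compares it with the scipy-based version for the u1/q1 pair; the name cosine_similarity_np is only illustrative and is not part of the original post.

```python
import numpy as np
from scipy import spatial

def cosine_similarity_np(a, b):
    # Illustrative helper (not from the original post):
    # cosine similarity = (a . b) / (||a|| * ||b||)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Columns u1 and q1 from the example dataframes
u1 = [2, -3, 5, 0]
q1 = [1, 0, 1, 0]

sim_np = cosine_similarity_np(u1, q1)
sim_scipy = 1 - spatial.distance.cosine(u1, q1)
print(round(sim_np, 4), round(sim_scipy, 4))
```

Both values should agree with the q1/u1 entry of the expected result. Note that the ratio is undefined when either vector is all zeros.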

The result would look like this:

    u1       u2
q1  0.8029   0.7745
q2  0.6556   0.4216
q3  -0.4866  0.0

Is there an "elegant" way of solving this, or is iterating through the 2 dataframes the only way?

A solution using cdist:

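from scipy.spatial.distance import cdist
# cdist compares rows, so transpose both frames to compare columns;
# 1 - cosine distance gives the cosine similarity
ary = (1 - cdist(df_up.T.values, df_tt.T.values, metric='cosine')).T
df = pd.DataFrame(ary, columns=df_up.columns, index=df_tt.columns)
df
Out[258]: 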
          u1        u2
q1  0.802955  0.774597
q2  0.655610  0.421637
q3 -0.486664  0.000000
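
For cosine similarity specifically, the same table can probably also be obtained with plain pandas/NumPy: scale every column to unit length, then take a matrix product. This is only a sketch of an alternative to cdist, not something from the original answer, and it assumes no column is all zeros (such a column would produce NaN).

```python
import numpy as np
import pandas as pd

df_up = pd.DataFrame({"u1": [2, -3, 5, 0],
                      "u2": [1, 0, 5, -2]},
                     index=["ta", "tb", "tc", "td"])
df_tt = pd.DataFrame({"q1": [1, 0, 1, 0],
                      "q2": [1, 0, 1, 1],
                      "q3": [0, 1, 0, 0]},
                     index=["ta", "tb", "tc", "td"])

# Divide each column by its Euclidean norm (axis=0 gives one norm per column)
up_unit = df_up / np.linalg.norm(df_up.values, axis=0)
tt_unit = df_tt / np.linalg.norm(df_tt.values, axis=0)

# Dot products of unit-length columns are cosine similarities;
# rows come from df_tt's columns, columns from df_up's columns
sim = tt_unit.T @ up_unit
print(sim.round(4))
```

Unlike the cdist call, this approach is tied to cosine similarity and does not generalize to an arbitrary custom function.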

A more generic way is to use DataFrame.corr with a callable as the method, as follows:

# There was a typo in the original method: array_1, array_2

from scipy import spatial

def cosine_similarity(array1, array2):
    return 1 - spatial.distance.cosine(array1, array2)

output = (pd.concat([df_up, df_tt], axis=1)
            .corr(method=cosine_similarity)
            .drop(columns=df_tt.columns, index=df_up.columns))
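
Note that corr with a callable builds the full matrix over all five columns, and pandas documents that the result is symmetric with 1 on the diagonal regardless of the callable. If only the cross pairs are needed, or the custom function is not symmetric, a small explicit helper may be simpler. The sketch below is one possible way to write it; pairwise_columns is a hypothetical name, not a pandas function.

```python
import pandas as pd
from scipy import spatial

def cosine_similarity(array1, array2):
    return 1 - spatial.distance.cosine(array1, array2)

def pairwise_columns(df_a, df_b, func):
    # Hypothetical helper: apply func to every (column of df_a, column of df_b)
    # pair. Result has df_b's columns as rows and df_a's columns as columns.
    data = {
        a_name: [func(df_a[a_name].to_numpy(), df_b[b_name].to_numpy())
                 for b_name in df_b.columns]
        for a_name in df_a.columns
    }
    return pd.DataFrame(data, index=df_b.columns)

df_up = pd.DataFrame({"u1": [2, -3, 5, 0],
                      "u2": [1, 0, 5, -2]},
                     index=["ta", "tb", "tc", "td"])
df_tt = pd.DataFrame({"q1": [1, 0, 1, 0],
                      "q2": [1, 0, 1, 1],
                      "q3": [0, 1, 0, 0]},
                     index=["ta", "tb", "tc", "td"])

result = pairwise_columns(df_up, df_tt, cosine_similarity)
print(result.round(4))
```

This still iterates over column pairs in Python, so for large frames a vectorized routine such as cdist is likely to be faster.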

