Faster way to perform a function on each row with every other row in a DataFrame?

Question

I want to perform an operation on each row paired with every other row in a DataFrame. The obvious way is to use nested for loops, which is, as expected, very slow.

Are there any suggestions for a faster and better way to achieve the same thing?

This is the DataFrame, where each row is a user vector and the index holds the usernames. In practice there can be hundreds of usernames.

import pandas as pd
df1 = pd.DataFrame({"A":[11,2,3], "B":[4,5,6], "C":[7,8,9]}, index=["U1","U2", "U3"])

Nested Loop Method

import numpy as np
def some_func(u1_vec,u2_vec):
    # this could be any function using above 2 user vectors
    return np.minimum(u1_vec, u2_vec).sum()/np.maximum(u1_vec, u2_vec).sum()


index_list = list(df1.index) # contains usernames
vector_cols = list(df1.columns) # contains colnames

min_max_all = {} # will be used to store the vector interaction 
for index_u1 in index_list:
    u1_vec = df1.loc[index_u1, vector_cols]
    min_max_all[index_u1] = {}
    for index_u2 in index_list:
        u2_vec = df1.loc[index_u2, vector_cols]
        min_max_all[index_u1][index_u2] = some_func(u1_vec, u2_vec)

Result - min_max_all

{
'U1': {'U1': 1.0, 'U2': 0.5416666666666666, 'U3': 0.5384615384615384},
'U2': {'U1': 0.5416666666666666, 'U2': 1.0, 'U3': 0.8333333333333334},
'U3': {'U1': 0.5384615384615384, 'U2': 0.8333333333333334, 'U3': 1.0}
}

Answer 1

I think the best way is to use NumPy and write code that is specific to the one function you need.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"A":[11,2,3], "B":[4,5,6], "C":[7,8,9]}, index=["U1","U2", "U3"])
df1_np = df1.to_numpy()

x = np.minimum(df1_np[:, np.newaxis], df1_np).sum(axis=2)
y = np.maximum(df1_np[:, np.newaxis], df1_np).sum(axis=2)

print(x/y)
array([[1.        , 0.54166667, 0.53846154],
       [0.54166667, 1.        , 0.83333333],
       [0.53846154, 0.83333333, 1.        ]])
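
To see why this works, it can help to trace the array shapes. The sketch below reuses the same data; the intermediate names (`arr`, `left`, `pairwise_min`, `pairwise_max`, `ratio`) are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [11, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                   index=["U1", "U2", "U3"])
arr = df1.to_numpy()                   # shape (3, 3): (users, features)

# Insert a new axis so that broadcasting pairs every row with every row.
left = arr[:, np.newaxis]              # shape (3, 1, 3)
pairwise_min = np.minimum(left, arr)   # shape (3, 3, 3): (user_i, user_j, feature)
pairwise_max = np.maximum(left, arr)   # same shape as pairwise_min

# Summing over the last (feature) axis leaves one value per user pair.
ratio = pairwise_min.sum(axis=2) / pairwise_max.sum(axis=2)
print(ratio.shape)
```

Note that the intermediate arrays hold n × n × m elements for n users and m columns. That should be fine for hundreds of users, but for much larger inputs it may be necessary to process the rows in chunks.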

To build a dictionary like the one in your question:

z = x/y
# Key by username (df1.index) rather than column name, to match the question's output
{ui: {uj: z[i][j] for j, uj in enumerate(df1.index)}
    for i, ui in enumerate(df1.index)}
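
As an alternative to the dict comprehension, the matrix could be wrapped in a labelled DataFrame and converted with `to_dict`. This is a sketch rather than part of the original answer; `arr`, `result` and `as_dict` are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [11, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                   index=["U1", "U2", "U3"])
arr = df1.to_numpy()
z = (np.minimum(arr[:, np.newaxis], arr).sum(axis=2)
     / np.maximum(arr[:, np.newaxis], arr).sum(axis=2))

# Label both axes with the usernames so lookups read like the nested dict.
result = pd.DataFrame(z, index=df1.index, columns=df1.index)
print(result.loc["U1", "U2"])

# orient="index" yields {row_label: {column_label: value}}.
as_dict = result.to_dict(orient="index")
```

Keeping the result as a DataFrame also allows label-based slicing, such as `result.loc["U1"]` for all scores involving one user, without building the nested dictionary at all.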
