Faster way to apply a function to each row against every other row in a DataFrame?

Question

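I want to perform an operation on each row of a DataFrame against every other row. The obvious approach is a nested for loop, which I expect to be very slow.

I am looking for suggestions on a faster, better way to achieve the same result.

This is a DataFrame where each row is a user vector, with the index set to the usernames. In practice there can be hundreds of usernames.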

import pandas as pd
df1 = pd.DataFrame({"A":[11,2,3], "B":[4,5,6], "C":[7,8,9]}, index=["U1","U2", "U3"])

Nested Loop Method

import numpy as np
def some_func(u1_vec,u2_vec):
    # this could be any function using above 2 user vectors
    return np.minimum(u1_vec, u2_vec).sum()/np.maximum(u1_vec, u2_vec).sum()


index_list = list(df1.index) # contains usernames
vector_cols = list(df1.columns) # contains colnames

min_max_all = {} # will be used to store the vector interaction 
for index_u1 in index_list:
    u1_vec = df1.loc[index_u1, vector_cols]
    min_max_all[index_u1] = {}
    for index_u2 in index_list:
        u2_vec = df1.loc[index_u2, vector_cols]
        min_max_all[index_u1][index_u2] = some_func(u1_vec, u2_vec)

Result - min_max_all

{
'U1': {'U1': 1.0, 'U2': 0.5416666666666666, 'U3': 0.5384615384615384},
'U2': {'U1': 0.5416666666666666, 'U2': 1.0, 'U3': 0.8333333333333334},
'U3': {'U1': 0.5384615384615384, 'U2': 0.8333333333333334, 'U3': 1.0}
}

Answer 1

I think the best approach is to use NumPy and write code tailored to the specific function you need.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"A":[11,2,3], "B":[4,5,6], "C":[7,8,9]}, index=["U1","U2", "U3"])
df1_np = df1.to_numpy()

x = np.minimum(df1_np[:, np.newaxis], df1_np).sum(axis=2)
y = np.maximum(df1_np[:, np.newaxis], df1_np).sum(axis=2)

print(x/y)
array([[1.        , 0.54166667, 0.53846154],
       [0.54166667, 1.        , 0.83333333],
       [0.53846154, 0.83333333, 1.        ]])
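
To make the broadcasting step above easier to follow, here is a minimal sketch of the shapes involved. The array `a` holds the same values as `df1`; the names `left`, `pairwise_min`, and `row0_vs_row1` are only illustrative and do not appear in the code above.

```python
import numpy as np

# Same values as df1 in the question: one row per user.
a = np.array([[11, 4, 7],
              [2, 5, 8],
              [3, 6, 9]])

# Adding an axis turns (3, 3) into (3, 1, 3), so each row gets its own "slot".
left = a[:, np.newaxis]

# Broadcasting (3, 1, 3) against (3, 3) yields (3, 3, 3):
# pairwise_min[i, j] is the element-wise minimum of row i and row j.
pairwise_min = np.minimum(left, a)

print(left.shape)          # (3, 1, 3)
print(pairwise_min.shape)  # (3, 3, 3)

# Row U1 against row U2: element-wise min of [11, 4, 7] and [2, 5, 8].
row0_vs_row1 = pairwise_min[0, 1]
print(row0_vs_row1)        # [2 4 7]
```

Summing over the last axis (`axis=2`) then collapses each of these per-pair vectors to a single number, which is why `x` and `y` end up with one entry per pair of users. Note that the intermediate array has n × n × (number of columns) elements, so memory use grows quadratically with the number of users.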

To build a dictionary like the one in your question:

z = x/y
# Key by the usernames in df1.index (not df1.columns) to match the question's output
{ci: {cj: z[i][j] for j, cj in enumerate(df1.index)}
    for i, ci in enumerate(df1.index)}
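
As an alternative to a hand-written comprehension, the result can be wrapped in a DataFrame labelled by username and then converted. This is a sketch under the assumption that a nested dict keyed by username is the desired output; `arr`, `result_df`, and `result_dict` are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [11, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                   index=["U1", "U2", "U3"])
arr = df1.to_numpy()

# Pairwise sum-of-minima divided by sum-of-maxima, as in the answer above.
z = (np.minimum(arr[:, np.newaxis], arr).sum(axis=2)
     / np.maximum(arr[:, np.newaxis], arr).sum(axis=2))

# Label both axes with the usernames so lookups work by name.
result_df = pd.DataFrame(z, index=df1.index, columns=df1.index)

# orient="index" gives {row_label: {column_label: value}}.
result_dict = result_df.to_dict(orient="index")
print(result_dict["U1"]["U2"])
```

Keeping the labelled DataFrame around may be more convenient than the dict, since `result_df.loc["U1", "U2"]` gives the same lookup without a conversion step.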

Answer 1: accepted, 2022-02-26 12:51:30
