How to speed up pairwise operations in numpy / pandas

Question

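I currently have a DataFrame dta with more than 100,000 rows and more than 100 columns, where dta[i, j] is a list of elements. My goal is to compute a symmetric table a where a[i,j] = mean([len(intersect(dta[k, i], dta[k, j])) for each row k]), i.e. for every pair of columns, compute the size of the row-wise intersection and then average over all rows.

Simple code to create an example: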

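import numpy as np
import pandas as pd

dta = pd.DataFrame(
    {
        "a": ["a", "b", "a", "a", "b", "a"],
        "b": ["a", "b", "a", "a", "b", "a"],
        "c": ["a", "ee", "c", "a", "b", "a"],
        "d": ["aaa b", "bbb a", "ccc c", "a", "b", "a"],
    }
)
# Split every cell into a list of tokens
# (on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap).
dta = dta.applymap(lambda x: x.split())
table = pd.DataFrame(np.zeros((4, 4)))
for i in range(4):
    for j in range(i, 4):
        # Row-wise intersection size of columns i and j, averaged over all rows.
        table.iloc[i, j] = dta.apply(
            lambda x: len(set(x.iloc[i]).intersection(set(x.iloc[j]))), axis=1
        ).mean()
table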

The example input is

     a    b    c     d
0    [a]  [a]  [a]   [aaa, b]
1    [b]  [b]  [ee]  [bbb, a]
2    [a]  [a]  [c]   [ccc, c]
3    [a]  [a]  [a]   [a]
4    [b]  [b]  [b]   [b]
5    [a]  [a]  [a]   [a]

The output is

     0    1    2         3
0    1.0  1.0  0.666667  0.500000
1    0.0  1.0  0.666667  0.500000
2    0.0  0.0  1.000000  0.666667
3    0.0  0.0  0.000000  1.500000

My current approach is as follows:

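def func(row, col1, col2) -> int:
    # Size of the intersection of the two token lists in this row.
    list1, list2 = row[col1], row[col2]
    return len(set(list1).intersection(list2))

# Assumed setup (not shown in the original snippet).
colnames = list(dta.columns)
col_num = len(colnames)
a = pd.DataFrame(np.zeros((col_num, col_num)))

for col_id, col in enumerate(colnames):
    for tgt_col_id in range(col_id, col_num):
        a.loc[col_id, tgt_col_id] = dta.apply(
            func, args=(col, colnames[tgt_col_id]), axis=1
        ).mean()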

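My thinking is that I could probably speed up the column loop with multiprocessing, since the pairwise operations are independent of each other. But is there a numpy / pandas way to speed up the operation between two columns?

Any ideas for speeding this up would be appreciated!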
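
One inexpensive change to the per-pair step is to build each cell's set once up front and then walk the two columns with zip, instead of calling DataFrame.apply with axis=1 for every pair. The sketch below is plain Python on the toy data from the question; the function name pairwise_mean_intersections and the column-major dict layout are illustrative assumptions, and it has not been timed on data of the size described.

```python
from itertools import combinations_with_replacement

# Toy data from the question, already split into token lists (one list per column).
columns = {
    "a": [["a"], ["b"], ["a"], ["a"], ["b"], ["a"]],
    "b": [["a"], ["b"], ["a"], ["a"], ["b"], ["a"]],
    "c": [["a"], ["ee"], ["c"], ["a"], ["b"], ["a"]],
    "d": [["aaa", "b"], ["bbb", "a"], ["ccc", "c"], ["a"], ["b"], ["a"]],
}


def pairwise_mean_intersections(columns):
    """Return {(i, j): mean intersection size} for every column pair with i <= j."""
    # Build every set exactly once, not once per pair and per row.
    col_sets = [[set(cell) for cell in col] for col in columns.values()]
    n_rows = len(col_sets[0])
    result = {}
    for i, j in combinations_with_replacement(range(len(col_sets)), 2):
        total = sum(len(s & t) for s, t in zip(col_sets[i], col_sets[j]))
        result[(i, j)] = total / n_rows
    return result


table = pairwise_mean_intersections(columns)
```

With a DataFrame of lists, the input dict could be built as {c: dta[c].tolist() for c in dta.columns}.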

Answer 1

I'm not sure how much of a speedup this will give on your full data, but this code eliminates the explicit loops.

from itertools import combinations_with_replacement
import pandas as pd
import numpy as np

dta = pd.DataFrame(
    {
        "a" : ["a", "b", "a", "a", "b", "a"],
        "b" : ["a", "b", "a","a", "b", "a"],
        "c" : ["a", "ee", "c","a", "b", "a"],
        "d" : ["aaa b", "bbb a", "ccc c","a", "b", "a"]
    }
)
dta = dta.applymap(lambda x : x.split() )

table = np.zeros((4,4))
iter_arr = np.array(
    list(combinations_with_replacement(range(table.shape[1]), 2))
)
# code above creates array of column combinations:
# array([[0, 0],
#        [0, 1],
#        [0, 2],
#        [0, 3],
#        [1, 1],
#        [1, 2],
#        [1, 3],
#        [2, 2],
#        [2, 3],
#        [3, 3]])

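def mean_of_set_intersects(x, y):
    # x and y are 2-D object arrays of lists, one row per column pair.
    # Convert every cell to a set.
    f = lambda a: set(a)
    vset = np.vectorize(f)
    x_set = vset(x)
    y_set = vset(y)

    # Element-wise intersection size of the paired cells.
    f2 = lambda x, y: len(x.intersection(y))
    vlenint = np.vectorize(f2)

    # Average over the data rows (axis 1) for each column pair.
    return np.mean(vlenint(x_set, y_set), axis=1)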

table[iter_arr[:,0],iter_arr[:,1]] = (
    mean_of_set_intersects(
        dta.values.transpose()[iter_arr][:,0,:],
        dta.values.transpose()[iter_arr][:,1,:]
    )
)
pd.DataFrame(table)
#      0    1    2         3
# 0    1.0  1.0  0.666667  0.500000
# 1    0.0  1.0  0.666667  0.500000
# 2    0.0  0.0  1.000000  0.666667
# 3    0.0  0.0  0.000000  1.500000

This answer was used as a reference: https://stackoverflow.com/a/49821744/9987623
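
Another way to look at the problem, offered as a sketch and not as a measured improvement: if each cell is treated as a set, then the intersection size summed over rows equals the number of (row, token) pairs that two columns share. Encoding those pairs as a 0/1 indicator matrix X turns the whole table into a single matrix product, X.T @ X / n_rows. The variable names below (long, feature_id, col_id, X) are illustrative. The dense X used here is an assumption that only suits small inputs; at 100,000 rows a sparse matrix type would probably be needed, which is not shown.

```python
import numpy as np
import pandas as pd

dta = pd.DataFrame(
    {
        "a": ["a", "b", "a", "a", "b", "a"],
        "b": ["a", "b", "a", "a", "b", "a"],
        "c": ["a", "ee", "c", "a", "b", "a"],
        "d": ["aaa b", "bbb a", "ccc c", "a", "b", "a"],
    }
)
dta = dta.apply(lambda col: col.str.split())  # every cell becomes a list of tokens

# Long format: one line per (row, column, token); duplicates dropped for set semantics.
long = (
    dta.rename_axis("row")
    .reset_index()
    .melt(id_vars="row", var_name="col", value_name="token")
    .explode("token")
    .dropna(subset=["token"])  # empty lists explode to NaN
    .drop_duplicates()
    .reset_index(drop=True)
)

# Integer id for each distinct (row, token) pair and for each original column.
feature_id = long.groupby(["row", "token"]).ngroup().to_numpy()
col_id = pd.Categorical(long["col"], categories=dta.columns).codes

# X[f, c] == 1 when column c contains that token in that row.
X = np.zeros((feature_id.max() + 1, dta.shape[1]), dtype=np.int64)
X[feature_id, col_id] = 1

# (X.T @ X)[i, j] counts the (row, token) pairs shared by columns i and j.
table = (X.T @ X) / len(dta)
```

Unlike the loop versions, this fills both triangles of the symmetric table; np.triu(table) gives the upper-triangular layout shown in the question.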

Solution 1
0 2021-04-18 03:19:50
