How to speed up pair-wise operations in numpy / pandas
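I currently have a table dta with more than 100,000 rows and more than 100 columns, where each cell dta[i, j] is a list of elements. My goal is to compute a symmetric table a where
a[i, j] = mean([len(intersect(dta[k, i], dta[k, j])) for every row k])
i.e. for every pair of columns, compute the size of the intersection row by row, then take the mean over all rows.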
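To make the definition concrete, here is a minimal sketch in plain Python that computes a single entry of the table for two columns of token lists. The names col_c and col_d are illustrative only; their values mirror columns c and d of the example in this question.

```python
# Two columns, one list of tokens per row (illustrative data).
col_c = [["a"], ["ee"], ["c"], ["a"], ["b"], ["a"]]
col_d = [["aaa", "b"], ["bbb", "a"], ["ccc", "c"], ["a"], ["b"], ["a"]]

# Size of the set intersection, computed row by row.
sizes = [len(set(x) & set(y)) for x, y in zip(col_c, col_d)]

# The table entry for this column pair is the mean over all rows.
entry = sum(sizes) / len(sizes)

print(sizes)  # [0, 0, 1, 1, 1, 1]
print(entry)
```

Only the last four rows share a token, so the entry is 4/6.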
Simple code to create an example:
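import numpy as np
import pandas as pd

dta = pd.DataFrame(
    {
        "a": ["a", "b", "a", "a", "b", "a"],
        "b": ["a", "b", "a", "a", "b", "a"],
        "c": ["a", "ee", "c", "a", "b", "a"],
        "d": ["aaa b", "bbb a", "ccc c", "a", "b", "a"],
    }
)
# Split every cell into a list of tokens
# (on pandas >= 2.1, DataFrame.map is the non-deprecated spelling of applymap).
dta = dta.applymap(lambda x: x.split())

table = pd.DataFrame(np.zeros((4, 4)))
for i in range(4):
    for j in range(i, 4):
        # x is one row; .iloc selects the i-th and j-th columns by position
        table.iloc[i, j] = dta.apply(
            lambda x: len(set(x.iloc[i]).intersection(set(x.iloc[j]))), axis=1
        ).mean()
table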
The example input is
     a    b     c         d
0  [a]  [a]   [a]  [aaa, b]
1  [b]  [b]  [ee]  [bbb, a]
2  [a]  [a]   [c]  [ccc, c]
3  [a]  [a]   [a]       [a]
4  [b]  [b]   [b]       [b]
5  [a]  [a]   [a]       [a]
The output is
0 1 2 3
0 1.0 1.0 0.666667 0.500000
1 0.0 1.0 0.666667 0.500000
2 0.0 0.0 1.000000 0.666667
3 0.0 0.0 0.000000 1.500000
My current approach is as follows:
# `a` (the result table), `colnames` and `col_num` are defined elsewhere (not shown).
def func(row, col1, col2) -> int:
    list1, list2 = row[col1], row[col2]
    return len(set(list1).intersection(list2))

for col_id, col in enumerate(colnames):
    for tgt_col_id in range(col_id, col_num):
        a.loc[col_id, tgt_col_id] = dta.apply(
            func, args=(col, colnames[tgt_col_id]), axis=1
        ).mean()
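My thinking is that I could probably speed up the column loop with multiprocessing, since the operations on different column pairs are independent of one another. But is there a numpy / pandas way to speed up the operation between two columns?
Any ideas for speeding this up would be helpful!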
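Not sure how much of a speed-up this will give on your full data, but this code eliminates the explicit for loops (np.vectorize still iterates element by element internally).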
from itertools import combinations_with_replacement
import pandas as pd
import numpy as np
dta = pd.DataFrame(
    {
        "a": ["a", "b", "a", "a", "b", "a"],
        "b": ["a", "b", "a", "a", "b", "a"],
        "c": ["a", "ee", "c", "a", "b", "a"],
        "d": ["aaa b", "bbb a", "ccc c", "a", "b", "a"],
    }
)
# On pandas >= 2.1, DataFrame.map is the non-deprecated spelling of applymap.
dta = dta.applymap(lambda x: x.split())
table = np.zeros((4, 4))
iter_arr = np.array(
    list(combinations_with_replacement(range(table.shape[1]), 2))
)
# code above creates array of column combinations:
# array([[0, 0],
# [0, 1],
# [0, 2],
# [0, 3],
# [1, 1],
# [1, 2],
# [1, 3],
# [2, 2],
# [2, 3],
# [3, 3]])
def mean_of_set_intersects(x, y):
    # x and y are object arrays of lists with shape (n_pairs, n_rows)
    f = lambda a: set(a)
    vset = np.vectorize(f)
    x_set = vset(x)
    y_set = vset(y)
    f2 = lambda x, y: len(x.intersection(y))
    vlenint = np.vectorize(f2)
    # intersection size per (pair, row), averaged over the rows
    return np.mean(vlenint(x_set, y_set), axis=1)

table[iter_arr[:, 0], iter_arr[:, 1]] = (
    mean_of_set_intersects(
        dta.values.transpose()[iter_arr][:, 0, :],
        dta.values.transpose()[iter_arr][:, 1, :]
    )
)
pd.DataFrame(table)
# 0 1 2 3
# 0 1.0 1.0 0.666667 0.500000
# 1 0.0 1.0 0.666667 0.500000
# 2 0.0 0.0 1.000000 0.666667
# 3 0.0 0.0 0.000000 1.500000
This answer was used as a reference: https://stackoverflow.com/a/49821744/9987623
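One further idea, offered as a sketch rather than a benchmarked solution: because the per-row intersection sizes are only ever summed, each column can be collapsed into a single set of (row index, token) pairs, so that every column pair needs one set intersection instead of one per row. The function name pairwise_mean_intersections and the plain dict-of-lists input are assumptions made for illustration; with a DataFrame, each column would first be converted to a list of token lists.

```python
from itertools import combinations_with_replacement


def pairwise_mean_intersections(columns):
    """columns: dict mapping column name -> list of token lists, one per row.

    Returns an upper-triangular table (list of lists) of mean intersection sizes.
    """
    names = list(columns)
    n_rows = len(columns[names[0]])

    # Collapse each column into one set of (row index, token) pairs.
    # Duplicate tokens inside a cell collapse, just as set(cell) would do.
    keyed = [
        {(k, tok) for k, cell in enumerate(columns[name]) for tok in cell}
        for name in names
    ]

    n_cols = len(names)
    table = [[0.0] * n_cols for _ in range(n_cols)]
    for i, j in combinations_with_replacement(range(n_cols), 2):
        # The size of this intersection equals the sum over rows of the
        # per-row intersection sizes, so dividing by n_rows gives the mean.
        table[i][j] = len(keyed[i] & keyed[j]) / n_rows
    return table


raw = {
    "a": ["a", "b", "a", "a", "b", "a"],
    "b": ["a", "b", "a", "a", "b", "a"],
    "c": ["a", "ee", "c", "a", "b", "a"],
    "d": ["aaa b", "bbb a", "ccc c", "a", "b", "a"],
}
columns = {name: [s.split() for s in cells] for name, cells in raw.items()}
table = pairwise_mean_intersections(columns)
for row in table:
    print(row)
```

On the six-row example this gives the same upper-triangular values as the table above; how it scales to 100,000 rows has not been measured here.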