[英]Efficient way to sum all possible pairs
I have a dataframe that looks like this: 我有一个看起来像这样的数据框:
from random import randint
import pandas as pd
df = pd.DataFrame({"ID": ["a", "b", "c", "d", "e", "f", "g"],
"Size": [randint(0,9) for i in range(0,7)]})
df
ID Size
0 a 4
1 b 3
2 c 0
3 d 2
4 e 9
5 f 5
6 g 3
And what I would like to obtain is this (could be a matrix as well): 我想获得的是这个(也可以是矩阵):
sums_df
a b c d e f g
a 8.0 7.0 4.0 6.0 13.0 9.0 7.0
b 7.0 6.0 3.0 5.0 12.0 8.0 6.0
c 4.0 3.0 0.0 2.0 9.0 5.0 3.0
d 6.0 5.0 2.0 4.0 11.0 7.0 5.0
e 13.0 12.0 9.0 11.0 18.0 14.0 12.0
f 9.0 8.0 5.0 7.0 14.0 10.0 8.0
g 7.0 6.0 3.0 5.0 12.0 8.0 6.0
That is, the sum of Size
values for all possible pairs in ID
. 也就是说,
ID
所有可能的对的Size
值之和。
For now I have this simple but unefficient code: 现在,我有这个简单但效率不高的代码:
sums_df = pd.DataFrame()
for i in range(len(df)):
for j in range(len(df)):
sums_df.loc[i,j] = df.Size[i] + df.Size[j]
sums_df.index = list(df.ID)
sums_df.columns = list(df.ID)
It works fine for small examples like this, but for my actual data it gets too long and I am sure it is possible to avoid the nested for
loops. 对于像这样的小示例,它工作得很好,但是对于我的实际数据来说,它太长了,我确信可以避免嵌套的
for
循环。 Can you think of a better way to do this ? 您能想到一种更好的方法吗?
Thanks for any help ! 谢谢你的帮助 !
use np.add.outer() : 使用np.add.outer() :
In [65]: pd.DataFrame(np.add.outer(df['Size'], df['Size']),
columns=df['ID'].values,
index=df['ID'].values)
Out[65]:
a b c d e f g
a 8 7 4 6 13 9 7
b 7 6 3 5 12 8 6
c 4 3 0 2 9 5 3
d 6 5 2 4 11 7 5
e 13 12 9 11 18 14 12
f 9 8 5 7 14 10 8
g 7 6 3 5 12 8 6
UPDATE: memory-saving (Pandas Multi-Index) approach (NOTE: this approach is much slower, compared to the previous one): 更新:节省内存(熊猫多索引)方法(注意:与前一种方法相比,此方法要慢得多):
In [33]: r = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
...: index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
...: columns=['TotalSize']
...: )
In [34]: r
Out[34]:
TotalSize
a b 7
c 4
d 6
e 13
f 9
g 7
b c 3
d 5
e 12
f 8
g 6
c d 2
e 9
f 5
g 3
d e 11
f 7
g 5
e f 14
g 12
f g 8
It can be accessed as follows: 可以按以下方式访问它:
In [41]: r.loc[('a','b')]
Out[41]:
TotalSize 7
Name: (a, b), dtype: int32
In [42]: r.loc[('a','b'), 'TotalSize']
Out[42]: 7
In [44]: r.loc[[('a','b'), ('c','d')], 'TotalSize']
Out[44]:
a b 7
c d 2
Name: TotalSize, dtype: int32
In [43]: r.at[('a','b'), 'TotalSize']
Out[43]: 7
Memory usage comparison (DF shape: 7000x3
): 内存使用情况比较(DF形状:
7000x3
):
In [65]: df = pd.concat([df] * 1000, ignore_index=True)
In [66]: df.shape
Out[66]: (7000, 2)
In [67]: r1 = pd.DataFrame(np.add.outer(df['Size'], df['Size']),
...: columns=df['ID'].values,
...: index=df['ID'].values)
...:
In [68]: r2 = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
...: index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
...: columns=['TotalSize'])
...:
In [69]: r1.memory_usage().sum()/r2.memory_usage().sum()
Out[69]: 2.6685407829018244
Speed comparison (DF shape: 7000x3
): 速度比较(DF形状:
7000x3
):
In [70]: %%timeit
...: r1 = pd.DataFrame(np.add.outer(df['Size'], df['Size']),
...: columns=df['ID'].values,
...: index=df['ID'].values)
...:
180 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [71]: %%timeit
...: r2 = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
...: index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
...: columns=['TotalSize'])
...:
17 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use Numpy's broadcasting 使用Numpy的广播
size = df.Size.values
ids = df.ID.values
pd.DataFrame(
size[:, None] + size,
ids, ids
)
a b c d e f g
a 8 7 4 6 13 9 7
b 7 6 3 5 12 8 6
c 4 3 0 2 9 5 3
d 6 5 2 4 11 7 5
e 13 12 9 11 18 14 12
f 9 8 5 7 14 10 8
g 7 6 3 5 12 8 6
Or something like .values
and .values.T
或类似
.values
和.values.T
东西
df1=df.set_index('ID')
df1.values+df1.values.T
Out[626]:
array([[ 8, 7, 4, 6, 13, 9, 7],
[ 7, 6, 3, 5, 12, 8, 6],
[ 4, 3, 0, 2, 9, 5, 3],
[ 6, 5, 2, 4, 11, 7, 5],
[13, 12, 9, 11, 18, 14, 12],
[ 9, 8, 5, 7, 14, 10, 8],
[ 7, 6, 3, 5, 12, 8, 6]], dtype=int64)
More info : 更多信息 :
pd.DataFrame(data=df1.values+df1.values.T,index=df.index,columns=df.index)
Out[627]:
ID a b c d e f g
ID
a 8 7 4 6 13 9 7
b 7 6 3 5 12 8 6
c 4 3 0 2 9 5 3
d 6 5 2 4 11 7 5
e 13 12 9 11 18 14 12
f 9 8 5 7 14 10 8
g 7 6 3 5 12 8 6
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.