
Efficient way to sum all possible pairs

I have a dataframe that looks like this:

from random import randint
import pandas as pd

df = pd.DataFrame({"ID": ["a", "b", "c", "d", "e", "f", "g"], 
                   "Size": [randint(0,9) for i in range(0,7)]})

df

  ID  Size
0  a     4
1  b     3
2  c     0
3  d     2
4  e     9
5  f     5
6  g     3

What I would like to obtain is this (it could be a matrix as well):

sums_df

      a     b    c     d     e     f     g
a   8.0   7.0  4.0   6.0  13.0   9.0   7.0
b   7.0   6.0  3.0   5.0  12.0   8.0   6.0
c   4.0   3.0  0.0   2.0   9.0   5.0   3.0
d   6.0   5.0  2.0   4.0  11.0   7.0   5.0
e  13.0  12.0  9.0  11.0  18.0  14.0  12.0
f   9.0   8.0  5.0   7.0  14.0  10.0   8.0
g   7.0   6.0  3.0   5.0  12.0   8.0   6.0

That is, the sum of the Size values for every possible pair of IDs.

For now I have this simple but inefficient code:

sums_df = pd.DataFrame()

for i in range(len(df)):
    for j in range(len(df)):
        sums_df.loc[i,j] = df.Size[i] + df.Size[j]

sums_df.index = list(df.ID)
sums_df.columns = list(df.ID)

It works fine for small examples like this one, but on my actual data it takes too long, and I am sure the nested for loops can be avoided. Can you think of a better way to do this?

Thanks for any help!

Use np.add.outer() (this needs import numpy as np):

In [65]: pd.DataFrame(np.add.outer(df['Size'], df['Size']),
                      columns=df['ID'].values,
                      index=df['ID'].values)
Out[65]:
    a   b  c   d   e   f   g
a   8   7  4   6  13   9   7
b   7   6  3   5  12   8   6
c   4   3  0   2   9   5   3
d   6   5  2   4  11   7   5
e  13  12  9  11  18  14  12
f   9   8  5   7  14  10   8
g   7   6  3   5  12   8   6
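For reference, here is a self-contained sketch of the same idea, including the imports the session above leaves out. The sizes are hard-coded to the values from the question's sample output instead of being drawn at random, so the result is reproducible. The columns are converted to NumPy arrays first, on the assumption that some newer pandas versions may not accept a Series in np.add.outer directly. The names sizes and sums_df are only illustrative.

```python
import numpy as np
import pandas as pd

# Fixed sizes taken from the question's sample output (no randomness).
df = pd.DataFrame({"ID": list("abcdefg"),
                   "Size": [4, 3, 0, 2, 9, 5, 3]})

# Work on plain NumPy arrays rather than Series.
sizes = df["Size"].to_numpy()
ids = df["ID"].to_numpy()

# np.add.outer(x, y)[i, j] == x[i] + y[j], i.e. the full pairwise-sum table.
sums_df = pd.DataFrame(np.add.outer(sizes, sizes), index=ids, columns=ids)

print(sums_df.loc["a", "e"])  # 4 + 9
```

Because the table is built in a single vectorized call, no Python-level loop over the pairs is needed.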

UPDATE: a memory-saving approach using a pandas MultiIndex, which stores each unordered pair only once and omits the self-pairs (this needs from itertools import combinations). NOTE: this approach is much slower than the previous one:

In [33]: r = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
    ...:                  index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
    ...:                  columns=['TotalSize']
    ...: )

In [34]: r
Out[34]:
     TotalSize
a b          7
  c          4
  d          6
  e         13
  f          9
  g          7
b c          3
  d          5
  e         12
  f          8
  g          6
c d          2
  e          9
  f          5
  g          3
d e         11
  f          7
  g          5
e f         14
  g         12
f g          8

It can be accessed as follows:

In [41]: r.loc[('a','b')]
Out[41]:
TotalSize    7
Name: (a, b), dtype: int32

In [42]: r.loc[('a','b'), 'TotalSize']
Out[42]: 7

In [44]: r.loc[[('a','b'), ('c','d')], 'TotalSize']
Out[44]:
a  b    7
c  d    2
Name: TotalSize, dtype: int32

In [43]: r.at[('a','b'), 'TotalSize']
Out[43]: 7
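As a possible variation (not the answer's original method), the same pair table can be sketched with np.triu_indices instead of itertools.combinations, which avoids building Python lists of tuples. This assumes the fixed sizes from the question's sample output; the names i, j, ids and sizes are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": list("abcdefg"),
                   "Size": [4, 3, 0, 2, 9, 5, 3]})

ids = df["ID"].to_numpy()
sizes = df["Size"].to_numpy()

# Positions of the strict upper triangle: every pair (i, j) with i < j,
# listed row by row, which matches the order of itertools.combinations.
i, j = np.triu_indices(len(df), k=1)

# One row per unordered pair, labeled by a two-level index (first ID, second ID).
r = pd.DataFrame({"TotalSize": sizes[i] + sizes[j]},
                 index=pd.MultiIndex.from_arrays([ids[i], ids[j]]))

print(r.loc[("a", "b"), "TotalSize"])  # 4 + 3
```

The result has n*(n-1)/2 rows (21 for the 7 IDs here), and the access patterns shown above apply to it unchanged.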

Memory usage comparison (DF shape: 7000x2):

In [65]: df = pd.concat([df] * 1000, ignore_index=True)

In [66]: df.shape
Out[66]: (7000, 2)

In [67]: r1 = pd.DataFrame(np.add.outer(df['Size'], df['Size']),
    ...:                       columns=df['ID'].values,
    ...:                       index=df['ID'].values)
    ...:

In [68]: r2 = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
    ...:                  index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
    ...:                  columns=['TotalSize'])
    ...:

In [69]: r1.memory_usage().sum()/r2.memory_usage().sum()
Out[69]: 2.6685407829018244

Speed comparison (DF shape: 7000x2):

In [70]: %%timeit
    ...: r1 = pd.DataFrame(np.add.outer(df['Size'], df['Size']),
    ...:                       columns=df['ID'].values,
    ...:                       index=df['ID'].values)
    ...:
180 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [71]: %%timeit
    ...: r2 = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
    ...:                  index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
    ...:                  columns=['TotalSize'])
    ...:
17 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Use NumPy's broadcasting:

size = df.Size.values
ids = df.ID.values

pd.DataFrame(
    size[:, None] + size,
    ids, ids
)

    a   b  c   d   e   f   g
a   8   7  4   6  13   9   7
b   7   6  3   5  12   8   6
c   4   3  0   2   9   5   3
d   6   5  2   4  11   7   5
e  13  12  9  11  18  14  12
f   9   8  5   7  14  10   8
g   7   6  3   5  12   8   6
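To make the broadcasting step more concrete, the following sketch prints the intermediate shapes. It assumes the fixed sizes from the question's sample output; col and sums are illustrative names.

```python
import numpy as np

size = np.array([4, 3, 0, 2, 9, 5, 3])

# Adding a new axis turns the 1-D array into a column vector.
col = size[:, None]
print(col.shape, size.shape)  # (7, 1) (7,)

# Broadcasting stretches (7, 1) and (7,) to a common (7, 7) shape,
# so entry [i, j] is size[i] + size[j].
sums = col + size
print(sums.shape)  # (7, 7)

# Same values as the ufunc-based outer sum.
print(np.array_equal(sums, np.add.outer(size, size)))  # True
```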

Or something like .values and .values.T:

df1=df.set_index('ID')
df1.values+df1.values.T
Out[626]: 
array([[ 8,  7,  4,  6, 13,  9,  7],
       [ 7,  6,  3,  5, 12,  8,  6],
       [ 4,  3,  0,  2,  9,  5,  3],
       [ 6,  5,  2,  4, 11,  7,  5],
       [13, 12,  9, 11, 18, 14, 12],
       [ 9,  8,  5,  7, 14, 10,  8],
       [ 7,  6,  3,  5, 12,  8,  6]], dtype=int64)

To get a labeled DataFrame, use the ID index of df1 for both rows and columns:

pd.DataFrame(data=df1.values+df1.values.T,index=df1.index,columns=df1.index)
Out[627]: 
ID   a   b  c   d   e   f   g
ID                           
a    8   7  4   6  13   9   7
b    7   6  3   5  12   8   6
c    4   3  0   2   9   5   3
d    6   5  2   4  11   7   5
e   13  12  9  11  18  14  12
f    9   8  5   7  14  10   8
g    7   6  3   5  12   8   6

