
Calculate sum based on multiple rows from list column for each row in pandas dataframe

I have a dataframe that looks something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(5), 'col_to_sum': np.random.rand(5), 'list_col': [[], [1], [1,2,3], [2], [3,1]]})
    
    id  col_to_sum  list_col
0   0   0.557736    []
1   1   0.147333    [1]
2   2   0.538681    [1, 2, 3]
3   3   0.040329    [2]
4   4   0.984439    [3, 1]

In reality I have more columns and ~30000 rows, but the extra columns are irrelevant here. Note that all the list elements come from the id column, and that the id column is not necessarily the same as the index.

I want to make a new column that, for each row, sums the values in col_to_sum corresponding to the ids in list_col. In this example that would be:

    id  col_to_sum  list_col    sum
0   0   0.557736    []          0.000000
1   1   0.147333    [1]         0.147333
2   2   0.538681    [1, 2, 3]   0.726343
3   3   0.040329    [2]         0.538681
4   4   0.984439    [3, 1]      0.187662

I have found a way to do this, but it requires looping through the entire dataframe and takes about 6 minutes on the full dataframe with ~30000 rows. This is what I came up with:

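df['sum'] = 0.0  # float, so the float sums assigned below match the column dtype

for i in range(len(df)):
    # Rows whose id appears in the current row's list
    mask = df['id'].isin(df['list_col'].iloc[i])
    # df.loc[i, ...] assumes the default 0..n-1 index
    df.loc[i, 'sum'] = df.loc[mask, 'col_to_sum'].sum()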

Ideally I would like a vectorized way to do this, but I haven't been able to find one. Any help is greatly appreciated.

I'm using non-random values in this demo because they're easier to reproduce and check.

I'm also using an id column ( [0, 1, 3, 2, 4] ) that is not identical to the index.

Setup:

>>> df = pd.DataFrame({'id': [0, 1, 3, 2, 4], 'col_to_sum': [1, 2, 3, 4, 5], 'list_col': [[], [1], [1, 2, 3], [2], [3, 1]]})
>>> df
   id  col_to_sum   list_col
0   0           1         []
1   1           2        [1]
2   3           3  [1, 2, 3]
3   2           4        [2]
4   4           5     [3, 1]

Solution:

df = df.set_index('id')
df['sum'] = df['list_col'].apply(lambda l: df.loc[l, 'col_to_sum'].sum())
df = df.reset_index()

Output:

>>> df
   id  col_to_sum   list_col  sum
0   0           1         []    0
1   1           2        [1]    2
2   3           3  [1, 2, 3]    9
3   2           4        [2]    4
4   4           5     [3, 1]    5
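
The apply above still runs one Python-level lookup per row. As a possible way to push more of the work into pandas, here is a minimal sketch using explode, map and groupby on the same setup frame. It is not part of the original answer; the names lookup, exploded and sums are made up for illustration, and it assumes the ids are unique and that every list element is an existing id.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 3, 2, 4],
    'col_to_sum': [1, 2, 3, 4, 5],
    'list_col': [[], [1], [1, 2, 3], [2], [3, 1]],
})

# Lookup table: id -> value to sum (assumes ids are unique).
lookup = df.set_index('id')['col_to_sum']

# One row per (original row, referenced id). Empty lists explode to NaN,
# so drop those and restore the integer dtype of the ids.
exploded = df['list_col'].explode().dropna().astype(lookup.index.dtype)

# Translate each referenced id into its value, then add the values up
# per original row (the exploded Series keeps the original index).
sums = exploded.map(lookup).groupby(level=0).sum()

# Rows with an empty list have no entry in sums, so fill them with 0.
df['sum'] = sums.reindex(df.index, fill_value=0)
```

This keeps the original index throughout, so no set_index/reset_index round trip on df itself is needed. Whether it is actually faster on the ~30000-row frame depends on how long the lists are, so it is worth timing both versions.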

You can use a lambda function that takes each list in list_col, selects the corresponding rows of col_to_sum with iloc, and sums them. Note that iloc selects by position, so this matches the question only when each id equals its row position.

df['sum_col'] = df['list_col'].apply(lambda x : df['col_to_sum'].iloc[x].sum())
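
Because iloc looks rows up by position while the lists hold id values, the two can disagree when the id column is not 0..n-1 in order. A small comparison on the setup frame with ids [0, 1, 3, 2, 4] illustrates this; the names by_position, values_by_id and by_id are only for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 3, 2, 4],
    'col_to_sum': [1, 2, 3, 4, 5],
    'list_col': [[], [1], [1, 2, 3], [2], [3, 1]],
})

# Positional lookup: each list entry is treated as a row position.
by_position = df['list_col'].apply(lambda x: df['col_to_sum'].iloc[x].sum())

# Label lookup: each list entry is treated as a value of the id column.
values_by_id = df.set_index('id')['col_to_sum']
by_id = df['list_col'].apply(lambda x: values_by_id.loc[x].sum())

print(by_position.tolist())  # [0, 2, 9, 3, 6]
print(by_id.tolist())        # [0, 2, 9, 4, 5]
```

The last two rows differ because ids 2 and 3 sit at positions 3 and 2. If the ids in your data always equal the row positions, the iloc version gives the same result.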
