Calculate sum based on multiple rows from list column for each row in pandas dataframe
I have a dataframe that looks something like this:
df = pd.DataFrame({'id': range(5), 'col_to_sum': np.random.rand(5), 'list_col': [[], [1], [1,2,3], [2], [3,1]]})
   id  col_to_sum   list_col
0   0    0.557736         []
1   1    0.147333        [1]
2   2    0.538681  [1, 2, 3]
3   3    0.040329        [2]
4   4    0.984439     [3, 1]
In reality I have more columns and ~30000 rows, but the extra columns are irrelevant for this. Note that all the list elements are from the id column and that the id column is not necessarily the same as the index.
I want to make a new column that, for each row, sums the values in col_to_sum corresponding to the ids in list_col. In this example that would be:
   id  col_to_sum   list_col       sum
0   0    0.557736         []  0.000000
1   1    0.147333        [1]  0.147333
2   2    0.538681  [1, 2, 3]  0.726343
3   3    0.040329        [2]  0.538681
4   4    0.984439     [3, 1]  0.187662
I have found a way to do this, but it requires looping through the entire dataframe and is quite slow on the larger df with ~30000 rows (~6 min). The way I found was:
df['sum'] = 0.0
for i in range(len(df)):
    # Rows whose id appears in the i-th row's list
    mask = df['id'].isin(df['list_col'].iloc[i])
    df.loc[i, 'sum'] = df.loc[mask, 'col_to_sum'].sum()
Ideally I would want a vectorized way to do this, but I haven't been able to find one. Any help is greatly appreciated.
I'm using non-random values in this demo because they're easier to reproduce and check. I'm also using an id column ([0, 1, 3, 2, 4]) that is not identical to the index.
Setup:
>>> df = pd.DataFrame({'id': [0, 1, 3, 2, 4], 'col_to_sum': [1, 2, 3, 4, 5], 'list_col': [[], [1], [1, 2, 3], [2], [3, 1]]})
>>> df
   id  col_to_sum   list_col
0   0           1         []
1   1           2        [1]
2   3           3  [1, 2, 3]
3   2           4        [2]
4   4           5     [3, 1]
Solution:
df = df.set_index('id')
df['sum'] = df['list_col'].apply(lambda l: df.loc[l, 'col_to_sum'].sum())
df = df.reset_index()
Output:
>>> df
   id  col_to_sum   list_col  sum
0   0           1         []    0
1   1           2        [1]    2
2   3           3  [1, 2, 3]    9
3   2           4        [2]    4
4   4           5     [3, 1]    5
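The solution above still calls a Python lambda once per row. As a possible alternative that stays closer to vectorized pandas operations, here is a minimal sketch that explodes list_col, looks up each referenced id, and sums per original row. It assumes the ids are unique and that every list element is a valid id; the names id_to_value, exploded, and sums are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 3, 2, 4],
    'col_to_sum': [1, 2, 3, 4, 5],
    'list_col': [[], [1], [1, 2, 3], [2], [3, 1]],
})

# Lookup Series: index is the id, values are col_to_sum (assumes unique ids).
id_to_value = df.set_index('id')['col_to_sum']

# One row per (original row, referenced id). Empty lists explode to NaN,
# so drop those and restore an integer dtype before the lookup.
exploded = df['list_col'].explode().dropna().astype('int64')

# Replace each referenced id by its value, then sum per original row label.
sums = exploded.map(id_to_value).groupby(level=0).sum()

# Rows with an empty list are missing from sums; fill them with 0.
df['sum'] = sums.reindex(df.index, fill_value=0)
print(df)
```

On the demo frame this gives the same sum column as the apply-based solution. Whether it is actually faster on ~30000 rows depends on the list lengths, so it is worth timing both on the real data.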
You can apply a lambda to list_col that uses iloc to select the corresponding entries of col_to_sum and sums them. Because iloc is positional, this only matches the ids when each id equals its row position:
df['sum_col'] = df['list_col'].apply(lambda x : df['col_to_sum'].iloc[x].sum())
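If the ids do not line up with row positions, a lookup keyed by id avoids the positional assumption. The following is a minimal sketch, not a benchmarked solution: it builds a plain dict from id to value and sums over each list. It assumes unique ids and uses the demo values with ids [0, 1, 3, 2, 4]; value_by_id is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 3, 2, 4],
    'col_to_sum': [1, 2, 3, 4, 5],
    'list_col': [[], [1], [1, 2, 3], [2], [3, 1]],
})

# Plain dict from id to value, so lookups go by id rather than row position.
value_by_id = dict(zip(df['id'].tolist(), df['col_to_sum'].tolist()))

# Sum the looked-up values for each row's list; an empty list sums to 0.
df['sum_col'] = [sum(value_by_id[i] for i in ids) for ids in df['list_col']]
print(df)
```

This does not require changing the index of df, and a KeyError signals a list element that is not present in the id column.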