How to calculate a groupwise cumulative sum for multiple columns in Python

Question

I have a data set like this:

import pandas as pd

data = pd.DataFrame({'id': pd.Series([1,1,1,2,2,3,3,3]), 'var1': pd.Series([1,2,3,4,5,6,7,8]), 'var2': pd.Series([11,12,13,14,15,16,17,18]),
                     'var3': pd.Series([21,22,23,24,25,26,27,28])})

Here I need to calculate a groupwise cumulative sum for all columns (var1, var2, var3) based on id. How can I write Python code to create output as per my requirement?

Thanks in advance.

Answer 1

If I have understood you correctly, you can use DataFrame.groupby to calculate the cumulative sum across columns, grouped by your 'id' column. Something like:

import pandas as pd
data=pd.DataFrame({'id':[1,1,1,2,2,3,3,3],'var1':[1,2,3,4,5,6,7,8],'var2':[11,12,13,14,15,16,17,18], 'var3':[21,22,23,24,25,26,27,28]})
data.groupby('id').apply(lambda x: x.drop('id', axis=1).cumsum(axis=1).sum())
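
If the goal is instead a running total down the rows within each id, for every value column, pandas also provides a cumsum method on grouped data. The following is a minimal sketch under that reading of the question; the names cols, cumulative and result are only illustrative and do not come from the original answer.

```python
import pandas as pd

data = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3],
                     'var1': [1, 2, 3, 4, 5, 6, 7, 8],
                     'var2': [11, 12, 13, 14, 15, 16, 17, 18],
                     'var3': [21, 22, 23, 24, 25, 26, 27, 28]})

# Value columns to accumulate (illustrative name, not from the original post)
cols = ['var1', 'var2', 'var3']

# Running total of each value column, restarting at every new id
cumulative = data.groupby('id')[cols].cumsum()

# Put the id column back beside the running totals
result = pd.concat([data[['id']], cumulative], axis=1)
print(result)
```

With this approach the output keeps one row per input row, so each row shows the total accumulated so far within its own id.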

Answer 2

I am not familiar with the pd object that you have used, but as I understand your question, you have a list of labels (denoted id in your code) that corresponds to several lists of equal length (denoted var1, var2, and var3 in your code). You want to sum the items sharing the same label, do this for each label, and return the result.

The following code solves the general problem (assuming your array of labels is sorted):

from functools import reduce  # reduce is not a builtin in Python 3
from operator import add

def cumsum(A):
    return reduce(add, A)  # sum of all items in A (the total for one label)

def cumsumlbl(A, lbl):
    # begin index of each lbl subsequence; sorted because set order is not guaranteed
    idx = sorted(lbl.index(item) for item in set(lbl))
    idx.append(len(lbl))  # end index of the last subsequence

    return [cumsum(A[i:j]) for (i, j) in zip(idx[:-1], idx[1:])]

Or, to use a modified version of Markus Jarderot's code that appears here:

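from functools import reduce  # reduce is not a builtin in Python 3
from operator import add

def cumsum(A):
    return reduce(add, A)  # sum of all items in A

def doublet(iterable):
    # yield successive, overlapping pairs: (a, b), (b, c), ...
    iterator = iter(iterable)
    item = next(iterator)
    for nxt in iterator:
        yield (item, nxt)
        item = nxt

def cumsumlbl(A, lbl):
    # begin index of each lbl subsequence, in ascending order
    idx = sorted(lbl.index(item) for item in set(lbl))
    idx.append(len(lbl))
    dbl = doublet(idx)  # generator for successive, overlapping pairs of indices

    return [cumsum(A[i:j]) for (i, j) in dbl]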

And to test:

if __name__ == '__main__':
    A = [1, 2, 3, 4, 5, 6]
    lbl = [1, 1, 2, 2, 2, 3]
    print(cumsumlbl(A, lbl))

Output:

[3, 12, 6]
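
Note that this output is one total per label. If a running sum within each label is wanted instead (one value per input item), a possible pure-Python sketch follows. The function name running_sum_by_label is hypothetical and not part of the answer above, and this version does not assume the labels are sorted.

```python
from collections import defaultdict

def running_sum_by_label(values, labels):
    """For each position, return the sum of the values seen so far with the same label."""
    totals = defaultdict(int)  # running total per label, starting at 0
    out = []
    for value, label in zip(values, labels):
        totals[label] += value
        out.append(totals[label])
    return out

A = [1, 2, 3, 4, 5, 6]
lbl = [1, 1, 2, 2, 2, 3]
result = running_sum_by_label(A, lbl)
print(result)  # [1, 3, 3, 7, 12, 6]
```

The last running value for each label (3, 12 and 6 here) matches the per-label totals printed above.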

Answer 1
1 Accepted 2015-03-17 07:38:54

Answer 2
1 2015-03-17 08:46:36
