Splitting time series data into groups based on changes in state of a column in a Python pandas DataFrame
I need to group some data in a pandas DataFrame, but the standard grouping method does not quite work the way I need it to. It must group so that each change in "loc" and/or each change in "name" is treated as a separate group.
Example:
x = pd.DataFrame([['john','abc',1],['john','abc',2],['john','abc',3],['john','xyz',4],['john','xyz',5],['john','abc',6],['john','abc',7],['matt','abc',8]])
x.columns = ['name','loc','time']
name loc time
john abc 1
john abc 2
john abc 3
john xyz 4
john xyz 5
john abc 6
john abc 7
matt abc 8
I need to group these values so that the resulting data is:
name loc first last
john abc 1 3
john xyz 4 5
john abc 6 7
matt abc 8 8
The default grouping function (correctly) groups all the loc and name values, so we are left with only 3 groups (john / abc is 1 group). Does anybody know how the grouping can be forced to work the way I require?
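For reference, the default behaviour described above can be reproduced with a short sketch like the one below. The aggregation step (taking the first and last "time" per group) is an assumption about how the summary would be built; the point is only that the two separate john / abc runs collapse into a single group.

```python
import pandas as pd

x = pd.DataFrame(
    [['john', 'abc', 1], ['john', 'abc', 2], ['john', 'abc', 3],
     ['john', 'xyz', 4], ['john', 'xyz', 5], ['john', 'abc', 6],
     ['john', 'abc', 7], ['matt', 'abc', 8]],
    columns=['name', 'loc', 'time'],
)

# Grouping on the values alone ignores row order, so both john/abc
# runs (times 1-3 and 6-7) are merged into one group.
default = (
    x.groupby(['name', 'loc'])['time']
     .agg(['first', 'last'])
     .reset_index()
)
print(default)
```

This yields three rows, with john / abc spanning 1 to 7, rather than the four rows wanted.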
I'm able to generate the required table using a for loop (iterrows), but if there is a nice, Pythonic pandas way to do the same thing, I would love to know.
Thank you in advance.
Matt
This is not really a job for groupby, because the order of the rows matters. Instead, compare consecutive rows by using shift: every row whose name or loc differs from the previous row starts a new group.
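In [37]: cols = ['name', 'loc']
In [38]: change = (x[cols] != x[cols].shift()).any(axis=1)  # True where a row differs from the previous one
In [39]: groups = x[change].copy()  # first row of each run; copy avoids SettingWithCopyWarning
In [40]: groups.columns = ['name', 'loc', 'first']
In [41]: groups['last'] = (groups['first'].shift(-1) - 1).fillna(x['time'].iloc[-1]).astype(int)  # assumes consecutive integer times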
In [42]: groups
Out[42]:
name loc first last
0 john abc 1 3
3 john xyz 4 5
5 john abc 6 7
7 matt abc 8 8
[4 rows x 4 columns]
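The "last" column above is derived from the next group's "first" value, which works when "time" is a run of consecutive integers. If that does not hold, one possible variation (a sketch, not part of the original answer) is to turn the change flags into a run id with cumsum and then let groupby pick the first and last time of each run. The names run_id and result are only illustrative.

```python
import pandas as pd

x = pd.DataFrame(
    [['john', 'abc', 1], ['john', 'abc', 2], ['john', 'abc', 3],
     ['john', 'xyz', 4], ['john', 'xyz', 5], ['john', 'abc', 6],
     ['john', 'abc', 7], ['matt', 'abc', 8]],
    columns=['name', 'loc', 'time'],
)

cols = ['name', 'loc']

# True on every row whose name/loc differs from the previous row.
change = (x[cols] != x[cols].shift()).any(axis=1)

# A running count of the changes gives each consecutive run its own id.
run_id = change.cumsum()

# Aggregate per run; named aggregation needs pandas 0.25 or later.
result = (
    x.groupby(run_id)
     .agg(name=('name', 'first'), loc=('loc', 'first'),
          first=('time', 'first'), last=('time', 'last'))
     .reset_index(drop=True)
)
print(result)
```

Because the run id only changes when a row differs from its predecessor, the two john / abc runs stay separate even though their values are identical.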
You can use a function in the groupby:
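import pandas as pd

x = pd.DataFrame([['john','abc',1],['john','abc',2],['john','abc',3],['john','xyz',4],['john','xyz',5],['john','abc',6],['john','abc',7],['matt','abc',8]])
x.columns = ['name','loc','time']
last_group = None
c = 0
def f(y):
    # y is an index label; with the default integer index it is also the row position
    global c, last_group
    g = x.iloc[y]['name'], x.iloc[y]['loc']  # irow was removed from pandas; iloc replaces it
    if last_group != g:
        # name or loc changed compared with the previous row: start a new group
        c += 1
        last_group = g
    return c
print(x.groupby(f).head())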
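The same idea can be written without module-level globals by keeping the counter in a closure, and the groups can then be aggregated into the table asked for in the question. This is only a sketch: make_run_key is a hypothetical helper, and it assumes that the index labels are unique and that pandas calls the key function once per label, in row order.

```python
import pandas as pd

x = pd.DataFrame(
    [['john', 'abc', 1], ['john', 'abc', 2], ['john', 'abc', 3],
     ['john', 'xyz', 4], ['john', 'xyz', 5], ['john', 'abc', 6],
     ['john', 'abc', 7], ['matt', 'abc', 8]],
    columns=['name', 'loc', 'time'],
)

def make_run_key(df, cols):
    """Hypothetical helper: build a groupby key function numbering consecutive runs."""
    state = {'last': None, 'count': 0}

    def key(label):
        # Values of the watched columns for this index label.
        current = tuple(df.loc[label, cols])
        if current != state['last']:
            # A change relative to the previous row starts a new run.
            state['count'] += 1
            state['last'] = current
        return state['count']

    return key

runs = x.groupby(make_run_key(x, ['name', 'loc']))
summary = runs.agg(name=('name', 'first'), loc=('loc', 'first'),
                   first=('time', 'first'), last=('time', 'last'))
print(summary)
```

Each call to make_run_key starts with a fresh counter, so the function can be reused on another frame without resetting any global variables.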