
Splitting time series data into groups based on changes in state of a column in a Python pandas DataFrame

I need to group some data in a pandas DataFrame, but the standard grouping method does not quite work the way I need it to. The grouping must treat each change in "loc" and/or each change in "name" as the start of a separate group.

Example:

import pandas as pd

x = pd.DataFrame([['john','abc',1],['john','abc',2],['john','abc',3],['john','xyz',4],['john','xyz',5],['john','abc',6],['john','abc',7],['matt','abc',8]])
x.columns = ['name','loc','time']

name    loc  time
john    abc  1
john    abc  2
john    abc  3
john    xyz  4
john    xyz  5
john    abc  6
john    abc  7
matt    abc  8

I need to group these values so that the resulting data is:

name    loc  first last
john    abc  1     3
john    xyz  4     5
john    abc  6     7
matt    abc  8     8

The default grouping function (correctly) groups all matching loc and name values together, so we are left with only 3 groups (john / abc is 1 group). Does anybody know how the grouping can be forced to work the way I require?
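To make the problem concrete, here is a minimal sketch of what the default grouping produces on the example data. The variable name `default` is only illustrative.

```python
import pandas as pd

x = pd.DataFrame(
    [['john', 'abc', 1], ['john', 'abc', 2], ['john', 'abc', 3],
     ['john', 'xyz', 4], ['john', 'xyz', 5], ['john', 'abc', 6],
     ['john', 'abc', 7], ['matt', 'abc', 8]],
    columns=['name', 'loc', 'time'],
)

# A plain groupby on name/loc ignores row order, so both runs of
# john/abc (times 1-3 and 6-7) are merged into a single group.
default = x.groupby(['name', 'loc'])['time'].agg(['first', 'last']).reset_index()
print(default)
```

This yields three rows rather than the four wanted, with john / abc spanning the whole range of its two separate runs.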

I'm able to generate the required table using a for loop (iterrows), but if there is a nice, Pythonic pandas way to do the same thing, I would love to know.

Thank you in advance.

Matt

This is not really a job for groupby because the order of the rows matters. Instead, compare consecutive rows by using shift.

In [37]: cols = ['name', 'loc']

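In [38]: change = (x[cols] != x[cols].shift()).any(axis=1)  # True where a row differs from the previous row

In [39]: groups = x[change].copy()  # copy so that adding a column does not touch x

In [40]: groups.columns = ['name', 'loc', 'first']

In [41]: groups['last'] = (groups['first'].shift(-1) - 1).fillna(len(x)).astype(int)  # assumes time runs 1..len(x) in steps of 1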

In [42]: groups
Out[42]:
   name  loc  first  last
0  john  abc      1     3
3  john  xyz      4     5
5  john  abc      6     7
7  matt  abc      8     8

[4 rows x 4 columns]
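The `last` column above is derived from the next group's `first` value minus one, which works here because `time` increases by exactly one per row. As a hedged sketch of a variant that does not depend on that, the change flags can be turned into a run id with `cumsum` and then aggregated per run. The names `run_id` and `result` are illustrative, not part of the answer above.

```python
import pandas as pd

x = pd.DataFrame(
    [['john', 'abc', 1], ['john', 'abc', 2], ['john', 'abc', 3],
     ['john', 'xyz', 4], ['john', 'xyz', 5], ['john', 'abc', 6],
     ['john', 'abc', 7], ['matt', 'abc', 8]],
    columns=['name', 'loc', 'time'],
)
cols = ['name', 'loc']

# True wherever a row differs from the previous row in any key column.
# The first row is compared with NaN, so it always starts a new run.
change = (x[cols] != x[cols].shift()).any(axis=1)

# The cumulative sum of the flags gives each consecutive run its own id.
run_id = change.cumsum()

# Aggregate per run: keep the key values and take the first/last time seen.
result = (
    x.groupby(run_id)
     .agg(name=('name', 'first'), loc=('loc', 'first'),
          first=('time', 'first'), last=('time', 'last'))
     .reset_index(drop=True)
)
print(result)
```

Because `first` and `last` are read from the rows of each run, this should also cope with gaps or irregular steps in `time`, as long as the rows are already in time order.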

You can use a function in the groupby:

import pandas as pd

x = pd.DataFrame([['john','abc',1],['john','abc',2],['john','abc',3],['john','xyz',4],['john','xyz',5],['john','abc',6],['john','abc',7],['matt','abc',8]])
x.columns = ['name','loc','time']

last_group = None
c = 0
def f(y):
    # y is an index label; with the default RangeIndex it is also the row position
    global c, last_group
    g = x.iloc[y]['name'], x.iloc[y]['loc']
    if last_group != g:
        # a new group starts whenever name or loc differs from the previous row
        c += 1
        last_group = g
    return c

print(x.groupby(f).head())
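The same idea can be written without module-level globals by keeping the counter inside a closure. This is only a sketch: `make_run_key` is a hypothetical helper, and it assumes pandas calls the key function on the index labels in row order and that the index labels are unique.

```python
import pandas as pd

x = pd.DataFrame(
    [['john', 'abc', 1], ['john', 'abc', 2], ['john', 'abc', 3],
     ['john', 'xyz', 4], ['john', 'xyz', 5], ['john', 'abc', 6],
     ['john', 'abc', 7], ['matt', 'abc', 8]],
    columns=['name', 'loc', 'time'],
)

def make_run_key(df, cols):
    """Build a groupby key function that numbers consecutive runs (hypothetical helper)."""
    state = {'last': None, 'count': 0}

    def key(label):
        # Look up the key columns for this index label.
        current = tuple(df.loc[label, cols])
        if current != state['last']:
            # The key values changed, so a new run begins here.
            state['count'] += 1
            state['last'] = current
        return state['count']

    return key

runs = x.groupby(make_run_key(x, ['name', 'loc']))['time'].agg(['first', 'last'])
print(runs)
```

Each call to `make_run_key` starts from a fresh counter, so the grouping can be repeated without resetting any global variables.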
