Count how many times the value of a column changes to a new value that lasts n or more consecutive rows, together with the from/to values, per group and under a condition, in pandas

I have a pandas DataFrame:

import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','a','a','b','b','b','b','b', 'c','c','c','c'], 
                'week': [1,2,3,4,5,3,4,5,6,7,1,2,3,4],
                'col': [1,1,2,2,1,4,3,3,3,4, 6,6,7,7],
                'confidence': ['h','h','h','l','h','h','h','h','h','h', 'h','h','l','l']})

I want to count how many times (n_changes) the value of col changes, together with the previous value (from) and the new value (to), but only if the new value appears n or more consecutive times and at least one of those consecutive rows has confidence 'h'. I want to do this per id.

For n=3, the output should look something like this:

id  from  to  n_changes
b      4   3          1

because:

  • for b, 3 appears after 4 three or more times in a row, and at least one of those rows has confidence h

For n=2, the output should look something like this:

id  from  to  n
a      1   2  1
b      4   3  1

because:

  • for a, 2 appears after 1 two or more times in a row, and at least one of those rows has confidence h
  • for b, 3 appears after 4 two or more times in a row, and at least one of those rows has confidence h
  • c does not appear in the output: even though 7 appears two or more times in a row after 6, none of those rows has confidence h

Is there a way to achieve this? Any ideas?

UPDATE

I have tried this for n=2:

test = foo.copy()  # work on a copy of the example frame defined above

# look one and two rows ahead within each id
test['next_col'] = test.groupby(['id'])['col'].transform('shift', periods=-1)
test['next_next_col'] = test.groupby(['id'])['col'].transform('shift', periods=-2)
test['next_confidence'] = test.groupby(['id'])['confidence'].transform('shift', periods=-1)
test['next_next_confidence'] = test.groupby(['id'])['confidence'].transform('shift', periods=-2)

# number of 'h' values among the next two rows
test['n_h'] = (test['next_confidence'] == 'h').astype(int) + (test['next_next_confidence'] == 'h').astype(int)

# keep rows where the value changes and the new value holds for two rows with at least one 'h'
final_test = test[test.eval('next_col == next_next_col and n_h >= 1 and col != next_col')].copy()
final_test['helper'] = 1
final_test['n'] = final_test.groupby(['id','col','next_col'])['helper'].transform('sum')
final_test[['id','col','next_col', 'n']].rename(columns={'col': 'from',
                                                    'next_col': 'to'})

which gives as output:

  id  from   to  n
1  a     1  2.0  1
5  b     4  3.0  1

which is correct. But is there a more efficient way of doing it?

Here is a way to do this. The key idea is to establish a run_no value that identifies each run of consecutive identical col values (within a given id). Note that there is no groupby(...).apply(some_python_function), so it is likely to be quite fast even on a large df.

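# assumed setup so the snippet runs on its own: df is the frame from the
# question (called foo there) and n is the minimum run length
df = foo
n = 2

# first, let's establish a "run_no" which is distinct for each
# run of same 'col' for a given 'id'.
# we also set a 'is_h' for later .any() operation, plus a few useful columns.
# from_ uses -1 as a sentinel for "no previous row in this id", which
# assumes -1 never occurs in 'col'.

cols = ['id', 'col']
z = df.assign(
    from_=df.groupby('id')['col'].shift(1, fill_value=-1),
    to=df['col'],
    run_no=(df[cols] != df[cols].shift(1)).any(axis=1).cumsum(),
    is_h=df['confidence'] == 'h')

# next, make a mask that selects the runs we are interested in
gb = z.groupby(['id', 'run_no'])
mask = (gb.size() >= n) & gb['is_h'].any() & (gb.first()['from_'] != -1)

# finally, we select according to that mask, and count each (id, from_, to);
# sort=False keeps the order of first appearance, and counting through
# reset_index stays correct when the same change occurs more than once
out = gb.first().loc[mask].reset_index()
out = (out.groupby(['id', 'from_', 'to'], sort=False)
          .size()
          .reset_index(name='n_changes'))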

Outcome, with n = 2:

>>> out
  id  from_  to  n_changes
0  a      1   2          1
1  b      4   3          1

And with n = 1:

>>> out
  id  from_  to  n_changes
0  a      1   2          1
1  a      2   1          1
2  b      4   3          1
3  b      3   4          1
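Since only the mask depends on n, the steps above can be wrapped in a function to try several values of n. The following is a minimal sketch, not part of the original answer: the function name count_changes and the sentinel parameter are hypothetical, and it assumes the sentinel value (-1 by default) never occurs in col.

```python
import pandas as pd


def count_changes(df, n, sentinel=-1):
    """Count changes of 'col' per 'id' where the new value runs for at least
    n rows and at least one row of that run has confidence 'h'.

    Assumption: `sentinel` is a value that never occurs in df['col'].
    """
    cols = ['id', 'col']
    z = df.assign(
        # previous 'col' within the same id; sentinel marks the first row of an id
        from_=df.groupby('id')['col'].shift(1, fill_value=sentinel),
        to=df['col'],
        # a new run starts whenever 'id' or 'col' differs from the previous row
        run_no=(df[cols] != df[cols].shift(1)).any(axis=1).cumsum(),
        is_h=df['confidence'] == 'h',
    )
    gb = z.groupby(['id', 'run_no'])
    first = gb.first()
    mask = (gb.size() >= n) & gb['is_h'].any() & (first['from_'] != sentinel)
    runs = first.loc[mask].reset_index()
    # one row per distinct (id, from_, to), in order of first appearance
    return (runs.groupby(['id', 'from_', 'to'], sort=False)
                .size()
                .reset_index(name='n_changes'))


foo = pd.DataFrame({
    'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
    'week': [1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 1, 2, 3, 4],
    'col': [1, 1, 2, 2, 1, 4, 3, 3, 3, 4, 6, 6, 7, 7],
    'confidence': ['h', 'h', 'h', 'l', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'l', 'l'],
})

print(count_changes(foo, 3))
```

With the example data, n = 3 should leave only the b row (4 to 3), which matches the output requested in the question.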

Note: if you are interested in the intermediate values, you can of course inspect z (which is independent of n) and mask (which depends on n). For example, for z:

>>> z
   id  week  col confidence  from_  to  run_no   is_h
0   a     1    1          h     -1   1       1   True
1   a     2    1          h      1   1       1   True
2   a     3    2          h      1   2       2   True
3   a     4    2          l      2   2       2  False
4   a     5    1          h      2   1       3   True
5   b     3    4          h     -1   4       4   True
6   b     4    3          h      4   3       5   True
7   b     5    3          h      3   3       5   True
8   b     6    3          h      3   3       5   True
9   b     7    4          h      3   4       6   True
10  c     1    6          h     -1   6       7   True
11  c     2    6          h      6   6       7   True
12  c     3    7          l      6   7       8  False
13  c     4    7          l      7   7       8  False
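The run_no column comes from a shift/compare/cumsum idiom. As a small illustration (the toy frame and variable names below are made up for this sketch and are not part of the answer), comparing both id and col with the previous row starts a new run when the id changes, even if col happens to stay the same:

```python
import pandas as pd

# toy data: 'col' is 1 at the end of id 'a' and at the start of id 'b'
toy = pd.DataFrame({'id': ['a', 'a', 'b', 'b'], 'col': [1, 1, 1, 2]})

cols = ['id', 'col']
# True where 'id' or 'col' differs from the previous row (first row is always True)
starts = (toy[cols] != toy[cols].shift(1)).any(axis=1)
run_no = starts.cumsum()

# comparing 'col' alone would merge the last 'a' rows with the first 'b' row
run_no_col_only = (toy['col'] != toy['col'].shift(1)).cumsum()

print(run_no.tolist())           # [1, 1, 2, 3]
print(run_no_col_only.tolist())  # [1, 1, 1, 2]
```

This is why run_no in z increases at rows 5 and 10, where the id switches, as well as at every change of col.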
