Count how many times a value of a column changes for more than n consecutive times, together with the changes, with group by, and condition in pandas
I have a pandas dataframe:
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','a','a','b','b','b','b','b', 'c','c','c','c'],
'week': [1,2,3,4,5,3,4,5,6,7,1,2,3,4],
'col': [1,1,2,2,1,4,3,3,3,4, 6,6,7,7],
'confidence': ['h','h','h','l','h','h','h','h','h','h', 'h','h','l','l']})
I want to count how many times (n_changes) the value of col changes, together with the previous value (from) and the new value (to), but only if the new value appears n or more consecutive times and there is at least one 'h' in confidence among those consecutive rows. I want to do that by id.
In case n=3, the output should look something like this:
id from to n_changes
b 4 3 1
because:
for b, 3 appears after 4 three or more consecutive times, and among these consecutive rows there is at least one h
In case n=2, the output should look something like this:
id from to n_changes
a 1 2 1
b 4 3 1
because:
for a, 2 appears after 1 two or more consecutive times, and among these consecutive rows there is at least one h
for b, 3 appears after 4 two or more consecutive times, and among these consecutive rows there is at least one h
c does not appear in the output because, even though 7 appears two or more consecutive times after 6, there is no h among these consecutive rows
Is there a way to achieve this? Any ideas?
UPDATE
I have tried this for n=2:
test = foo.copy()
# look ahead one and two rows within each id
test['next_col'] = test.groupby('id')['col'].shift(-1)
test['next_next_col'] = test.groupby('id')['col'].shift(-2)
test['next_confidence'] = test.groupby('id')['confidence'].shift(-1)
test['next_next_confidence'] = test.groupby('id')['confidence'].shift(-2)
# number of 'h' among the next two rows
test['n_h'] = (test['next_confidence'] == 'h').astype(int) + (test['next_next_confidence'] == 'h').astype(int)
# keep rows where the value changes and the new value repeats at least twice
final_test = test[test.eval('next_col == next_next_col and n_h >= 1 and col != next_col')].copy()
final_test['helper'] = 1
final_test['n'] = final_test.groupby(['id', 'col', 'next_col'])['helper'].transform('sum')
final_test[['id', 'col', 'next_col', 'n']].rename(columns={'col': 'from', 'next_col': 'to'})
which gives as output:
id from to n
1 a 1 2.0 1
5 b 4 3.0 1
which is correct. But is there a more efficient way of doing it?
Here is a way to do this. The key idea is to establish a run_no value that identifies each run of consecutive identical col values (within a given id). Note that there is no groupby(...).apply(some_python_function), so this is likely to be quite fast even on a large df.
# the question's frame is called foo; choose the minimum run length n
df = foo
n = 2
# first, establish a "run_no" that is distinct for each run of the same
# 'col' within a given 'id'; also set 'is_h' for the later .any() operation,
# plus 'from_' (previous value within the id; -1 marks "no previous row",
# which assumes -1 never occurs in 'col') and 'to':
cols = ['id', 'col']
z = df.assign(
    from_=df.groupby('id')['col'].shift(1, fill_value=-1),
    to=df['col'],
    run_no=(df[cols] != df[cols].shift(1)).any(axis=1).cumsum(),
    is_h=df['confidence'] == 'h')
# next, make a mask that selects the runs we are interested in
gb = z.groupby(['id', 'run_no'])
mask = (gb.size() >= n) & gb['is_h'].any() & (gb.first()['from_'] != -1)
# finally, select according to that mask and count each (id, from_, to) change;
# sort=False keeps the changes in order of first appearance
out = gb.first().loc[mask].reset_index()
out = out.groupby(['id', 'from_', 'to'], sort=False).size().reset_index(name='n_changes')
Outcome, with n = 2:
>>> out
id from_ to n_changes
0 a 1 2 1
1 b 4 3 1
And with n = 1:
>>> out
id from_ to n_changes
0 a 1 2 1
1 a 2 1 1
2 b 4 3 1
3 b 3 4 1
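To try several values of n without re-running the steps by hand, they can be wrapped in a function. This is only a sketch under a few assumptions: the name count_changes is hypothetical (it is not part of pandas), -1 is assumed never to occur in col so it can serve as the "no previous value" marker, and the rows are assumed to be already ordered by week within each id. The tally uses size().reset_index(name=...) so that a change occurring more than once within an id yields a single row with its count.

```python
import pandas as pd


def count_changes(df, n):
    """Hypothetical helper: count changes of 'col' per 'id' whose new value
    runs for at least n rows and has at least one 'h' in 'confidence'."""
    keys = ['id', 'col']
    z = df.assign(
        # previous value within the id; -1 is an assumed sentinel for "none"
        from_=df.groupby('id')['col'].shift(1, fill_value=-1),
        to=df['col'],
        # a new run starts whenever 'id' or 'col' differs from the previous row
        run_no=(df[keys] != df[keys].shift(1)).any(axis=1).cumsum(),
        is_h=df['confidence'] == 'h',
    )
    gb = z.groupby(['id', 'run_no'])
    first = gb.first()
    mask = (gb.size() >= n) & gb['is_h'].any() & (first['from_'] != -1)
    runs = first.loc[mask].reset_index()
    # one row per distinct (id, from_, to), in order of first appearance
    return (runs.groupby(['id', 'from_', 'to'], sort=False)
                .size()
                .reset_index(name='n_changes'))


foo = pd.DataFrame({
    'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
    'week': [1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 1, 2, 3, 4],
    'col': [1, 1, 2, 2, 1, 4, 3, 3, 3, 4, 6, 6, 7, 7],
    'confidence': ['h', 'h', 'h', 'l', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'l', 'l'],
})

for n in (1, 2, 3):
    print(f'n = {n}')
    print(count_changes(foo, n))
```

With the question's data, n = 3 leaves only the change from 4 to 3 for id b, which matches the output requested in the question.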
Note: if you are interested in the intermediary values, you may of course inspect z (which is independent of n) and mask (which is dependent on n). For example, for z:
>>> z
id week col confidence from_ to run_no is_h
0 a 1 1 h -1 1 1 True
1 a 2 1 h 1 1 1 True
2 a 3 2 h 1 2 2 True
3 a 4 2 l 2 2 2 False
4 a 5 1 h 2 1 3 True
5 b 3 4 h -1 4 4 True
6 b 4 3 h 4 3 5 True
7 b 5 3 h 3 3 5 True
8 b 6 3 h 3 3 5 True
9 b 7 4 h 3 4 6 True
10 c 1 6 h -1 6 7 True
11 c 2 6 h 6 6 7 True
12 c 3 7 l 6 7 8 False
13 c 4 7 l 7 7 8 False
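The mask can be inspected in the same way. The following is a minimal sketch that rebuilds z from the question's data and prints the mask for n = 2. It assumes, as above, that -1 never occurs in col. The mask is indexed by (id, run_no), with one entry per run.

```python
import pandas as pd

foo = pd.DataFrame({
    'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
    'week': [1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 1, 2, 3, 4],
    'col': [1, 1, 2, 2, 1, 4, 3, 3, 3, 4, 6, 6, 7, 7],
    'confidence': ['h', 'h', 'h', 'l', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'l', 'l'],
})

n = 2
cols = ['id', 'col']
z = foo.assign(
    from_=foo.groupby('id')['col'].shift(1, fill_value=-1),  # -1: no previous row
    to=foo['col'],
    run_no=(foo[cols] != foo[cols].shift(1)).any(axis=1).cumsum(),
    is_h=foo['confidence'] == 'h',
)
gb = z.groupby(['id', 'run_no'])

# one boolean per (id, run_no): long enough, has an 'h', and has a previous value
mask = (gb.size() >= n) & gb['is_h'].any() & (gb.first()['from_'] != -1)
print(mask)

# the runs that survive the filter
selected_runs = mask[mask].index.tolist()
print(selected_runs)
```

For n = 2, only run 2 of id a and run 5 of id b are selected. The first run of every id is rejected because it has no previous value, and run 8 of id c is rejected because it contains no h.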