How do I count how often a column value changes in a pandas dataframe
I have a pandas data frame like this:
id some_value
0 tag1 v1
1 tag1 v2
2 tag1 v1
3 tag2 v2
4 tag2 v2
5 tag2 v3
and I would like to know how often, for each id, the value in some_value changed. So for tag1 that would be twice (because it changes first from v1 to v2 and then back); for tag2 it would be once.
I have solved the problem like this:
import pandas as pd
df = pd.DataFrame({'id': ['tag1', 'tag1', 'tag1', 'tag2', 'tag2','tag2'], 'some_value': ['v1','v2','v1','v2','v2','v3']})
mask = df['id'] == df['id'].shift(-1)
df['changed'] = df['some_value'] != df['some_value'].shift(-1)
df[mask].groupby('id').sum()
The code works fine in that it returns
changed
id
tag1 2.0
tag2 1.0
Is there a more elegant solution to this?
One way to achieve this would be:
def numChanges(x):
    # Compare every value except the last with its successor in the group.
    return sum(x.iloc[:-1] != x.shift(-1).iloc[:-1])

df.groupby('id').agg({
    'some_value': numChanges
})
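As a possible variant of the same idea, the comparison can also be done without a Python-level function by shifting within each group. This is only a sketch, not the method above: it assumes some_value contains no missing values (a NaN would be mistaken for the first row of a group), and the names prev, changed and changes are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['tag1', 'tag1', 'tag1', 'tag2', 'tag2', 'tag2'],
    'some_value': ['v1', 'v2', 'v1', 'v2', 'v2', 'v3'],
})

# Previous value within the same id; NaN marks the first row of each id.
prev = df.groupby('id')['some_value'].shift()

# A change is a row that has a predecessor and differs from it.
changed = prev.notna() & (df['some_value'] != prev)

# Count the changes per id.
changes = changed.groupby(df['id']).sum()
print(changes)
```

Because the shift happens inside each group, this counts per id regardless of whether the id column is sorted.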
Please note that if the id column is not sorted, the two approaches give different results: your solution only compares adjacent rows that share an id, so it may produce incorrect results unless you intend it to work that way.
As an example, the dataset below yields 5 for tag2 with my solution, but 3 with yours. Technically, the correct answer is 5, but if your id column is sorted, it makes no difference.
pd.concat([df]*3)  # my solution gives 5 changes for tag2; yours gives only 3
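To check the tag2 figures quoted above, here is a minimal sketch that runs both approaches on the tripled frame. The helper names count_adjacent_rows, count_per_group and num_changes are made up for this illustration, and ignore_index=True is added only to give the concatenated frame a clean index.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['tag1', 'tag1', 'tag1', 'tag2', 'tag2', 'tag2'],
    'some_value': ['v1', 'v2', 'v1', 'v2', 'v2', 'v3'],
})

def num_changes(x):
    # Compare each value with its successor inside one group.
    return int((x.iloc[:-1].to_numpy() != x.iloc[1:].to_numpy()).sum())

def count_adjacent_rows(frame):
    # The question's approach: count a change only where the next row has the same id.
    same_id = frame['id'] == frame['id'].shift(-1)
    changed = frame['some_value'] != frame['some_value'].shift(-1)
    return changed[same_id].groupby(frame.loc[same_id, 'id']).sum()

def count_per_group(frame):
    # The answer's approach: gather all rows of an id first, then compare neighbours.
    return frame.groupby('id')['some_value'].agg(num_changes)

# Three copies stacked on top of each other, so the id column is no longer sorted.
tripled = pd.concat([df] * 3, ignore_index=True)

adjacent = count_adjacent_rows(tripled)
per_group = count_per_group(tripled)
print(adjacent['tag2'], per_group['tag2'])
```

The difference comes from the block boundaries: the per-group count also sees the change from v3 back to v2 between consecutive tag2 blocks, which the adjacent-row mask discards.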