How to count how often a column value changes in a pandas DataFrame

Question

I have a pandas DataFrame like this:

     id some_value
0  tag1         v1
1  tag1         v2
2  tag1         v1
3  tag2         v2
4  tag2         v2
5  tag2         v3

and I would like to know how often the value in some_value changes for each id. For tag1 that would be twice (it changes first from v1 to v2 and then back); for tag2 it would be once. I have solved the problem like this:

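import pandas as pd
df = pd.DataFrame({'id': ['tag1', 'tag1', 'tag1', 'tag2', 'tag2', 'tag2'], 'some_value': ['v1', 'v2', 'v1', 'v2', 'v2', 'v3']})
# True where the next row belongs to the same id
mask = df['id'] == df['id'].shift(-1)
# True where the next row's value differs from this row's value
df['changed'] = df['some_value'] != df['some_value'].shift(-1)
# Sum the change flags per id, skipping the last row of each id block
df[mask].groupby('id')[['changed']].sum()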

The code works fine in that it returns:

      changed
id
tag1      2.0
tag2      1.0

Is there a more elegant solution to this?

Answer 1

One way to achieve this would be:

def numChanges(x):
    return sum(x.iloc[:-1] != x.shift(-1).iloc[:-1])

df.groupby('id').agg({
    'some_value' : numChanges
})
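
For readers who want to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the sample frame from the question; the helper name `count_changes` is a hypothetical label rather than part of the original answer, and it compares each value with its predecessor inside the group via `Series.shift`.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['tag1', 'tag1', 'tag1', 'tag2', 'tag2', 'tag2'],
    'some_value': ['v1', 'v2', 'v1', 'v2', 'v2', 'v3'],
})

def count_changes(values):
    # Compare each value with the one before it. The first row of a
    # group has no predecessor, so it is dropped before summing.
    return int((values != values.shift()).iloc[1:].sum())

# One change count per id, computed within each group.
changes = df.groupby('id')['some_value'].agg(count_changes)
print(changes.to_dict())
```

Because the comparison happens inside each group, the count does not depend on whether rows with the same id are adjacent in the frame.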

Please note that the two approaches give different results if the id column is unsorted, so your solution may produce incorrect results unless that behavior is what you intend.

As an example, the dataset below yields 5 for tag2 with my solution but 3 with yours. Technically, the correct answer would be 5, but if your id column is sorted, it makes no difference.

pd.concat([df]*3)  #My solution outputs 5 changes for tag2 and yours will give 3 only
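
To see the difference concretely, the following sketch runs both approaches on the tripled frame. The function names `adjacent_changes` and `grouped_changes` are illustrative labels for the question's and this answer's approach respectively, and `ignore_index=True` is an addition made here only to keep the row labels unique.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['tag1', 'tag1', 'tag1', 'tag2', 'tag2', 'tag2'],
    'some_value': ['v1', 'v2', 'v1', 'v2', 'v2', 'v3'],
})
# Stack three copies so that each id appears in non-adjacent blocks.
big = pd.concat([df] * 3, ignore_index=True)

def adjacent_changes(frame):
    # Question's approach: compare each row only with the next physical
    # row, and keep the comparison only when both rows share an id.
    same_id = frame['id'] == frame['id'].shift(-1)
    changed = frame['some_value'] != frame['some_value'].shift(-1)
    return changed[same_id].groupby(frame.loc[same_id, 'id']).sum()

def grouped_changes(frame):
    # Answer's approach: compare consecutive values within each id group,
    # regardless of where the rows sit in the frame.
    return frame.groupby('id')['some_value'].agg(
        lambda s: int((s != s.shift()).iloc[1:].sum())
    )

print(adjacent_changes(big).to_dict())
print(grouped_changes(big).to_dict())
```

If the adjacent-row approach is preferred, sorting by id with a stable sort first, for example `big.sort_values('id', kind='stable')`, makes rows of each id contiguous while preserving their original order, so both approaches then agree.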

Answer 1
1 accepted 2019-03-31 17:21:34
