Identify consecutive same values in Pandas Dataframe, with a Groupby

I have the following dataframe df:

import pandas as pd

data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)
df
Out[8]: 
    id  value
0    1      2
1    1      2
2    1      3
3    1      2
4    1      2
5    1      2
6    1      3
7    1      3
8    1      3
9    1      3
10   2      1
11   2      4
12   2      1
13   2      1
14   2      1
15   2      4
16   2      4
17   2      1
18   2      1
19   2      1
20   2      1
21   2      1

What I need to do is identify, at the id level (df.groupby('id')), where value shows the same number 3 or more times consecutively.

I would like to have the following result for the above:

df
Out[12]: 
    id  value  flag
0    1      2     0
1    1      2     0
2    1      3     0
3    1      2     1
4    1      2     1
5    1      2     1
6    1      3     1
7    1      3     1
8    1      3     1
9    1      3     1
10   2      1     0
11   2      4     0
12   2      1     1
13   2      1     1
14   2      1     1
15   2      4     0
16   2      4     0
17   2      1     1
18   2      1     1
19   2      1     1
20   2      1     1
21   2      1     1

I have tried variations of groupby and lambda with pandas rolling.mean: where the rolling mean over the window equals 'value', I set a flag. But this has several problems. First, different values can average to the value you are trying to flag. Second, I can't figure out how to flag all of the values in the window that produced the initial flag: the comparison only identifies the 'right side' of the run, and I still need to fill in the preceding values over the length of the rolling window. See my code here:

import numpy as np

test=df.copy()
test['rma']=test.groupby('id')['value'].transform(lambda x: x.rolling(min_periods=3,window=3).mean())
test['flag']=np.where(test.rma==test.value,1,0)

And the result here:

test
Out[61]: 
    id  value       rma  flag
0    1      2       NaN     0
1    1      2       NaN     0
2    1      3  2.333333     0
3    1      2  2.333333     0
4    1      2  2.333333     0
5    1      2  2.000000     1
6    1      3  2.333333     0
7    1      3  2.666667     0
8    1      3  3.000000     1
9    1      3  3.000000     1
10   2      1       NaN     0
11   2      4       NaN     0
12   2      1  2.000000     0
13   2      1  2.000000     0
14   2      1  1.000000     1
15   2      4  2.000000     0
16   2      4  3.000000     0
17   2      1  3.000000     0
18   2      1  2.000000     0
19   2      1  1.000000     1
20   2      1  1.000000     1
21   2      1  1.000000     1
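
To illustrate the first problem, here is a minimal sketch on a made-up three-value series (the series `s` is an assumption for illustration, not part of the data above). No value repeats, yet the rolling-mean comparison still raises a flag:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: three different values, no run at all.
s = pd.Series([3, 1, 2])

# Same comparison as in the attempt above: rolling mean vs. current value.
rma = s.rolling(window=3, min_periods=3).mean()
flag = np.where(rma == s, 1, 0)

# (3 + 1 + 2) / 3 == 2.0, which equals the last value, so it is flagged.
print(flag.tolist())  # [0, 0, 1]
```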

Can't wait to see what I am missing! Thanks

You can try this: 1) create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes; 2) use transform('size') to calculate the group size and compare it with three, which gives the flag column you need:

df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int) 
df


Breakdown:

1) diff is not equal to zero (which is literally what df.value.diff().ne(0) means) gives True wherever the value changes. The first row is also True, because its diff is NaN:

df.value.diff().ne(0)
#0      True
#1     False
#2      True
#3      True
#4     False
#5     False
#6      True
#7     False
#8     False
#9     False
#10     True
#11     True
#12     True
#13    False
#14    False
#15     True
#16    False
#17     True
#18    False
#19    False
#20    False
#21    False
#Name: value, dtype: bool

2) Then cumsum gives a non-decreasing sequence of ids, where each id denotes a consecutive chunk of equal values. Note that when summing boolean values, True counts as one and False counts as zero:

df.value.diff().ne(0).cumsum()
#0     1
#1     1
#2     2
#3     3
#4     3
#5     3
#6     4
#7     4
#8     4
#9     4
#10    5
#11    6
#12    7
#13    7
#14    7
#15    8
#16    8
#17    9
#18    9
#19    9
#20    9
#21    9
#Name: value, dtype: int64

3) Combined with the id column, you can group the data frame, calculate the group size and get the flag column.
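
As a rough sketch of step 3, the one-liner can be split so the intermediate group sizes are visible. The names `run_id` and `run_size` are introduced here only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
    'value': [2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1],
})

# One label per run of equal values (steps 1 and 2 above).
run_id = df.value.diff().ne(0).cumsum()

# Size of each (id, run) group, broadcast back to every row of the group.
run_size = df.value.groupby([df.id, run_id]).transform('size')

# A row is flagged when its run has at least 3 rows.
flag = run_size.ge(3).astype(int)

print(run_size.tolist())
# [2, 2, 1, 3, 3, 3, 4, 4, 4, 4, 1, 1, 3, 3, 3, 2, 2, 5, 5, 5, 5, 5]
```

Grouping on df.id as well as the run label keeps a run from spanning two ids.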

See EDIT2 for a more robust solution

Same result, but a little bit faster:

labels = (df.value != df.value.shift()).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

    id  value  flag
0    1      2     0
1    1      2     0
2    1      3     0
3    1      2     1
4    1      2     1
5    1      2     1
6    1      3     1
7    1      3     1
8    1      3     1
9    1      3     1
10   2      1     0
11   2      4     0
12   2      1     1
13   2      1     1
14   2      1     1
15   2      4     0
16   2      4     0
17   2      1     1
18   2      1     1
19   2      1     1
20   2      1     1
21   2      1     1

Where:

  1. df.value != df.value.shift() gives the value change
  2. cumsum() creates "labels" for each group of same value
  3. labels.value_counts() counts the occurrences of each label
  4. labels.map(...) replaces labels by the counts computed above
  5. >= 3 creates a boolean mask on count value
  6. astype(int) casts the booleans to int
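
The six steps can be followed one at a time in the sketch below. It uses only the id 1 portion of the sample values, and the intermediate variable names are assumptions made for readability:

```python
import pandas as pd

# Values for id 1 from the sample data.
value = pd.Series([2, 2, 3, 2, 2, 2, 3, 3, 3, 3])

changed = value != value.shift()   # 1. True where the value changes
labels = changed.cumsum()          # 2. one label per run of equal values
counts = labels.value_counts()     # 3. number of rows carrying each label
sizes = labels.map(counts)         # 4. replace each label by its count
mask = sizes >= 3                  # 5. boolean mask on the count
flag = mask.astype(int)            # 6. cast booleans to int

print(labels.tolist())  # [1, 1, 2, 3, 3, 3, 4, 4, 4, 4]
print(sizes.tolist())   # [2, 2, 1, 3, 3, 3, 4, 4, 4, 4]
print(flag.tolist())    # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```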

In my hands it takes 1.03ms on your df, compared to 2.1ms for Psidom's approach. But mine is not a one-liner.


EDIT:

A mix between both approaches is even faster:

labels = df.value.diff().ne(0).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

This gives 911µs with your sample df.
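
Timings like these depend on the machine and the pandas version, so it may be worth re-measuring locally. One possible way, sketched with the standard `timeit` module (the function names `by_groupby` and `by_value_counts` are made up for this comparison):

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'id':    [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
    'value': [2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1],
})

def by_groupby(df):
    # groupby + transform('size') approach
    labels = df.value.diff().ne(0).cumsum()
    return df.value.groupby([df.id, labels]).transform('size').ge(3).astype(int)

def by_value_counts(df):
    # map + value_counts approach (ignores id changes, see EDIT2)
    labels = df.value.diff().ne(0).cumsum()
    return (labels.map(labels.value_counts()) >= 3).astype(int)

for fn in (by_groupby, by_value_counts):
    n = 200
    seconds = timeit.timeit(lambda: fn(df), number=n) / n
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```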


EDIT2: correct solution to account for id change, as pointed out by @clg4

labels = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

Where ... | df.id.diff().ne(0) increments the label where the id changes.

This works even when the value is the same across an id change (tested with value 3 at index 10) and takes 1.28ms.

EDIT3: Better explanations

Take the case where index 10 has value 3, so that only df.id.diff().ne(0) marks the boundary there:

data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      'value':[2,2,3,2,2,2,3,3,3,3,3,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)

df['id_diff'] = df.id.diff().ne(0).astype(int)
df['val_diff'] = df.value.diff().ne(0).astype(int)
df['diff_or'] = (df.id.diff().ne(0) | df.value.diff().ne(0)).astype(int)
df['labels'] = df['diff_or'].cumsum()

     id  value  id_diff  val_diff  diff_or  labels
 0    1      2        1         1        1       1
 1    1      2        0         0        0       1
 2    1      3        0         1        1       2
 3    1      2        0         1        1       3
 4    1      2        0         0        0       3
 5    1      2        0         0        0       3
 6    1      3        0         1        1       4
 7    1      3        0         0        0       4
 8    1      3        0         0        0       4
 9    1      3        0         0        0       4
>10   2      3        1    |    0    =   1       5 <== label increment
 11   2      4        0         1        1       6
 12   2      1        0         1        1       7
 13   2      1        0         0        0       7
 14   2      1        0         0        0       7
 15   2      4        0         1        1       8
 16   2      4        0         0        0       8
 17   2      1        0         1        1       9
 18   2      1        0         0        0       9
 19   2      1        0         0        0       9
 20   2      1        0         0        0       9
 21   2      1        0         0        0       9

The | operator is "bitwise-or", which gives True as long as one of the elements is True. So if there is no diff in value where the id changes, the | reflects the id change. Otherwise it changes nothing. When .cumsum() is performed, the label is incremented where the id changes, so the value 3 at index 10 is not grouped with the values 3 from indexes 6-9.
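
If this is needed in more than one place, the EDIT2 logic could be wrapped in a small function. The sketch below is one way to do it; the name `flag_runs` and the `min_len` parameter are hypothetical, and the column names are assumed to be 'id' and 'value' as above:

```python
import pandas as pd

def flag_runs(df, min_len=3):
    """Return 1 for rows in a run of at least min_len equal values within one id."""
    # A new run starts where the value changes OR where the id changes.
    new_run = df['value'].diff().ne(0) | df['id'].diff().ne(0)
    labels = new_run.cumsum()
    return (labels.map(labels.value_counts()) >= min_len).astype(int)

# Same data as EDIT3: index 10 has value 3, like indexes 6-9 of the previous id.
df = pd.DataFrame({
    'id':    [1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
    'value': [2,2,3,2,2,2,3,3,3,3,3,4,1,1,1,4,4,1,1,1,1,1],
})

with_id = flag_runs(df)

# For comparison: labels built from the value change only.
naive_labels = df['value'].diff().ne(0).cumsum()
naive = (naive_labels.map(naive_labels.value_counts()) >= 3).astype(int)

print(with_id[10], naive[10])  # 0 1
```

Index 10 is flagged only by the value-only version, because there it is merged into the run from the previous id.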

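# try this simpler version
import pandas as pd

a = pd.Series([1,1,1,2,3,4,5,5,5,7,8,0,0,0])
# group on run labels: grouping on the raw values would also count repeats that are not adjacent
b = a.groupby(a.diff().ne(0).cumsum()).transform('size').ge(3).astype('int')
# ge(x) <- x is the number of consecutive repeated values
print(b)

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict(
        {'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
         'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]})

# label each run of equal values, then keep only runs of length >= 3 within each id
runs = (df['value'].shift() != df['value']).cumsum()
df2 = df.groupby([df['id'], runs]).filter(lambda x: len(x) >= 3)

df['flag'] = np.where(df.index.isin(df2.index), 1, 0)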
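
One thing to watch with any of these variants is the grouping key: it has to identify runs, not values. A small sketch on a made-up series (not from the question) where the value 5 occurs three times but never three times in a row:

```python
import pandas as pd

# Hypothetical series: 5 appears three times, but in two separate short runs.
a = pd.Series([5, 5, 1, 5, 2, 2, 2])

# Grouping on the raw values counts every occurrence of a value.
by_value = a.groupby(a).transform('size').ge(3).astype(int)

# Grouping on run labels counts only adjacent repeats.
by_run = a.groupby(a.diff().ne(0).cumsum()).transform('size').ge(3).astype(int)

print(by_value.tolist())  # [1, 1, 0, 1, 1, 1, 1]
print(by_run.tolist())    # [0, 0, 0, 0, 1, 1, 1]
```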
