
Find the middle occurrence of "0" and the first occurrence of "1" of an event in a pandas DataFrame

Hi, I have a pandas DataFrame that has an event column (value) and other columns as well. I want to group by id and take 2 records from each group. Within a group, I want to find a pattern of 5 or more consecutive 0s that is immediately followed by a 1. For that pattern, I want the middle row of the 0s and the row of the first 1 that follows them. The 0 has to repeat at least 5 times in a row; if the run is longer, take the middle row of the last 5 zeros.

In short: I want one 0 row and one 1 row per id. Only take a 1 that is directly preceded by 5 or more consecutive 0s. If the pattern occurs multiple times for an id, take just one occurrence, so that every id that has the pattern yields two records: one 0 and one 1.

For example:

 import pandas as pd
 data={'id':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
  'name': ['a','b','c','d','e','f','g','h','i','j','k','l','m','n'
          ,'o','p','q','r','s','t','a1','b1','c1','d1','e1','f1','g1','h1','i1','j1','k1','l1','m1','n1'
          ,'o1','p1','q1','r1','s1','t1','aa','bb','cc','dd','ee','ff',
          'gg','hh','ii','jj','kk','ll','mm','nn'
          ,'oo','pp','qq','rr','ss','tt','aa1','bb1','cc1','dd1','ee1','ff1',
          'gg1','hh1','ii1','jj1','kk1','ll1','mm1','nn1'
          ,'oo1','pp1','qq1','rr1','ss1','tt1'],
  'value':[0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
           0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0]}
 df=pd.DataFrame.from_dict(data)

As output I want 2 records per id: one for the 0 and one for the 1. The 0 row should be the middle record of 5 or more consecutive 0s.

The expected output is:

    id name  value
16   1    q      0
19   1    t      1
64   2  ee1      0
67   2  hh1      1
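
The pattern described above (a 1 directly preceded by at least 5 consecutive 0s) can be illustrated on a small series. This is only a sketch of one possible way to detect it; the names `run_id`, `run_len` and `follows_long_zero_run` are illustrative and do not come from the question.

```python
import pandas as pd

values = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 1])

# Start a new run label every time the value changes.
run_id = values.ne(values.shift()).cumsum()

# Length of the run that each row belongs to.
run_len = run_id.map(run_id.value_counts())

# True for a 1 whose previous row is a 0 inside a run of 5 or more zeros.
follows_long_zero_run = (
    values.eq(1) & values.shift().eq(0) & run_len.shift().ge(5)
)

print(follows_long_zero_run[follows_long_zero_run].index.tolist())  # [8]
```

The 1 at position 2 is not flagged because only two 0s precede it; the 1 at position 8 is flagged because it follows five 0s.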

You can do it using a pivot table and applying masks for the different values. First, group by each (id, value) pair:

df_grouped = df.reset_index().pivot_table(index=['id','value'],
                                          values='name',
                                          aggfunc=lambda x: ','.join(x)
                                          ).reset_index()


df_grouped['name'] = df_grouped['name'].str.split(',')

print(df_grouped)

   id  value             name
0   1      0  a,b,d,e,f,g,h,i
1   1      1              c,j
2   2      0        l,m,n,o,p
3   2      1    k,q,r,s,t,u,w

Then select the names for each value==0 and id pair, require at least 5 of them, and keep the middle one:

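import numpy as np  # needed for np.ceil below

# True only for value==0 groups that hold at least 5 names
mask_zeros = ((df_grouped['value']==0)*
              (df_grouped['name'].apply(len)>=5))
# multiplying by the boolean mask blanks out the rows that do not qualify
df_zeros = mask_zeros*df_grouped['name'].apply(
           lambda x: x[int(np.ceil(.5*len(x)))] 
                      if len(x)%2==1 
                      else x[int(.5*len(x))])
print(df_zeros)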

0    f
1     
2    o
3     
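
For reference, the conditional expression above resolves to index 3 for a five-element list, that is, the fourth element rather than the third. If the exact center is wanted instead, a minimal sketch could look like the following. `middle_item` is a hypothetical helper introduced only for illustration; it is not part of the answer or of pandas, and for an even length it picks the lower of the two middle elements, which is an assumption.

```python
def middle_item(items):
    """Return the center element; for an even length, the lower of the two middle ones."""
    if not items:
        raise ValueError("items must not be empty")
    return items[(len(items) - 1) // 2]


print(middle_item(["l", "m", "n", "o", "p"]))  # n
print(middle_item(["a", "b", "c", "d"]))       # b
```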

And select the first name for each value==1 and id pair:

mask_ones = (df_grouped['value']==1)
df_ones = mask_ones*df_grouped['name'].apply(
           lambda x: x[0] if len(x)>0 else None)

print(df_ones)

0     
1    c
2     
3    k

Then keep only the selected names by assigning them back and merging with the original frame to recover the row index:

 df_grouped['name'] = df_ones + df_zeros

 df_grouped = df_grouped.merge(df.reset_index(),
                               on=['name','value','id']
                               ).set_index('index')
 print(df_grouped)

       id  value name
index                
5       1      0    f
2       1      1    c
14      2      0    o
10      2      1    k

I break it down into steps:

df['New']=df.value.diff().fillna(0).ne(0).cumsum()
df1=df.loc[df.value.eq(0)]
s1=df1.groupby(['id','New']).filter(lambda x : len(x)>=5 ).groupby('id').apply(lambda x : x.iloc[len(x)//2-1:len(x)//2+1] if len(x)%2==0 else x.iloc[[(len(x)+1)//2],:] ).reset_index(level=0,drop=True)
s2=df1.groupby(['id','New']).filter(lambda x : len(x)>=5 )
pd.concat([df.loc[s2.drop_duplicates(['id'],keep='last').index+1],s1]).sort_index()
Out[1995]: 
    id name  value  New
5    1    f      0    2
6    1    g      0    2
9    1    j      1    3
14   2    o      0    4
16   2    q      1    5
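
For comparison, here is a compact sketch that follows the rule as worded in the question: per id, find a 1 that directly follows at least 5 consecutive 0s, then return the middle row of the last 5 of those 0s together with that 1. Two points are assumptions, because the question leaves them open: when the pattern occurs several times for an id, the last occurrence is kept, and ids without the pattern are skipped. `pick_zero_and_one` is a hypothetical name, and the demo frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def pick_zero_and_one(df, min_zeros=5):
    """Per id, return the middle row of the last `min_zeros` zeros before a 1, plus that 1."""
    picked = []
    for _, group in df.groupby("id", sort=False):
        values = group["value"]
        # Label runs of consecutive equal values inside this id only.
        run_id = values.ne(values.shift()).cumsum()
        run_len = run_id.map(run_id.value_counts())
        # A 1 whose previous row is a 0 belonging to a long enough run.
        is_hit = values.eq(1) & values.shift().eq(0) & run_len.shift().ge(min_zeros)
        positions = np.flatnonzero(is_hit.to_numpy())
        if positions.size == 0:
            continue  # assumption: ids without the pattern yield no rows
        one_pos = positions[-1]  # assumption: keep the last qualifying pattern
        # Middle of the `min_zeros` zeros that sit right before the 1.
        zero_pos = one_pos - 1 - min_zeros // 2
        picked.append(group.iloc[[zero_pos, one_pos]])
    return pd.concat(picked) if picked else df.iloc[0:0]


demo = pd.DataFrame({
    "id": [1] * 9 + [2] * 8,
    "value": [0, 0, 1, 0, 0, 0, 0, 0, 1] + [0, 0, 0, 0, 0, 0, 1, 0],
})
result = pick_zero_and_one(demo)
print(result)
```

With `min_zeros=5` the selected 0 row always sits three rows above the selected 1 row, which matches the spacing of the row labels in the question's expected output.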
