
python pandas groupby multiple groups with binary class

I have a DataFrame as follows:

id class
A   1
B   1
C   0 
D   0
E   1
F   1

I want to split it into 3 groups of consecutive rows: G1: A, B; G2: C, D; G3: E, F. Is there a way to do so without looping over all the rows to assign a new class to each id?

You can use diff, astype and cumsum:

print(df)
    id  class
0    A      0
1    B      1
2   B1      1
3    C      0
4    D      0
5    E      1
6    F      1
7   F1      1
8    G      0
9    H      0
10   I      1
11   J      1

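# A row starts a new group when its class differs from the previous row;
# the cumulative sum of those starts is the group label.
df['count'] = (df['class'].diff(1) != 0).astype('int').cumsum()
print(df)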

    id  class  count
0    A      0      1
1    B      1      2
2   B1      1      2
3    C      0      3
4    D      0      3
5    E      1      4
6    F      1      4
7   F1      1      4
8    G      0      5
9    H      0      5
10   I      1      6
11   J      1      6

for name, group in df.groupby('count'):
    print(name)
    print(group[['id', 'class']])
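
To see why the diff / astype / cumsum chain labels each run of equal values, here is a minimal sketch that breaks it into steps on the six-row frame from the question. The intermediate names `step_diff` and `run_start` are only for illustration and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'class': [1, 1, 0, 0, 1, 1]},
                  columns=['id', 'class'])

# Step 1: difference from the previous row (NaN for the first row).
step_diff = df['class'].diff(1)

# Step 2: True wherever a new run starts. NaN != 0 evaluates to True,
# so the first row always opens a run.
run_start = step_diff != 0

# Step 3: the running total of run starts gives one label per run.
df['count'] = run_start.astype('int').cumsum()

print(df)
```

Because the first row always counts as a run start, the labels begin at 1 rather than 0.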

Testing performance:

These timings depend heavily on the size of df, as well as on the number and position of the 0s and 1s:

import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], 'class': [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]}, columns=['id', 'class'])

# Uncomment to test with len(df) = 10000
#df =  pd.concat([df]*1000).reset_index(drop=True)

def jez(df):
    df['count'] = (df['class'].diff(1) != 0).astype('int').cumsum()
    return df

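def eze(df):
    group_index = [0]
    for i in df.index[1:]:
        if df['class'][i]==df['class'][i-1]:
            # Same class as the previous row: stay in the current group.
            group_index.append(group_index[-1])
        else:
            # Class changed: start a new group.
            group_index.append(group_index[-1]+1)
    df['group_index'] = group_index
    return df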

def sy2(df):
    df = pd.concat([df, pd.Series(map(lambda x: 1 if abs(x) > 0 else 0, df['class'].diff().fillna(0)), name='groupid').cumsum()], axis=1)
    return df

print(jez(df))
print(eze(df))
print(sy2(df))

Test len(df) = 10:

In [28]: %timeit jez(df)
The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 454 µs per loop

In [29]: %timeit eze(df)
The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 422 µs per loop

In [30]: %timeit sy2(df)
The slowest run took 4.57 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.46 ms per loop

Test len(df) = 10000:

In [32]: %timeit jez(df)
The slowest run took 4.78 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 543 µs per loop

In [33]: %timeit eze(df)
1 loops, best of 3: 245 ms per loop

In [34]: %timeit sy2(df)
The slowest run took 4.11 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 9.11 ms per loop
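
The timings above come from IPython's %timeit. As a rough sketch of how the size dependence could be checked in a plain script, the following uses the standard-library timeit module. The helpers `make_frame`, `label_runs` and `best_time` are hypothetical names introduced here; `label_runs` uses the same expression as jez() but returns the labels instead of adding a column, so the numbers will not match the ones above exactly.

```python
import timeit

import pandas as pd

# Ten-row pattern taken from the test above.
BASE = pd.DataFrame({'id': list('ABCDEFGHIJ'),
                     'class': [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]},
                    columns=['id', 'class'])


def make_frame(repeats):
    """Stack the base pattern `repeats` times (hypothetical helper)."""
    return pd.concat([BASE] * repeats).reset_index(drop=True)


def label_runs(df):
    """Same expression as jez(), but without mutating df."""
    return (df['class'].diff(1) != 0).astype('int').cumsum()


def best_time(func, frame, number=5, repeat=3):
    """Best average seconds per call over `repeat` batches of `number` calls."""
    timings = timeit.repeat(lambda: func(frame), number=number, repeat=repeat)
    return min(timings) / number


for repeats in (1, 100, 1000):
    frame = make_frame(repeats)
    print(len(frame), best_time(label_runs, frame))
```

Swapping `label_runs` for another labelling function, or changing the 0/1 pattern in `BASE`, gives a quick way to compare approaches on differently shaped data.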

Iterate through 'class' and start a new group every time the value differs from the previous one. For the example:

Create the DataFrame:

import pandas as pd
df = pd.DataFrame()
df['id'] = ['a','b','c','d','e','f']
df['class'] = [1,1,0,0,1,1]

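Iterate through 'class' to create the group index:

group_index = [0]
for i in df.index[1:]:
    if df['class'][i]==df['class'][i-1]:
        # Same class as the previous row: stay in the current group.
        group_index.append(group_index[-1])
    else:
        # Class changed: start a new group.
        group_index.append(group_index[-1]+1)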

Add group_index to the DataFrame:

df['group_index'] = group_index

The output should be:

  id  class  group_index
0  a      1            0
1  b      1            0
2  c      0            1
3  d      0            1
4  e      1            2
5  f      1            2
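
The loop above looks up the previous row as `i-1`, which relies on the default 0..n-1 integer index. As a hedged sketch of the same idea that works on plain values instead, the hypothetical helper `run_labels` below walks a list and bumps the label whenever the value changes. The non-default index is used on purpose to show that positions, not index labels, drive the result.

```python
import pandas as pd


def run_labels(values):
    """Return a 0-based run label for each item in `values` (illustrative helper)."""
    labels = []
    previous = None
    for position, value in enumerate(values):
        if position == 0:
            labels.append(0)                # first item opens group 0
        elif value == previous:
            labels.append(labels[-1])       # same value: stay in the group
        else:
            labels.append(labels[-1] + 1)   # value changed: new group
        previous = value
    return labels


df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'class': [1, 1, 0, 0, 1, 1]},
                  index=[10, 20, 30, 40, 50, 60])  # non-default index on purpose

# Assigning a list fills the column by position.
df['group_index'] = run_labels(df['class'].tolist())
print(df)
```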

Here is a one-liner. :p It uses the difference between adjacent rows and a cumulative sum to assign a group id to each row.

>>> df = pd.DataFrame({'id': ['A','B','C','D','E','F'],
                       'class': [1, 1, 0, 0, 1, 1]},
                       columns=['id', 'class'])

>>> pd.concat([df, pd.Series(map(lambda x: 1 if abs(x) > 0 else 0,
df['class'].diff().fillna(0)), name='groupid').cumsum()], axis=1)

  id  class  groupid
0  A      1        0
1  B      1        0
2  C      0        1
3  D      0        1
4  E      1        2
5  F      1        2

Now you can use groupby() to obtain a GroupBy object.

>>> g = pd.concat([df, pd.Series(map(lambda x: 1 if abs(x) > 0 else 0,
df['class'].diff().fillna(0)), name='groupid').cumsum()], axis=1).groupby('groupid')

>>> for index, group_df in g:
...     print(group_df)

  id  class  groupid
0  A      1        0
1  B      1        0
  id  class  groupid
2  C      0        1
3  D      0        1
  id  class  groupid
4  E      1        2
5  F      1        2
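
Once each row carries a group id, the G1/G2/G3 membership asked for in the question can be read off the GroupBy object. The sketch below is one possible way to do that. It swaps diff() for a shift()-based comparison, which is an alternative to the one-liner above rather than the same code, and the names `groupid`, `members` and `named` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'class': [1, 1, 0, 0, 1, 1]},
                  columns=['id', 'class'])

# Compare each row with the previous one; shift() also works for
# non-numeric labels, where diff() would not.
groupid = (df['class'] != df['class'].shift()).cumsum().rename('groupid')

# Collect the ids that fall into each run.
members = df.groupby(groupid)['id'].apply(list)

# Name the groups G1, G2, ... as in the question.
named = {'G{}'.format(number): ids for number, ids in members.items()}
print(named)
```

With this comparison the first row always starts a run, so the ids here begin at 1, whereas the one-liner above starts at 0.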

The complete code follows.

import pandas as pd

def groupby_binaryflag(df, key='class'):
    # Use the column named by `key` instead of a hard-coded 'class'.
    return pd.concat([df,
                      pd.Series(map(lambda x: 1
                                    if abs(x) > 0
                                    else 0, df[key].diff().fillna(0)),
                                name='groupid').cumsum()], axis=1).groupby('groupid')

if __name__ == '__main__':
    df1 = pd.DataFrame({'id': ['A','B','C','D','E','F'],
                        'class': [1, 1, 0, 0, 1, 1]}, columns=['id', 'class'])

    df2 = pd.DataFrame({'id': ['A','B','C','D','E','F', 'G', 'H', 'I', 'J', 'K', 'L'],
                        'class': [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1]}, columns=['id', 'class'])

    for df in [df1, df2]:
        for index, group_df in groupby_binaryflag(df):
            print(group_df)

  id  class  groupid
0  A      1        0
1  B      1        0
  id  class  groupid
2  C      0        1
3  D      0        1
  id  class  groupid
4  E      1        2
5  F      1        2

  id  class  groupid
0  A      1        0
1  B      1        0
  id  class  groupid
2  C      0        1
3  D      0        1
  id  class  groupid
4  E      1        2
5  F      1        2
  id  class  groupid
6  G      0        3
7  H      0        3
8  I      0        3
   id  class  groupid
9   J      1        4
10  K      1        4
11  L      1        4

