I have a DataFrame like the following:
id class
A 1
B 1
C 0
D 0
E 1
F 1
I want to group it into 3 groups, G1: A,B, G2: C,D, G3: E,F, so that each run of consecutive equal class values becomes its own group. Is there a way to do so without looping over all the rows to assign a new class for each id?
You can use diff, astype and cumsum:
print(df)
id class
0 A 0
1 B 1
2 B1 1
3 C 0
4 D 0
5 E 1
6 F 1
7 F1 1
8 G 0
9 H 0
10 I 1
11 J 1
df['count'] = (df['class'].diff(1) != 0).astype('int').cumsum()
print(df)
id class count
0 A 0 1
1 B 1 2
2 B1 1 2
3 C 0 3
4 D 0 3
5 E 1 4
6 F 1 4
7 F1 1 4
8 G 0 5
9 H 0 5
10 I 1 6
11 J 1 6
for name, group in df.groupby('count'):
    print(name)
    print(group[['id', 'class']])
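As a hedged variant of the same idea, the comparison can be written with shift and ne instead of diff. This is not what the answer above uses; it is a sketch that assumes the question's six-row data and has the side benefit of also working when the column is not numeric (diff needs subtraction, ne only needs equality).

```python
import pandas as pd

# Assumed sample data, taken from the question.
df = pd.DataFrame({'id': list('ABCDEF'), 'class': [1, 1, 0, 0, 1, 1]})

# A run starts wherever the value differs from the row above it.
# shift() puts NaN in the first row, so the first row always starts a run.
starts = df['class'].ne(df['class'].shift())

# The running total of run starts numbers the runs 1, 2, 3, ...
df['count'] = starts.cumsum()

# Collect the ids that fall into each run.
groups = [g['id'].tolist() for _, g in df.groupby('count')]
print(groups)
```

The list printed at the end holds one list of ids per run, in the order the runs appear in the frame.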
Testing performance:
These timings are going to be very dependent on the size of df as well as the number (and position) of 0 and 1 values:
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], 'class': [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]}, columns=['id', 'class'])
#uncomment for test len(df) = 10000
#df = pd.concat([df]*1000).reset_index(drop=True)

# vectorized: flag every change of class, then take the running total
def jez(df):
    df['count'] = (df['class'].diff(1) != 0).astype('int').cumsum()
    return df

# explicit loop: compare each row with the previous one
def eze(df):
    group_index = [0]
    for i in df.index[1:]:
        if df['class'][i]==df['class'][i-1]:
            group_index.append(group_index[-1])
        else:
            group_index.append(group_index[-1]+1)
    df['group_index'] = group_index
    return df

# diff plus a Python-level map over the differences
def sy2(df):
    df = pd.concat([df, pd.Series(map(lambda x: 1 if abs(x) > 0 else 0, df['class'].diff().fillna(0)), name='groupid').cumsum()], axis=1)
    return df

print(jez(df))
print(eze(df))
print(sy2(df))
Test len(df) = 10:
In [28]: %timeit jez(df)
The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 454 µs per loop
In [29]: %timeit eze(df)
The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 422 µs per loop
In [30]: %timeit sy2(df)
The slowest run took 4.57 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.46 ms per loop
Test len(df) = 10000:
In [32]: %timeit jez(df)
The slowest run took 4.78 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 543 µs per loop
In [33]: %timeit eze(df)
1 loops, best of 3: 245 ms per loop
In [34]: %timeit sy2(df)
The slowest run took 4.11 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 9.11 ms per loop
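The %timeit lines above only work inside IPython. A rough way to repeat the comparison in a plain script is the standard-library timeit module. The sketch below is an assumption-laden stand-in, not the exact benchmark: vectorized and looped are hypothetical names, and looped iterates over a plain Python list rather than indexing the Series row by row as eze does, so its timing will not match the eze figure. Absolute numbers will also differ by machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'id': list('ABCDEFGHIJ'),
                     'class': [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]})
big = pd.concat([base] * 1000).reset_index(drop=True)  # 10000 rows

def vectorized(frame):
    # diff/cumsum approach; returns the labels without mutating the frame
    return (frame['class'].diff() != 0).astype('int').cumsum()

def looped(frame):
    # explicit Python loop over the values, numbering runs from 1
    values = frame['class'].tolist()
    labels = [1]
    for prev, cur in zip(values, values[1:]):
        labels.append(labels[-1] + (cur != prev))
    return pd.Series(labels, index=frame.index)

# Both approaches must agree before their timings are worth comparing.
same = vectorized(big).tolist() == looped(big).tolist()
print(same)

for func in (vectorized, looped):
    # best of 3 repeats, 10 calls each, reported as seconds per call
    seconds = min(timeit.repeat(lambda: func(big), number=10, repeat=3)) / 10
    print(func.__name__, seconds)
```

Neither function writes to the frame, so repeated calls during timing all see the same input.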
Iterate through 'class' and start a new group every time the class differs from the previous row's class. For the example:
Create the DF:
import pandas as pd
df = pd.DataFrame()
df['id'] = ['a','b','c','d','e','f']
df['class'] = [1,1,0,0,1,1]
Iterate through 'class' to create the groups index:
group_index = [0]
for i in df.index[1:]:
    if df['class'][i]==df['class'][i-1]:
        group_index.append(group_index[-1])
    else:
        group_index.append(group_index[-1]+1)
Add the group_index to the DF:
df['group_index'] = group_index
and the output should be:
id class group_index
0 a 1 0
1 b 1 0
2 c 0 1
3 d 0 1
4 e 1 2
5 f 1 2
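Once the group_index column exists, the ids belonging to each group can be collected with groupby. A minimal sketch, assuming the frame shown in the output above; the G1, G2, G3 labels come from the question, and the names ids_per_group and named are only illustrative.

```python
import pandas as pd

# The frame as it looks after group_index has been added (values copied
# from the output above).
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'class': [1, 1, 0, 0, 1, 1],
                   'group_index': [0, 0, 1, 1, 2, 2]})

# One list of ids per group_index value.
ids_per_group = df.groupby('group_index')['id'].apply(list).to_dict()

# Relabel 0, 1, 2 as G1, G2, G3 to match the naming used in the question.
named = {'G%d' % (k + 1): v for k, v in ids_per_group.items()}
print(named)
```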
Here is a one-liner. :p It uses the difference between adjacent rows and a cumulative sum to assign a group id to each row.
>>> df = pd.DataFrame({'id': ['A','B','C','D','E','F'],
...                    'class': [1, 1, 0, 0, 1, 1]},
...                   columns=['id', 'class'])
>>> pd.concat([df, pd.Series(map(lambda x: 1 if abs(x) > 0 else 0,
...                              df['class'].diff().fillna(0)), name='groupid').cumsum()], axis=1)
id class groupid
0 A 1 0
1 B 1 0
2 C 0 1
3 D 0 1
4 E 1 2
5 F 1 2
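To see why the one-liner produces these ids, it can help to look at the intermediate values. The sketch below splits it into separate columns; the column names diff and changed are made up for illustration, and the comparison is done with != 0 instead of the lambda, which gives the same flags for this data.

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'class': [1, 1, 0, 0, 1, 1]},
                  columns=['id', 'class'])

steps = pd.DataFrame({
    'class': df['class'],
    # difference to the previous row; the first row has none, so fill with 0
    'diff': df['class'].diff().fillna(0),
})

# 1 where the class changed relative to the previous row, else 0
steps['changed'] = (steps['diff'] != 0).astype(int)

# the running total of changes is the group id
steps['groupid'] = steps['changed'].cumsum()
print(steps)
```

Because the first row's difference is filled with 0, numbering starts at 0 here, whereas the diff(1) != 0 version earlier in the thread starts at 1.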
Now, you can use groupby() to obtain a groupby object.
>>> g = pd.concat([df, pd.Series(map(lambda x: 1 if abs(x) > 0 else 0,
...                                  df['class'].diff().fillna(0)), name='groupid').cumsum()], axis=1).groupby('groupid')
>>> for index, group_df in g:
...     print(group_df)
id class groupid
0 A 1 0
1 B 1 0
id class groupid
2 C 0 1
3 D 0 1
id class groupid
4 E 1 2
5 F 1 2
The complete code is attached.
import pandas as pd

def groupby_binaryflag(df, key='class'):
    # Flag rows whose key differs from the previous row, then number the runs.
    flags = pd.Series(map(lambda x: 1 if abs(x) > 0 else 0,
                          df[key].diff().fillna(0)),
                      name='groupid')
    return pd.concat([df, flags.cumsum()], axis=1).groupby('groupid')

if __name__ == '__main__':
    df1 = pd.DataFrame({'id': ['A','B','C','D','E','F'],
                        'class': [1, 1, 0, 0, 1, 1]}, columns=['id', 'class'])
    df2 = pd.DataFrame({'id': ['A','B','C','D','E','F', 'G', 'H', 'I', 'J', 'K', 'L'],
                        'class': [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1]}, columns=['id', 'class'])
    for df in [df1, df2]:
        for index, group_df in groupby_binaryflag(df):
            print(group_df)
        print("=====\n")
Output:
id class groupid
0 A 1 0
1 B 1 0
id class groupid
2 C 0 1
3 D 0 1
id class groupid
4 E 1 2
5 F 1 2
=====
id class groupid
0 A 1 0
1 B 1 0
id class groupid
2 C 0 1
3 D 0 1
id class groupid
4 E 1 2
5 F 1 2
id class groupid
6 G 0 3
7 H 0 3
8 I 0 3
id class groupid
9 J 1 4
10 K 1 4
11 L 1 4
=====