I want to optimize code that regroups my pandas dataframe (dk) by its join labels:
dk = pd.DataFrame({'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4}})
If two groups with different join values share the same Point, both groups should get one common join value, and so on across the whole dataframe. I did it with this simple code:
dk['new'] = dk['join']
for i in dk.index:
    for j in range(i+1, dk.shape[0]):
        if dk['Point'][i] == dk['Point'][j]:
            dk['new'][j] = dk['join'][i]
            dk.loc[(dk['join'] == dk['join'][j]), 'new'] = dk['new'][i]
Result that I want:
df = {'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4},
'new': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 4}}
But I need to run it on big data with more than 450k rows. Do you have any idea how to optimize it, or other modules for this problem? Thanks in advance.
You can iterate over the sub-dataframes grouped by 'join' and start a new label whenever the intersection with the previous group's 'Point' values is empty (this assumes that overlapping groups appear consecutively, as they do in your example):
df = pd.DataFrame()
new = None
point_set = set()
for j, sub_df in dk.groupby('join'):
    if new is None or not set(sub_df['Point']).intersection(point_set):
        new = j
    point_set = set(sub_df['Point'])
    sub_df['new'] = new
    df = pd.concat([df, sub_df])
print(df)
Output:
Point join new
0 15 0 0
1 16 0 0
2 16 1 0
3 17 1 0
4 17 2 0
5 18 2 0
6 18 3 0
7 19 3 0
8 20 4 4
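If overlapping groups are not guaranteed to be consecutive, a disjoint-set (union-find) structure over the join labels merges all transitively connected groups in near-linear time, which scales to 450k rows. This is a sketch of that alternative, not the code above; the `find`/`union` helpers are my own names:

```python
import pandas as pd

dk = pd.DataFrame({'Point': [15, 16, 16, 17, 17, 18, 18, 19, 20],
                   'join':  [0, 0, 1, 1, 2, 2, 3, 3, 4]})

# Union-find over 'join' labels: two groups sharing a Point get merged.
parent = {}

def find(x):
    # Find the root of x's set, compressing the path along the way.
    root = x
    while parent.setdefault(root, root) != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        # Keep the smaller label as the representative.
        if rb < ra:
            ra, rb = rb, ra
        parent[rb] = ra

# For every Point, merge all join labels that contain it.
for _, joins in dk.groupby('Point')['join']:
    labels = joins.unique()
    for other in labels[1:]:
        union(labels[0], other)

dk['new'] = dk['join'].map(find)
print(dk)
```

Each row is touched a constant number of times apart from the near-constant-time `find` calls, so this avoids the O(n²) pairwise comparison entirely and does not depend on the order of the groups.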