
How to optimize regrouping code for dataframe

I want to optimize code that regroups my pandas dataframe (dk) by its 'join' column:

import pandas as pd

dk = pd.DataFrame({'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
                   'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4}})

If two groups with different joins share a point, both groups should get the same join, and this should be applied across the whole dataframe. I did it with this simple code:

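dk['new'] = dk['join']
for i in dk.index:
    for j in range(i + 1, dk.shape[0]):
        if dk.at[i, 'Point'] == dk.at[j, 'Point']:
            # Use .at/.loc instead of chained indexing so the assignment hits dk itself
            dk.at[j, 'new'] = dk.at[i, 'join']
            # Relabel every row of j's group with the label already held by row i
            dk.loc[dk['join'] == dk.at[j, 'join'], 'new'] = dk.at[i, 'new']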

Result that I want:

df = {'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
 'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4},
 'new': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 4}}

But I need to run it on big data with more than 450k rows. Do you have any idea how to optimize it, or which other modules could help with this problem? (Thanks in advance.)

You can iterate over the sub-dataframes grouped by 'join' and set new to the current join only when the intersection with the previous group's 'Point' values is empty (I don't know whether 'Point' is always increasing, but this would also cover the case where it's not):

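parts = []
new = None
point_set = set()
for j, sub_df in dk.groupby('join'):
    # Start a new label when this group shares no point with the previous group
    if new is None or not set(sub_df['Point']).intersection(point_set):
        new = j

    point_set = set(sub_df['Point'])
    parts.append(sub_df.assign(new=new))

# Concatenate once instead of growing the dataframe inside the loop
df = pd.concat(parts)

print(df)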

Output:

   Point  join  new
0     15     0    0
1     16     0    0
2     16     1    0
3     17     1    0
4     17     2    0
5     18     2    0
6     18     3    0
7     19     3    0
8     20     4    4
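
The loop above only compares each group with the group directly before it. If groups that are not adjacent can also share a point, one possible alternative is to treat the task as finding connected components with a union-find (disjoint-set) structure. The following is a minimal sketch under that assumption, not part of the original answer: merge_joins and first_join_for_point are hypothetical names, and using the smallest join label as the representative is only a choice that happens to match the desired output in the question.

```python
def merge_joins(points, joins):
    """Return one label per row: the smallest 'join' in each connected group."""
    parent = {}  # join label -> parent label in the disjoint-set forest

    def find(x):
        # Walk up to the root, then point every visited node at it.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller label as the root so the result is deterministic.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    first_join_for_point = {}  # point -> first join label seen with that point
    for p, j in zip(points, joins):
        parent.setdefault(j, j)
        if p in first_join_for_point:
            union(first_join_for_point[p], j)
        else:
            first_join_for_point[p] = j

    return [find(j) for j in joins]


# Data from the question, as plain lists
points = [15, 16, 16, 17, 17, 18, 18, 19, 20]
joins = [0, 0, 1, 1, 2, 2, 3, 3, 4]
print(merge_joins(points, joins))  # [0, 0, 0, 0, 0, 0, 0, 0, 4]
```

With the dataframe from the question, the result could be attached with dk['new'] = merge_joins(dk['Point'].tolist(), dk['join'].tolist()). Each row is visited a small, fixed number of times, so this should scale roughly linearly with the number of rows, although it has not been timed on 450k rows here.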
