I want to optimize code that regroups my pandas dataframe (dk) by its join labels:
dk = pd.DataFrame({'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4}})
If two groups with different join values share the same Point, both groups should get one common join value, and so on across the whole dataframe. I did it with this simple code:
dk['new'] = dk['join']
for i in dk.index:
    for j in range(i+1, dk.shape[0]):
        if dk['Point'][i] == dk['Point'][j]:
            dk['new'][j] = dk['join'][i]
            dk.loc[(dk['join'] == dk['join'][j]), 'new'] = dk['new'][i]
Result that I want:
df = {'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4},
'new': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 4}}
But I need to run it on big data with more than 450k rows. Do you have any idea how to optimize it, or other modules for this problem? Thanks in advance.
You can iterate over the sub-dataframes grouped by 'join' and start a new label whenever the intersection with the previous group's 'Point' values is empty (this assumes that overlapping groups appear consecutively, as they do in your example):
df = pd.DataFrame()
new = None
point_set = set()
for j, sub_df in dk.groupby('join'):
    if new is None or not set(sub_df['Point']).intersection(point_set):
        new = j
    point_set = set(sub_df['Point'])
    sub_df['new'] = new
    df = pd.concat([df, sub_df])
print(df)
Output:
Point join new
0 15 0 0
1 16 0 0
2 16 1 0
3 17 1 0
4 17 2 0
5 18 2 0
6 18 3 0
7 19 3 0
8 20 4 4
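If overlapping groups are not guaranteed to be consecutive, a disjoint-set (union-find) structure over the join labels merges all transitively connected groups in near-linear time, which scales to 450k rows. This is a sketch of that alternative, not the code above; the `find`/`union` helpers are my own names:

```python
import pandas as pd

dk = pd.DataFrame({'Point': [15, 16, 16, 17, 17, 18, 18, 19, 20],
                   'join':  [0, 0, 1, 1, 2, 2, 3, 3, 4]})

# Union-find over 'join' labels: two groups sharing a Point get merged.
parent = {}

def find(x):
    # Find the root of x's set, compressing the path along the way.
    root = x
    while parent.setdefault(root, root) != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        # Keep the smaller label as the representative.
        if rb < ra:
            ra, rb = rb, ra
        parent[rb] = ra

# For every Point, merge all join labels that contain it.
for _, joins in dk.groupby('Point')['join']:
    labels = joins.unique()
    for other in labels[1:]:
        union(labels[0], other)

dk['new'] = dk['join'].map(find)
print(dk)
```

Each row is touched a constant number of times apart from the near-constant-time `find` calls, so this avoids the O(n²) pairwise comparison entirely and does not depend on the order of the groups.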