
How to optimize regrouping code for a dataframe

I want to optimize the code that regroups my pandas dataframe (dk) by its join column:

import pandas as pd

dk = pd.DataFrame({'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
                   'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4}})

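If two groups with different join values share the same Point, both groups should get a single join value. The same rule applies across the whole dataframe. I did it with this simple code: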

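dk['new'] = dk['join']
for i in dk.index:
    for j in range(i + 1, dk.shape[0]):
        if dk.loc[i, 'Point'] == dk.loc[j, 'Point']:
            # Rows i and j share a Point: relabel every row in j's join group
            # with the label of row i (.loc avoids chained assignment)
            dk.loc[dk['join'] == dk.loc[j, 'join'], 'new'] = dk.loc[i, 'new']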

The result I want:

df = {'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
 'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4},
 'new': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 4}}

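But I need to run it on large data with more than 450k rows. Do you know how to optimize it, or of another module suited to this problem? (Thanks in advance)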

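You can iterate over the sub-dataframes grouped by 'join' and start a new label (set new to the current join value) whenever the intersection with the previous group's 'Point' values is empty. I don't know whether 'Point' is always increasing, but this also covers the case where it is not: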

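parts = []
new = None
point_set = set()
for j, sub_df in dk.groupby('join'):
    # Start a new label when this group shares no Point with the previous group
    if new is None or not point_set.intersection(sub_df['Point']):
        new = j
    point_set = set(sub_df['Point'])
    parts.append(sub_df.assign(new=new))

# Concatenate once instead of growing a dataframe inside the loop
df = pd.concat(parts)
print(df)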

Output:

   Point  join  new
0     15     0    0
1     16     0    0
2     16     1    0
3     17     1    0
4     17     2    0
5     18     2    0
6     18     3    0
7     19     3    0
8     20     4    4
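
The loop above compares each join group only with the group immediately before it. If groups that share a Point are not always adjacent in join order, one possible alternative is a union-find (disjoint-set) pass over the join values. The sketch below is only an illustration under stated assumptions: the helper name merge_joins is hypothetical, and it assumes each merged group should be labeled with its smallest join value, as in the expected output from the question.

```python
import pandas as pd

dk = pd.DataFrame({'Point': [15, 16, 16, 17, 17, 18, 18, 19, 20],
                   'join': [0, 0, 1, 1, 2, 2, 3, 3, 4]})


def merge_joins(frame):
    """Hypothetical helper: label each row with the smallest 'join'
    value among all join groups connected through shared Points."""
    parent = {j: j for j in frame['join'].unique()}

    def find(j):
        # Follow parent links to the root, shortening the path on the way
        while parent[j] != j:
            parent[j] = parent[parent[j]]
            j = parent[j]
        return j

    # Union all join values that occur at the same Point
    for joins in frame.groupby('Point')['join'].unique():
        root = find(joins[0])
        for other in joins[1:]:
            other_root = find(other)
            if other_root != root:
                # Assumption: the smaller join value becomes the group label
                lo, hi = sorted((root, other_root))
                parent[hi] = lo
                root = lo

    return frame['join'].map(find)


dk['new'] = merge_joins(dk)
print(dk)
```

On the sample data this gives the same new column as the output above. It makes a single pass over the Points and avoids the nested row loop from the question, so it may be worth timing on the 450k-row data.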
