How to optimize regrouping code for dataframe
I want to optimize the code that regroups my pandas DataFrame (dk) by join:
dk = pd.DataFrame({'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
                   'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4}})
If two groups with different join values share the same Point, both groups should be assigned a single join, and likewise across the whole DataFrame. I did it with this simple code:
dk['new'] = dk['join']
for i in dk.index:
    for j in range(i+1, dk.shape[0]):
        if dk['Point'][i] == dk['Point'][j]:
            dk['new'][j] = dk['join'][i]
            dk.loc[(dk['join'] == dk['join'][j]), 'new'] = dk['new'][i]
The result I want:
df = {'Point': {0: 15, 1: 16, 2: 16, 3: 17, 4: 17, 5: 18, 6: 18, 7: 19, 8: 20},
      'join': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4},
      'new': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 4}}
But I need to run it on large data with more than 450k rows. Do you know how to optimize this, or another module suited to the problem? (Thanks in advance)
You can iterate over the sub-DataFrames grouped by 'join' and start a new group whenever the intersection with the previous 'Point' values is empty
(I don't know whether 'Point' is always increasing, but this covers the case where it isn't):
df = pd.DataFrame()
new = None
point_set = set()
for j, sub_df in dk.groupby('join'):
    # Start a new group only when this 'join' shares no Point
    # with the previous one.
    if new is None or not set(sub_df['Point']).intersection(point_set):
        new = j
    point_set = set(sub_df['Point'])
    sub_df['new'] = new
    df = pd.concat([df, sub_df])
print(df)
Output:
Point join new
0 15 0 0
1 16 0 0
2 16 1 0
3 17 1 0
4 17 2 0
5 18 2 0
6 18 3 0
7 19 3 0
8 20 4 4
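The loop above only compares each group against the Points of the group immediately before it, which works when overlapping joins are adjacent (as in this sample). For the general case, and for 450k+ rows, the problem is really connected components: merge every pair of join labels that share a Point. A union-find over the join labels does this in near-linear time. This is a sketch under that reading of the problem, using only pandas; the helpers `find`/`union` are illustrative names, not a library API:

```python
import pandas as pd

dk = pd.DataFrame({'Point': [15, 16, 16, 17, 17, 18, 18, 19, 20],
                   'join':  [0, 0, 1, 1, 2, 2, 3, 3, 4]})

# Union-find over join labels: two joins end up in the same component
# when any Point appears in both (directly or via a chain of Points).
parent = {}

def find(x):
    # Find the root of x's component, compressing the path as we go.
    root = x
    while parent.setdefault(root, root) != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        # Keep the smaller join label as the representative,
        # so 'new' matches the desired output above.
        if rb < ra:
            ra, rb = rb, ra
        parent[rb] = ra

# Merge all joins that share a Point.
for _, joins in dk.groupby('Point')['join']:
    first = joins.iloc[0]
    for other in joins.iloc[1:]:
        union(first, other)

dk['new'] = dk['join'].map(find)
# 'new' is now 0 for joins 0-3 and 4 for join 4.
```

Unlike the sequential loop, this does not depend on the order of the groups or on 'Point' being sorted, and each row is visited a constant number of times.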