Reshaping a data frame and applying a calculation for each row
I have a data frame like this:
import pandas as pd

df = pd.DataFrame({
    'family': ["A", "A", "B", "B"],
    'V1': [5, 5, 40, 10],
    'V2': [50, 10, 180, 20],
    'gr_0': ["all", "all", "all", "all"],
    'gr_1': ["m1", "m1", "m2", "m3"],
    'gr_2': ["m12", "m12", "m12", "m9"],
    'gr_3': ["NO", "m14", "m15", "NO"],
})
I would like to transform it into the following:
df_new = pd.DataFrame({
    'family': ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    'gr': ["all", "m1", "m12", "m14", "all", "m2", "m3", "m12", "m9", "m15"],
    'calc(sumV2/sumV1)': [6, 6, 6, 2, 4, 4.5, 2, 4.5, 2, 4.5],
})
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m2 4.5
6 B m3 2.0
7 B m12 4.5
8 B m9 2.0
9 B m15 4.5
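To arrive at df_new, each distinct value in the gr_* columns becomes one row per family, and the calculation is sum(V2) / sum(V1) over the rows that carry that value.
I am new to Python, and writing this without hard-coding the group names seems complicated to me. Preferably I would not list the "NO" records in df_new, but it is fine if they remain in the output.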
You can do it like this:
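# The chained call must be wrapped in parentheses to span several lines.
df_new = (df.melt(id_vars=['family', 'V1', 'V2'])
            .groupby(['family', 'value'])
            .apply(lambda x: x.V2.sum() / x.V1.sum())
            .reset_index(name='calc(sumV2/sumV1)'))
# Drop the placeholder "NO" groups.
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
print(df_new)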
family value calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
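To see why this works, it may help to look at the intermediate frame that melt produces before the groupby. The following is a minimal sketch using the data from the question; the variable name melted is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'family': ["A", "A", "B", "B"],
    'V1': [5, 5, 40, 10],
    'V2': [50, 10, 180, 20],
    'gr_0': ["all", "all", "all", "all"],
    'gr_1': ["m1", "m1", "m2", "m3"],
    'gr_2': ["m12", "m12", "m12", "m9"],
    'gr_3': ["NO", "m14", "m15", "NO"],
})

# melt keeps family, V1 and V2 on every row and stacks the four gr_* columns
# into two new columns: 'variable' (the old column name) and 'value' (its content).
melted = df.melt(id_vars=['family', 'V1', 'V2'])

print(melted.shape)   # 4 original rows x 4 gr_* columns = 16 rows, 5 columns
print(melted.head())
```

Each original row now appears once per gr_* column, so grouping on family and value sums V1 and V2 over exactly the rows that carry a given group label.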
melt + groupby:
v = df.melt(id_vars=['family', 'V1', 'V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
# Sum only the numeric columns; the leftover 'variable' column holds strings.
x = w.groupby(['family', 'gr'])[['V1', 'V2']].sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
family gr calc(sumV2/sumV1)
0 A all 6.0
1 A m1 6.0
2 A m12 6.0
3 A m14 2.0
4 B all 4.0
5 B m12 4.5
6 B m15 4.5
7 B m2 4.5
8 B m3 2.0
9 B m9 2.0
This approach is similar to the other answer, but it has the advantage of being fully vectorized, since it avoids apply.
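The vectorized steps could be wrapped in a small helper, sketched below. The function name ratio_by_group and its skip parameter are hypothetical and not part of either answer; the column names are taken from the question.

```python
import pandas as pd

def ratio_by_group(df, skip='NO'):
    """Return sum(V2) / sum(V1) per family and group label (hypothetical helper)."""
    # Stack the gr_* columns into a single 'gr' column.
    long = df.melt(id_vars=['family', 'V1', 'V2'], value_name='gr')
    # Drop the placeholder label before aggregating.
    long = long.loc[long['gr'] != skip]
    # One vectorized sum per (family, gr) pair, then one vectorized division.
    sums = long.groupby(['family', 'gr'])[['V1', 'V2']].sum()
    return (sums['V2'] / sums['V1']).reset_index(name='calc(sumV2/sumV1)')

df = pd.DataFrame({
    'family': ["A", "A", "B", "B"],
    'V1': [5, 5, 40, 10],
    'V2': [50, 10, 180, 20],
    'gr_0': ["all", "all", "all", "all"],
    'gr_1': ["m1", "m1", "m2", "m3"],
    'gr_2': ["m12", "m12", "m12", "m9"],
    'gr_3': ["NO", "m14", "m15", "NO"],
})

result = ratio_by_group(df)
print(result)
```

Note that groupby sorts the group labels, so the rows come out in the sorted order shown in the outputs above rather than in the order the asker listed them.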
Performance:
import numpy as np  # needed for the random test data

a = np.random.randint(1, 1000, (1_000_000, 7))
df = pd.DataFrame(a, columns=['family', 'V1', 'V2', 'gr_0', 'gr_1', 'gr_2', 'gr_3'])
df[['gr_0', 'gr_1', 'gr_2', 'gr_3']] = df[['gr_0', 'gr_1', 'gr_2', 'gr_3']].astype(str)
%%timeit
v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
w = v.loc[v.gr != 'NO']
x = w.groupby(['family', 'gr']).sum()
(x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
2.71 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df_new = (df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
.apply(lambda x: x.V2.sum()/x.V1.sum())
.reset_index(name='calc(sumV2/sumV1)'))
df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
5min 24s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
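The timings only compare like with like if both approaches return the same numbers. A minimal sketch of such a check on a much smaller random frame follows; the frame size, the seed, and the names small, via_apply, and via_sum are assumptions chosen for illustration, not part of the original benchmark.

```python
import numpy as np
import pandas as pd

# Small random frame with the same layout as the benchmark data (size is an assumption).
rng = np.random.default_rng(0)
cols = ['family', 'V1', 'V2', 'gr_0', 'gr_1', 'gr_2', 'gr_3']
small = pd.DataFrame(rng.integers(1, 10, (200, 7)), columns=cols)
gr_cols = ['gr_0', 'gr_1', 'gr_2', 'gr_3']
small[gr_cols] = small[gr_cols].astype(str)

long = small.melt(id_vars=['family', 'V1', 'V2'], value_name='gr')
grouped = long.groupby(['family', 'gr'])[['V1', 'V2']]

# apply-based version: one Python-level call per group.
via_apply = grouped.apply(lambda g: g['V2'].sum() / g['V1'].sum())

# Vectorized version: one grouped sum, then one division.
sums = grouped.sum()
via_sum = sums['V2'] / sums['V1']

same = np.allclose(via_apply.to_numpy(), via_sum.to_numpy())
print(same)
```

Both results are indexed by the same sorted (family, gr) pairs, so comparing the underlying arrays position by position is sufficient here.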