Weighted mean for multiple columns in a data frame in Pandas
I have a data frame like the following:
Class| Student| V1| V2| V3| wb
A| Max| 10| 12| 14| 1
A| Ann| 9| 6| 7| 0.9
B| Tom| 6| 7| 10| 0.3
B| Dick| 3| 8| 7| 0.7
C| Dibs| 5| 2| 3| 0.8
C| Mock| 6| 4| 3| 0.6
D| Sunny| 3| 4| 5| 0.9
D| Lock| 8| 3| 6| 1
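I want to calculate the weighted average of V1, V2, and V3 grouped by Class, using wb as the weight. The result should look like this: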
Class V1_M V2_M V3_M
A 9 8 3
B 5 3 3
C 4 4 3
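So far I can do this one column at a time, producing a separate data frame for each column, but that feels very inefficient.
Here is the code for one variable: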
import pandas as pd
import numpy as np
def wtdavg(frame, var, wb):
    d = frame[var]
    w = frame[wb]
    return (d * w).sum() / w.sum()
df = pd.read_csv('Sample.csv')
Matrix = df.groupby(['Class']).apply(wtdavg,var='V2',wb='wb')
print(Matrix)
I am a newbie with one week of pandas experience. Thanks in advance.
Max
# use apply to calculate the weighted mean for all 3 columns in one go
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x.V1*x.wb)/sum(x.wb), sum(x.V2*x.wb)/sum(x.wb), sum(x.V3*x.wb)/sum(x.wb)]))
#rename columns
df2.columns=['V1_M','V2_M','V3_M']
df2
Out[858]:
V1_M V2_M V3_M
Class
A 9.526316 9.157895 10.684211
B 3.900000 7.700000 7.900000
C 5.428571 2.857143 3.000000
D 5.631579 3.473684 5.526316
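As a possible variation on the snippet above, np.average accepts a weights argument and can average all value columns in one call per group. This is only a minimal sketch and not part of the original answer: the frame is rebuilt inline from the question's table instead of reading Sample.csv, and value_cols and group_weighted_mean are names made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question (stands in for Sample.csv).
df = pd.DataFrame({
    "Class": list("AABBCCDD"),
    "Student": ["Max", "Ann", "Tom", "Dick", "Dibs", "Mock", "Sunny", "Lock"],
    "V1": [10, 9, 6, 3, 5, 6, 3, 8],
    "V2": [12, 6, 7, 8, 2, 4, 4, 3],
    "V3": [14, 7, 10, 7, 3, 3, 5, 6],
    "wb": [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

value_cols = ["V1", "V2", "V3"]  # hypothetical name for the columns to average


def group_weighted_mean(group):
    # np.average applies the row weights along axis 0 to every column at once.
    means = np.average(
        group[value_cols].to_numpy(),
        weights=group["wb"].to_numpy(),
        axis=0,
    )
    return pd.Series(means, index=[c + "_M" for c in value_cols])


# Selecting the needed columns first keeps the grouping column out of apply.
result = df.groupby("Class")[value_cols + ["wb"]].apply(group_weighted_mean)
print(result)
# Class A, V1_M is (10*1 + 9*0.9) / (1 + 0.9), about 9.526316
```

The result is indexed by Class, the same layout as df2 above.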
Update (for a dynamic list of value columns, i.e. var_cols):
#put all your variable names in a list (can be copied over from df.columns)
var_cols = ['V1', 'V2', 'V3']
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x[v] * x.wb) / sum(x.wb) for v in var_cols]))
df2.columns = [e+'_M' for e in var_cols]
V1_M V2_M V3_M
Class
A 9.526316 9.157895 10.684211
B 3.900000 7.700000 7.900000
C 5.428571 2.857143 3.000000
D 5.631579 3.473684 5.526316
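If apply becomes slow on a larger frame, the same numbers can usually be obtained without it: multiply the value columns by the weights once, then divide the per-class sums. The following is a sketch under the same assumptions as the data in the question (the frame is rebuilt inline, and the names weighted, numerator, and denominator are made up here).

```python
import pandas as pd

# Sample data copied from the table in the question (stands in for Sample.csv).
df = pd.DataFrame({
    "Class": list("AABBCCDD"),
    "Student": ["Max", "Ann", "Tom", "Dick", "Dibs", "Mock", "Sunny", "Lock"],
    "V1": [10, 9, 6, 3, 5, 6, 3, 8],
    "V2": [12, 6, 7, 8, 2, 4, 4, 3],
    "V3": [14, 7, 10, 7, 3, 3, 5, 6],
    "wb": [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

var_cols = ["V1", "V2", "V3"]

# Row-wise product of each value column with its weight.
weighted = df[var_cols].mul(df["wb"], axis=0)

# Sum of weighted values and sum of weights per class.
numerator = weighted.groupby(df["Class"]).sum()
denominator = df["wb"].groupby(df["Class"]).sum()

# Divide each row (class) by its total weight and rename the columns.
df2 = numerator.div(denominator, axis=0).add_suffix("_M")
print(df2)
```

This relies on every class having a non-zero total weight, which holds for the sample data.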
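More general solutions
First, create the weighted mean for all columns other than Student and Class (wb, the weight column, is dropped as well):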
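# errors='ignore' keeps the drop working on pandas versions that do not pass the grouping column (Class) into apply
df2 = df.drop('Student', axis=1) \
    .groupby('Class') \
    .apply(lambda x: x.drop(['Class', 'wb'], axis=1, errors='ignore').mul(x.wb, 0).sum() / (x.wb).sum()) \
    .add_suffix('_M') \
    .reset_index()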
print (df2)
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
Or you can name the columns for the weighted mean explicitly:
df2 = df.groupby('Class') \
    .apply(lambda x: x[['V1', 'V2', 'V3']].mul(x.wb, 0).sum() / (x.wb).sum()) \
    .add_suffix('_M') \
    .reset_index()
print (df2)
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
More generally, use filter to select all columns whose names start with V:
df2 = df.groupby('Class') \
    .apply(lambda x: x.filter(regex='^V').mul(x.wb, 0).sum() / (x.wb).sum()) \
    .add_suffix('_M') \
    .reset_index()
print (df2)
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
import pandas as pd
import numpy as np

def wtdavg(frame, var, wb):
    d = frame[var]
    w = frame[wb]
    return (d * w).sum() / w.sum()

df = pd.read_csv('Sample.csv')
temp_df = pd.DataFrame()
# Note: this loop takes the plain (unweighted) mean of every int64 column per Class;
# wtdavg is defined above but is not called here.
for column in df.columns:
    if df[column].dtype == np.int64:
        temp_S = pd.DataFrame(df[column].groupby(df['Class']).mean())
        frames = [temp_df, temp_S]
        temp_df = pd.concat(frames, axis='columns')
print(temp_df)
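The loop above averages each integer column without using the weights. One way it might be adapted so that wtdavg is actually applied per column is sketched below. This is an assumption about the intent and not the answerer's code: the frame is rebuilt inline from the question's table, numeric columns are picked with select_dtypes instead of the int64 check, and value_cols and weighted are illustrative names.

```python
import pandas as pd

# Sample data copied from the table in the question (stands in for Sample.csv).
df = pd.DataFrame({
    "Class": list("AABBCCDD"),
    "Student": ["Max", "Ann", "Tom", "Dick", "Dibs", "Mock", "Sunny", "Lock"],
    "V1": [10, 9, 6, 3, 5, 6, 3, 8],
    "V2": [12, 6, 7, 8, 2, 4, 4, 3],
    "V3": [14, 7, 10, 7, 3, 3, 5, 6],
    "wb": [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})


def wtdavg(frame, var, wb):
    d = frame[var]
    w = frame[wb]
    return (d * w).sum() / w.sum()


# All numeric columns except the weight column itself.
value_cols = df.select_dtypes(include="number").columns.drop("wb")

# One weighted-mean Series per column, keyed by the output column name,
# then joined side by side into a single frame indexed by Class.
weighted = pd.concat(
    {
        col + "_M": df.groupby("Class")[[col, "wb"]].apply(wtdavg, var=col, wb="wb")
        for col in value_cols
    },
    axis="columns",
)
print(weighted)
```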