簡體   English   中英

Pandas中數據框中多列的加權平均值

[英]Weighted mean for multiple columns in a data frame in Pandas

我有一個如下的數據框

Class|  Student|    V1| V2| V3| wb

A|      Max|        10| 12| 14| 1

A|      Ann|        9|  6|  7|  0.9

B|      Tom|        6|  7|  10| 0.3

B|      Dick|       3|  8|  7|  0.7

C|      Dibs|       5|  2|  3|  0.8

C|      Mock|       6|  4|  3|  0.6

D|      Sunny|      3|  4|  5|  0.9

D|      Lock|       8|  3|  6|  1

我想計算按類分組的V1,V2,V3的加權平均值,結果應該如下所示

Class  V1_M  V2_M V3_M

A   9  8   3

B   5  3   3

C   4  4   3

到目前為止,我可以為每列分隔數據框。 但我覺得效率很低

這里是1個變量的代碼

import pandas as pd
import numpy as np

def wtdavg(frame, var, wb):
  d = frame[var]
  w = frame[wb]
  return (d * w).sum() / w.sum()

df = pd.read_csv('Sample.csv')
Matrix = df.groupby(['Class']).apply(wtdavg,var='V2',wb='wb')
print(Matrix)

我是1周大熊貓經驗的新手。 提前致謝。

馬克斯

#use apply to calculate weighted mean for alll 3 columns in one go.
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x.V1*x.wb)/sum(x.wb), sum(x.V2*x.wb)/sum(x.wb), sum(x.V3*x.wb)/sum(x.wb)]))
#rename columns
df2.columns=['V1_M','V2_M','V3_M']

df2
Out[858]: 
           V1_M      V2_M       V3_M
Class                               
A      9.526316  9.157895  10.684211
B      3.900000  7.700000   7.900000
C      5.428571  2.857143   3.000000
D      5.631579  3.473684   5.526316

更新(值列的動態列表,即var_cols

#put all your variable names in a list (can be copied over from df.columns)
var_cols = ['V1', 'V2', 'V3']
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x[v] * x.wb) / sum(x.wb) for v in var_cols]))
df2.columns = [e+'_M' for e in var_cols]
           V1_M      V2_M       V3_M
Class                               
A      9.526316  9.157895  10.684211
B      3.900000  7.700000   7.900000
C      5.428571  2.857143   3.000000
D      5.631579  3.473684   5.526316

更一般的解決方案

1.為沒有StudentClass所有欄目創建加權平均值:

df2 = df.drop('Student', axis=1) \
        .groupby('Class') \
        .apply(lambda x: x.drop(['Class', 'wb'], axis=1).mul(x.wb, 0).sum() / (x.wb).sum()) \
        .add_suffix('_M') \
        .reset_index()
print (df2)
  Class      V1_M      V2_M       V3_M
0     A  9.526316  9.157895  10.684211
1     B  3.900000  7.700000   7.900000
2     C  5.428571  2.857143   3.000000
3     D  5.631579  3.473684   5.526316

或者您可以為加權平均值定義列:

df2 = df.groupby('Class') \
        .apply(lambda x: x[['V1', 'V2', 'V3']].mul(x.wb, 0).sum() / (x.wb).sum()) \
        .add_suffix('_M') \
        .reset_index()
print (df2)
  Class      V1_M      V2_M       V3_M
0     A  9.526316  9.157895  10.684211
1     B  3.900000  7.700000   7.900000
2     C  5.428571  2.857143   3.000000
3     D  5.631579  3.473684   5.526316

更一般的是過濾所有列以V開頭filter

df2 = df.groupby('Class') \
        .apply(lambda x: x.filter(regex='^V').mul(x.wb, 0).sum() / (x.wb).sum()) \
        .add_suffix('_M') \
        .reset_index()
print (df2)
  Class      V1_M      V2_M       V3_M
0     A  9.526316  9.157895  10.684211
1     B  3.900000  7.700000   7.900000
2     C  5.428571  2.857143   3.000000
3     D  5.631579  3.473684   5.526316
import pandas as pd
import numpy as np

def wtdavg(frame, var, wb):
  d = frame[var]
  w = frame[wb]
  return (d * w).sum() / w.sum()

df = pd.read_csv('Sample.csv')
temp_df = pd.DataFrame()
for column in df.columns:
    if df[column].dtype == np.int64:
        temp_S = pd.DataFrame( df[column].groupby(df['Class']).mean())
        frames = [temp_df, temp_S]
        temp_df = pd.concat(frames, axis = 'columns')
print temp_df

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM