
Get mean of multiple selected columns in a pandas dataframe

I want to calculate the mean of all the values in selected columns in a dataframe. For example, I have a dataframe with columns A, B, C, D and E and I want the mean of all the values in columns A, C and E.

import pandas as pd

df1 = pd.DataFrame( ( {'A': [1,2,3,4,5],
                      'B': [10,20,30,40,50],
                      'C': [11,21,31,41,51],
                      'D': [12,22,32,42,52],
                      'E': [13,23,33,43,53]} ) )

print( df1 )

print( "Mean of df1:", df1.mean() )

df2 = pd.concat( [df1['A'], df1['C'], df1['E'] ], ignore_index=True )
print( df2 )
print( "Mean of df2:", df2.mean() )

df3 = pd.DataFrame()
df3 = pd.concat( [ df3, df1['A'] ], ignore_index=True )
df3 = pd.concat( [ df3, df1['C'] ], ignore_index=True )
df3 = pd.concat( [ df3, df1['E'] ], ignore_index=True )
print( df3 )
print( "Mean of df3:", df3.mean() )

df2 gets me the right answer, but I need to create a new dataframe to get it.

I thought something like df1[['A', 'C', 'E']].mean() would work, but it returns the mean of each column, not the combined average. Is there a way to do this without creating a new dataframe? I also need other statistics such as .std(), .min() and .max(), so this isn't just a one-off calculation.

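Caveat: this is only correct when every row has the same number of non-missing values in the selected columns (for example, when there are no NaNs). Otherwise the mean of the row means differs from the overall mean, as the comments pointed out.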

mean = df1[['A', 'C', 'E']].mean(axis=1).mean()    
print(mean)
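To make the caveat concrete, here is a small sketch with one missing value. The dataframe `df` and the variable names are made up for illustration and are not part of the question's data; `np.nanmean` is used only as a reference for the mean over all non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column A has one missing value
df = pd.DataFrame({'A': [1.0, 2.0, np.nan],
                   'C': [11.0, 21.0, 31.0],
                   'E': [13.0, 23.0, 33.0]})
cols = ['A', 'C', 'E']

# Mean of row means: the last row averages only two values,
# yet it still counts as one third of the final result
mean_of_row_means = df[cols].mean(axis=1).mean()

# Mean over all non-missing values, each value weighted equally
pooled_mean = np.nanmean(df[cols].to_numpy())

print(mean_of_row_means, pooled_mean)
```

The two numbers differ here because the rows do not all contain the same number of values.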

You can reshape the DataFrame into a Series with a MultiIndex using DataFrame.stack and then call mean:

df2 = df1[['A', 'C', 'E']].stack()
print (df2)
0  A     1
   C    11
   E    13
1  A     2
   C    21
   E    23
2  A     3
   C    31
   E    33
3  A     4
   C    41
   E    43
4  A     5
   C    51
   E    53
dtype: int64

print( "Mean of df2:", df2.mean() )
Mean of df2: 22.333333333333332
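Since the question also asks for .std(), .min() and .max(), the stacked Series can probably serve all of them in one go. A minimal sketch, assuming the df1 from the question; the names `stacked` and `stats` are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                    'B': [10, 20, 30, 40, 50],
                    'C': [11, 21, 31, 41, 51],
                    'D': [12, 22, 32, 42, 52],
                    'E': [13, 23, 33, 43, 53]})

# One long Series holding every value from columns A, C and E
stacked = df1[['A', 'C', 'E']].stack()

# Series.agg with a list of names returns one statistic per name
stats = stacked.agg(['mean', 'std', 'min', 'max'])
print(stats)
```

Note that the std reported here is the pandas default (sample standard deviation, ddof=1).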

Another idea is to convert the values to a 2D NumPy array and then use np.mean:

import numpy as np

df21 = df1[['A', 'C', 'E']]
print( df21 )
   A   C   E
0  1  11  13
1  2  21  23
2  3  31  33
3  4  41  43
4  5  51  53

print(df21.to_numpy())
[[ 1 11 13]
 [ 2 21 23]
 [ 3 31 33]
 [ 4 41 43]
 [ 5 51 53]]

print( "Mean of df21:", np.mean(df21.to_numpy()) )
Mean of df21: 22.333333333333332
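One thing to watch if you extend the NumPy approach to std: NumPy's std defaults to ddof=0 (population standard deviation), while pandas' std defaults to ddof=1 (sample standard deviation), so the two give different numbers unless ddof is set explicitly. A short sketch, assuming the df1 from the question; the variable names are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                    'B': [10, 20, 30, 40, 50],
                    'C': [11, 21, 31, 41, 51],
                    'D': [12, 22, 32, 42, 52],
                    'E': [13, 23, 33, 43, 53]})

values = df1[['A', 'C', 'E']].to_numpy()

overall_min = values.min()
overall_max = values.max()
std_population = values.std()      # NumPy default, ddof=0
std_sample = values.std(ddof=1)    # matches the pandas default

# Same data through pandas, for comparison
pandas_std = df1[['A', 'C', 'E']].stack().std()
print(overall_min, overall_max, std_population, std_sample, pandas_std)
```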

You have two options that I know of:

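For mean(), min() and max() you can take the mean of the column means, the min of the column mins, or the max of the column maxes. Each gives the corresponding statistic over all the elements of A, C and E (for the mean, this relies on every column having the same number of non-missing values).

So, for mean(), you can use: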

df1[['A','C','E']].apply(np.mean).mean()
df1[['A','C','E']].values.mean() 

Either of the above gives the mean of all the elements of columns A, C and E.

For min():

df1[['A','C','E']].apply(np.min).min()
df1[['A','C','E']].values.min()  

For max():

df1[['A','C','E']].apply(np.max).max()
df1[['A','C','E']].values.max() 

For std():

df1[['A','C','E']].apply(np.std).std()    # runs without error, but the result is not the std of all the elements
df1[['A','C','E']].values.std()           # gives the std of all the elements of columns A, C and E

std of std will not give the std of all the elements.
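A quick numeric check of this, assuming the df1 from the question and using NumPy's default ddof=0 throughout so the two results are comparable; the variable names are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                    'B': [10, 20, 30, 40, 50],
                    'C': [11, 21, 31, 41, 51],
                    'D': [12, 22, 32, 42, 52],
                    'E': [13, 23, 33, 43, 53]})

values = df1[['A', 'C', 'E']].to_numpy()

# Std of each column (axis=0), then the std of those three numbers
per_column_std = values.std(axis=0)
std_of_std = per_column_std.std()

# Std over all 15 values at once
pooled_std = values.std()

print(std_of_std, pooled_std)
```

The first number only measures how much the three column spreads differ from each other; it ignores the differences between the column means, which is why it comes out far smaller than the pooled value.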

