简体   繁体   中英

Create a column which is the mean of multiple columns in a data frame in pandas

So I've looked at multiple potential solutions but none seem to work.

Basically, I want to create a new column in my data frame which is the mean of multiple other columns. I want this mean to exclude NaN values but still calculate the mean even if there are NaN values in the row.

I have a data frame which looks something like this (but actually Q222-229):

ID   Q1   Q2   Q3   Q4   Q5
1    4    NaN  NaN  NaN  NaN
2    5    7    8    NaN  NaN
3    7    1    2    NaN  NaN
4    2    2    3    4    1
5    1    3    NaN  NaN  NaN

And I want to create a column which is the mean of Q1, Q2, Q3, Q4, Q5 ie:

ID   Q1   Q2   Q3   Q4   Q5   avg_age
1    4    NaN  NaN  NaN  NaN  4
2    5    7    8    NaN  NaN  5.5
3    7    1    2    NaN  NaN  3.5
4    2    2    3    4    1    2
5    1    3    NaN  NaN  NaN  2

(ignore values)

However, every method I have tried returns NaN values in the avg_age column which is making me think that when ignoring the NaN values, pandas is ignoring the whole row. But I dont want this to happen, instead I want the mean returned with the NaN values ignored.

Here is what I have tried so far:

1.
    avg_age = s.loc[: , "Q222":"Q229"]
    avg_age = avg_age.mean(axis=1)
    s = pd.concat([s, avg_age], axis=1)

2.
    s['avg_age'] = s[['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']].mean(axis=1)

3.

    avg_age = ['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']
    s.loc[:, 'avg_age'] = s[avg_age].mean(axis=1)

I am not sure if there is something wrong with the way I have coded the values initially so here is my code for reference:

#Changing age variable inputs

s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q223'] = s['Q223'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q224'] = s['Q224'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q225'] = s['Q225'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q226'] = s['Q226'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q227'] = s['Q227'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q228'] = s['Q228'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q229'] = s['Q229'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])

s['Q222'] = s['Q222'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q223'] = s['Q223'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q224'] = s['Q224'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q225'] = s['Q225'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q226'] = s['Q226'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q227'] = s['Q227'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q228'] = s['Q228'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q229'] = s['Q229'].replace(['0-4', '05-11', '12-15', '16-17'], '1')

Thanks in advance to anyone who is able to help!

skipna=True

Can get it with a list comprehension to get the columns to average, and mean() with:

df['ave_age'] = df[[col for col in df.columns if 'Q' in col]].mean(axis = 1,skipna = True)

The default behavior of DataFrame.mean() should do what you want.

Here's an example showing taking a mean over a subset of the columns and placing it in a newly created column:

In[19]: tmp
Out[19]: 
   a  b    c
0  1  2  5.0
1  2  3  6.0
2  3  4  NaN

In[24]: tmp['mean'] = tmp[['b', 'c']].mean(axis=1)

In[25]: tmp
Out[25]: 
   a  b    c  mean
0  1  2  5.0   3.5
1  2  3  6.0   4.5
2  3  4  NaN   4.0

As for what's going wrong in your code:

 s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"], ['2','3','4','5', '6', '7', '8', np.NaN])

You don't have numerical values (ie 2, 3, 4) in your data frame, you have strings ('2', '3', and '4'). The DataFrame.mean() function is treating these strings as NaN, so you are getting NaN as the result for all your mean calculations.

Try filling your frame with numbers, like so:

 s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                          [2, 3, 4, 5, 6, 7, 8, np.NaN])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM