
Mean of all rows which meet a certain condition in Pandas dataframe

Say I've got the dataframe:

    Code  Value
1   X     135
2   D     298
3   F     301
4   G     12
5   D     203
6   X     212
7   D     401
8   D     125

I want to add a new column to this dataframe that holds, for each row, the mean of 'Value' over all rows that share that row's 'Code'.

For instance, in row 1, the 'Mean' column would hold the mean of all rows where Code is 'X'.

You can use pd.Series.map() this way:

df['Code_mean'] = df.Code.map(df.groupby(['Code']).Value.mean())

>>> df
Out[]:
  Code  Value  Code_mean
1    X    135     173.50
2    D    298     256.75
3    F    301     301.00
4    G     12      12.00
5    D    203     256.75
6    X    212     173.50
7    D    401     256.75
8    D    125     256.75
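
To see why this works, it can help to split the one-liner into its two steps. The sketch below rebuilds the example dataframe from the question; the variable name `code_means` is only illustrative. `groupby(...).mean()` yields a Series indexed by code, and `map` then looks up each row's code in that Series.

```python
import pandas as pd

# Rebuild the example dataframe from the question (index 1..8).
df = pd.DataFrame(
    {
        "Code": ["X", "D", "F", "G", "D", "X", "D", "D"],
        "Value": [135, 298, 301, 12, 203, 212, 401, 125],
    },
    index=range(1, 9),
)

# Step 1: one mean per distinct code; the result is a Series indexed by Code.
code_means = df.groupby("Code")["Value"].mean()

# Step 2: look up each row's code in that Series; the result keeps df's index.
df["Code_mean"] = df["Code"].map(code_means)

print(code_means)
print(df)
```

Because the lookup table has only one entry per distinct code, the `map` step is a plain per-row lookup rather than a second grouped computation.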

On small dataframes this seems to be faster than the transform approach.


EDIT: benchmark to answer comments

import numpy as np
import pandas as pd
from string import ascii_letters

# 10,000 rows: random uppercase codes (ascii_letters[26:] is 'A'..'Z') and random integer values
df = pd.DataFrame({
    'Code': [ascii_letters[26:][i] for i in np.random.randint(0, 26, 10000)],
    'Value': np.random.randint(0, 1024, 10000),
})

>>> %%timeit
... df['Code_mean'] = df.Code.map(df.groupby(['Code']).Value.mean())
1000 loops, best of 3: 1.45 ms per loop

# Reinit df before next timeit

>>> %%timeit
... df.assign(Code_mean=df.groupby('Code').transform('mean'))
100 loops, best of 3: 2.31 ms per loop

However, for larger dataframes (10^6 rows) the results favour transform:

import numpy as np
import pandas as pd
from string import ascii_letters

# 1,000,000 rows: random uppercase codes (ascii_letters[26:] is 'A'..'Z') and random integer values
df = pd.DataFrame({
    'Code': [ascii_letters[26:][i] for i in np.random.randint(0, 26, 1000000)],
    'Value': np.random.randint(0, 1024, 1000000),
})

>>> %%timeit
... df['Code_mean'] = df.Code.map(df.groupby(['Code']).Value.mean())
10 loops, best of 3: 95.2 ms per loop

# Reinit df before next timeit

>>> %%timeit
... df.assign(Code_mean=df.groupby('Code').transform('mean'))
10 loops, best of 3: 68.2 ms per loop
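
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard `timeit` module instead of the `%%timeit` magic. The helper names `make_df`, `with_map` and `with_transform` are made up for this example, and the timings will vary with the machine and pandas version, so no expected numbers are shown.

```python
import timeit
from string import ascii_uppercase

import numpy as np
import pandas as pd


def make_df(n_rows, seed=0):
    """Random frame like the one benchmarked above: 26 codes, values in [0, 1024)."""
    rng = np.random.default_rng(seed)
    codes = np.array(list(ascii_uppercase))[rng.integers(0, 26, n_rows)]
    return pd.DataFrame({"Code": codes, "Value": rng.integers(0, 1024, n_rows)})


def with_map(df):
    # Per-code means, then a per-row lookup.
    return df["Code"].map(df.groupby("Code")["Value"].mean())


def with_transform(df):
    # Per-code means broadcast back to every row by pandas.
    return df.groupby("Code")["Value"].transform("mean")


for n_rows in (10_000, 1_000_000):
    df = make_df(n_rows)
    # Both approaches should give the same values before speed is compared.
    assert np.allclose(with_map(df), with_transform(df))
    for func in (with_map, with_transform):
        # Best of 3 repeats, 3 calls each, reported as time per call.
        best = min(timeit.repeat(lambda: func(df), number=3, repeat=3)) / 3
        print(f"{n_rows:>9} rows  {func.__name__:<15} {best * 1e3:.2f} ms per call")
```

Neither function mutates `df`, so there is no need to reinitialise the dataframe between runs.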

This is a good application for the transform method after grouping by the codes.

>>> df['Group_means'] = df.groupby('Code').transform('mean')
>>> df
  Code  Value  Group_means
0    X    135       173.50
1    D    298       256.75
2    F    301       301.00
3    G     12        12.00
4    D    203       256.75
5    X    212       173.50
6    D    401       256.75
7    D    125       256.75
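
As a self-contained sketch of the same idea, the snippet below rebuilds the dataframe from the question and selects the `Value` column explicitly before calling `transform`. That returns a Series, so the assignment should keep working even if the dataframe has further columns.

```python
import pandas as pd

# Rebuild the example dataframe from the question (default 0..7 index).
df = pd.DataFrame(
    {
        "Code": ["X", "D", "F", "G", "D", "X", "D", "D"],
        "Value": [135, 298, 301, 12, 203, 212, 401, 125],
    }
)

# transform("mean") computes the mean per group and broadcasts it back to
# every row of that group, so the result has the same index as df.
df["Group_means"] = df.groupby("Code")["Value"].transform("mean")

print(df)
```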
