
Pandas: filling missing values by mean in each group

This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem.

Suppose I have the following dataframe:

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3

and I'd like to fill in "NaN" with the mean value in each "name" group, i.e.

  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

I'm not sure where to go after:

grouped = df.groupby('name').mean()

Thanks a bunch.
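For what it's worth, one way to proceed from that grouped-means line is to map the per-group means back onto the rows by name. This is a minimal sketch (it computes the means on the value column only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# Per-name means as a Series: A -> 1.0, B -> 2.0, C -> 3.0
grouped = df.groupby('name')['value'].mean()

# Map each row's name to its group mean, and use it only where value is NaN
df['value'] = df['value'].fillna(df['name'].map(grouped))
```

`map` looks each row's name up in the grouped Series, so the fill values line up row by row with the original frame.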

One way would be to use transform:

>>> df
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

fillna + groupby + transform + mean

This seems intuitive:

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

The groupby + transform syntax maps the groupwise mean onto the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
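The index alignment is the key point here, and it's easy to see by materializing the transformed Series before filling (a small sketch on the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})

# transform('mean') returns a Series aligned to df's original index,
# repeating each group's mean on every row of that group
means = df.groupby('name')['value'].transform('mean')

# Because the indices line up, fillna picks the right mean per row
df['value'] = df['value'].fillna(means)
```

Here `means` is `[1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0]`: one value per original row, not one per group.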

@DSM has, IMO, the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by, and multiple value columns:

df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A','A', 'B','B','B','B', 'C','C','C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)

... gives ...

  category name  other_value value
0        X    A         10.0   1.0
1        X    A          NaN   NaN
2        X    B          NaN   NaN
3        X    B         20.0   2.0
4        X    B         30.0   3.0
5        X    B         10.0   1.0
6        Y    C         30.0   3.0
7        Y    C          NaN   NaN
8        Y    C         30.0   3.0

In this generalized case we would like to group by category and name, and impute only on value.

This can be solved as follows:

df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))

Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could add the selection at the end instead, but then you would run the transformation for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to.

A performance test, increasing the dataset size by doing ...

big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df

... confirms that this increases the speed proportionally to how many columns you don't have to impute:

import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)

# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)

# 0:00:00.030022

On a final note, you can generalize even further if you want to impute more than one column, but not all of them:

df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
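As a quick check of this multi-column variant, here is a self-contained sketch on the generalized example frame (using list-style selection after the groupby, which also works on modern pandas where tuple-style selection was removed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

cols = ['value', 'other_value']
# Impute both measure columns in one pass, per (category, name) group
df[cols] = df.groupby(['category', 'name'])[cols]\
    .transform(lambda x: x.fillna(x.mean()))
```

After this, both `value` and `other_value` are fully imputed; e.g. row 2 (group X/B) gets value 2.0 and other_value 20.0, the X/B group means.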

Shortcut:

Groupby + Apply + Lambda + Fillna + Mean

>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
    0 

This solution still works if you want to group by multiple columns to replace missing values.

>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
...     'name': ['A','A', 'B','B','B','B', 'C','C','C'], 'class': list('ppqqrrsss')})
>>> df['value'] = df.groupby(['name','class'])['value'].apply(lambda x: x.fillna(x.mean()))
>>> df
   value name class
0    1.0    A     p
1    1.0    A     p
2    2.0    B     q
3    2.0    B     q
4    3.0    B     r
5    3.0    B     r
6    3.5    C     s
7    4.0    C     s
8    3.0    C     s
 

I would do it this way:

df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')

The featured, highly ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use instead:

df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))

Another approach is to run a function per group. Note that this needs apply rather than transform, because the function operates on each group as a whole DataFrame:

def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name").apply(groupMeanValue)

To summarize all of the above concerning the efficiency of the possible solutions: I have a dataset with 97,906 rows and 48 columns. I want to fill 4 columns with the median of each group. The column I want to group by has 26,200 groups.

The first solution

start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds

The second solution

start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds

I only ran the next solution on a subset, since it was taking too long.

start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds

The following solution follows the same logic as above.

start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds

So it's quite important to choose the right method. Bear in mind that I noticed that once a column was not numeric, the times went up exponentially (which makes sense, as I was computing the median).
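One way to guard against that last pitfall is to restrict the imputation to numeric columns up front, so the groupby never touches object-dtype columns at all. A minimal sketch (column names here are illustrative, not from the dataset above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B'],
                   'num': [1.0, np.nan, 3.0, np.nan],
                   'text': ['a', 'b', 'c', 'd']})

# Select numeric columns first, so median is only computed where it is defined
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df.groupby('name')[num_cols].transform('median'))
```

Non-numeric columns such as `text` pass through untouched, and the groupby only ever sees the numeric subset.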

df.fillna(df.groupby('name').transform('mean'), inplace=True)

I know that this is an old question. But I am quite surprised by the unanimity of the apply/lambda answers here.

Generally speaking, that is the second worst thing to do after iterating rows, from a timing point of view.

What I would do here is:

df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')

Or, using fillna:

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

I've checked with timeit (because, again, the unanimity of the apply/lambda-based solutions made me doubt my instinct). And it is indeed about 2.5× faster than the most upvoted solutions.
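A rough way to reproduce such a comparison yourself (a sketch; absolute timings will vary by machine and pandas version, and the gap grows with the number of groups):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})
big = pd.concat([df] * 1000, ignore_index=True)

def with_builtin():
    # Built-in 'mean' aggregation, vectorized across groups
    return big['value'].fillna(big.groupby('name')['value'].transform('mean'))

def with_lambda():
    # Python-level lambda, invoked once per group
    return big.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))

# Both approaches produce the same filled column
assert with_builtin().equals(with_lambda())

t_builtin = timeit.timeit(with_builtin, number=20)
t_lambda = timeit.timeit(with_lambda, number=20)
```

The assertion confirms the two strategies agree; comparing `t_builtin` and `t_lambda` then shows how much the per-group lambda costs on your data.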

To fill all the numeric null values with the mean grouped by "name":

num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))

You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

I just did this:

df.fillna(df.mean(), inplace=True)

All missing values within your DataFrame will be filled with the mean. If that is what you're looking for, this worked for me. It's simple, and gets the job done.
