Forward fill or back fill NaN values in Pandas columns based on grouping of other columns

I have a dataframe as follows:

import pandas as pd

df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region':['Americas','NaN','NaN','Asia','Europe','NaN','NaN'],
                   'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal':['Bison','NaN','Golden Eagle','Tiger','Lion','Lion','NaN'],
                   'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})

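I want to group by Country and Flower, and forward fill or back fill the Region and Animal columns wherever they have missing values. The Game column should stay unchanged.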

I tried this, but it didn't work:

df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())

Also:

df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()

I'd like to know how to go about this.

This works, but it drops the Game column:

df = df.replace({'NaN': np.nan})
df.groupby(['Country','Flower'])[['Animal', 'Region']].bfill().ffill()

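If I use transform, I get a length mismatch. Also note that this is a sample dataframe: I typed 'NaN' as a string here, but in my original frame the values are np.nan.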

First, you need to know that the string 'NaN' is not NaN:

df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]: 
0    Americas
1    Americas
2         NaN  # this group has only a single row, so it stays NaN
3        Asia
4      Europe
5      Europe
6      Europe
Name: Region, dtype: object
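
To make the first point concrete, here is a minimal sketch (the series `s` and the variable names are illustrative, not from the question) showing that pandas does not treat the string 'NaN' as missing until it is replaced with np.nan:

```python
import numpy as np
import pandas as pd

# Illustrative series: the middle value is the *string* 'NaN', not a missing value
s = pd.Series(['Americas', 'NaN', 'Asia'])

# isna() only detects real missing values, so the string is not counted
n_before = int(s.isna().sum())

# Replace the placeholder string with a real missing value
cleaned = s.replace({'NaN': np.nan})
n_after = int(cleaned.isna().sum())

print(n_before, n_after)
```

Only after the replacement do ffill and bfill have anything to fill.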

Second, if you need to chain two functions within each group in pandas (here bfill and ffill), you need apply:

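# group_keys=False keeps the original row index so that update() can align on it
df.update(df.groupby(['Country','Flower'], group_keys=False)[['Animal', 'Region']].apply(lambda x: x.bfill().ffill()))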
df
Out[119]: 
         Animal Country     Flower      Game    Region
0         Bison     USA       Rose  Baseball  Americas
1         Bison     USA       Rose  Baseball  Americas
2  Golden Eagle     MEX       Lily    soccer       NaN
3         Tiger     IND     Orchid    hockey      Asia
4          Lion      UK  Dandelion   cricket    Europe
5          Lion      UK  Dandelion   cricket    Europe
6          Lion      UK  Dandelion   cricket    Europe
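
As a possible alternative to apply plus update, the same per-group fill can probably be written with transform, which returns a result aligned to the original rows, so assigning it back leaves Game untouched. This is a sketch rather than the answer's own method; the names `keys`, `cols` and `filled` are made up for the example, and it assumes the missing values are real np.nan.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
    'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
    'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
    'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
    'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket'],
})

keys = ['Country', 'Flower']   # grouping columns
cols = ['Region', 'Animal']    # columns to fill (same order as in `filled`)

# Back fill, then forward fill, inside each (Country, Flower) group only
filled = df.groupby(keys)[cols].transform(lambda s: s.bfill().ffill())

# The result has the same index as df, so it can be assigned straight back
df[cols] = filled
print(df)
```

The single-row MEX/Lily group still has no Region to borrow from, so that value stays missing.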

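The code you provided actually works if you change your data so that it really contains np.nan. Although NaNs are displayed as the plain text "NaN", you cannot build a dataframe by typing that text by hand, because it will be interpreted as a string rather than as an actual missing value.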

import pandas as pd
import numpy as np

df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region':['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                   'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal':['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                   'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})

Then this:

df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())

actually produces this:

         Animal Country     Flower      Game    Region
0         Bison     USA       Rose  Baseball  Americas
1           NaN     USA       Rose  Baseball  Americas
2  Golden Eagle     MEX       Lily    soccer       NaN
3         Tiger     IND     Orchid    hockey      Asia
4          Lion      UK  Dandelion   cricket    Europe
5          Lion      UK  Dandelion   cricket    Europe
6           NaN      UK  Dandelion   cricket    Europe
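
Because MEX and Lily form a group with only a single row, and its Region value is NaN, fillna cannot find a suitable value within the group. If we catch the exception while filling with the group mode, the values that have no usable group value are left as they are. A plain ffill and bfill is then applied to cover those remaining values.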

df_stack = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
                         'Region':['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                         'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                         'Animal':['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                         'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
print("-------Before imputation------")
print(df_stack)

def fillna_Region(grp):
    # Fill missing values with the group's mode; if the group has no mode
    # (every value is missing), report it and return the group unchanged.
    try:
        return grp.fillna(grp.mode()[0])
    except Exception as e:
        print('Error as no corresponding group: ' + str(e))
        return grp

df_stack["Region"] = df_stack["Region"].fillna(df_stack.groupby(['Country','Flower'])['Region'].transform(fillna_Region))
df_stack["Animal"] = df_stack["Animal"].fillna(df_stack.groupby(['Country','Flower'])['Animal'].transform(fillna_Region))

# Fill whatever is still missing from neighbouring rows, regardless of group
df_stack = df_stack.ffill(axis=0)
df_stack = df_stack.bfill(axis=0)

print("-------After imputation------")
print(df_stack)
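
The group-mode step above could also be written without try/except by checking whether the mode is empty. The following is only a sketch under that assumption: the helper name `fill_with_group_mode` is hypothetical, and it deliberately omits the final frame-wide ffill/bfill, so values in groups with nothing to borrow from stay missing.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(series):
    """Fill missing values with the most frequent value of the group, if any."""
    modes = series.mode()          # mode() ignores missing values by default
    if modes.empty:                # the whole group is missing: nothing to fill with
        return series
    return series.fillna(modes.iloc[0])

df_mode = pd.DataFrame({
    'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
    'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
    'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
    'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
    'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket'],
})

# Apply the helper per (Country, Flower) group, one column at a time
for col in ['Region', 'Animal']:
    df_mode[col] = df_mode.groupby(['Country', 'Flower'])[col].transform(fill_with_group_mode)

print(df_mode)
```

Skipping the frame-wide fill avoids copying a Region from an unrelated neighbouring row into the MEX row; whether that is desirable depends on the data.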
