Forward fill or back fill NaN values in Pandas columns based on grouping of other columns
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
'Region':['Americas','NaN','NaN','Asia','Europe','NaN','NaN'],
'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
'Animal':['Bison','NaN','Golden Eagle','Tiger','Lion','Lion','NaN'],
'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
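I want to group by Country and Flower, and forward fill or back fill the Region and Animal columns wherever they have missing values. The Game column should stay unchanged.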
I tried this, but it did not work:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
Also:
df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()
I would like to know how to go about this.
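While this works, it drops the Game column:
import numpy as np
df = df.replace({'NaN': np.nan})
# Select several columns with a list (double brackets)
df.groupby(['Country','Flower'])[['Animal', 'Region']].bfill().ffill()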
If I use transform, I get a length mismatch. Also note that this is a sample dataframe: I typed 'NaN' as a string here, but in my original frame it is np.nan.
First, you need to know that 'NaN' (a string) is not NaN (a missing value):
import numpy as np
df = df.replace({'NaN': np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]:
0    Americas
1    Americas
2         NaN  # this group has only a single row, so it stays NaN
3        Asia
4      Europe
5      Europe
6      Europe
Name: Region, dtype: object
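To see the difference concretely, a quick check along these lines may help. It is only a sketch: the Series s below is made up for illustration and is not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: a literal string 'NaN', a real missing value, a normal value
s = pd.Series(['NaN', np.nan, 'Asia'])

# isna() flags only the real missing value, not the string
mask = s.isna()
print(mask.tolist())

# After replacing the string with np.nan, both entries count as missing
cleaned = s.replace({'NaN': np.nan})
print(cleaned.isna().tolist())
```

Until the string is replaced, ffill and bfill have nothing to fill, because pandas treats 'NaN' as ordinary text.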
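Second, if you need to chain two functions (here bfill and ffill) within each group in pandas, you need apply:
# group_keys=False keeps the original row index so that update() can align the result
df.update(df.groupby(['Country','Flower'], group_keys=False)[['Animal', 'Region']].apply(lambda x: x.bfill().ffill()))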
df
Out[119]:
         Animal Country     Flower      Game    Region
0         Bison     USA       Rose  Baseball  Americas
1         Bison     USA       Rose  Baseball  Americas
2  Golden Eagle     MEX       Lily    soccer       NaN
3         Tiger     IND     Orchid    hockey      Asia
4          Lion      UK  Dandelion   cricket    Europe
5          Lion      UK  Dandelion   cricket    Europe
6          Lion      UK  Dandelion   cricket    Europe
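If you would rather avoid update and apply, transform keeps the original row index, so its result can be assigned straight back to the two columns while Game is left alone. A minimal sketch, assuming the missing values are real np.nan; the names cols and filled are mine, not from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
    'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
    'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
    'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
    'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket'],
})

# Columns to fill; every other column (including Game) is never touched
cols = ['Region', 'Animal']

# transform returns a frame with the same index as df, so there is no length mismatch
filled = df.groupby(['Country', 'Flower'])[cols].transform(lambda s: s.bfill().ffill())
df[cols] = filled
print(df)
```

Because both fills run inside the lambda, they stay within each group; the MEX/Lily row still has no Region to borrow, so it remains NaN.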
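If you change your data so that it actually contains np.nan, the code you provided works. Although NaNs are displayed as the plain text "NaN", you cannot build a dataframe by typing that text by hand, because it will be interpreted as a string rather than an actual missing value.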
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
'Region':['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
'Animal':['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
Then this:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
actually produces this:
         Animal Country     Flower      Game    Region
0         Bison     USA       Rose  Baseball  Americas
1           NaN     USA       Rose  Baseball  Americas
2  Golden Eagle     MEX       Lily    soccer       NaN
3         Tiger     IND     Orchid    hockey      Asia
4          Lion      UK  Dandelion   cricket    Europe
5          Lion      UK  Dandelion   cricket    Europe
6           NaN      UK  Dandelion   cricket    Europe
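To check beforehand which rows cannot be filled from their own group, you could count the non-missing values per group. This is only a sketch; non_missing and unfillable are illustrative names, not part of the original code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
    'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
    'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
})

# Number of non-missing Region values in each row's (Country, Flower) group
non_missing = df.groupby(['Country', 'Flower'])['Region'].transform('count')

# Rows whose group has no Region value at all; no group-wise fill can help them
unfillable = df[non_missing == 0]
print(unfillable)
```

With the sample data only the MEX/Lily row shows up, which matches the NaN that is left in Region after the group-wise ffill.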
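Because MEX and Lily form a single-row group whose Region value is NaN, fillna cannot find a suitable value within the group. If we catch the exception inside the group-mode fillna, the values without a usable group stay as they are. We then apply ffill and bfill on the whole frame to cover those remaining values.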
import numpy as np
import pandas as pd

df_stack = pd.DataFrame({'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
                         'Region': ['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                         'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                         'Animal': ['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                         'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})

print("-------Before imputation------")
print(df_stack)

def fillna_Region(grp):
    # Fill missing values with the most frequent value (mode) of the group
    try:
        return grp.fillna(grp.mode()[0])
    except Exception as e:
        # mode() is empty when the whole group is missing; return the group unchanged
        print('Error as no corresponding group: ' + str(e))
        return grp

df_stack["Region"] = df_stack["Region"].fillna(
    df_stack.groupby(['Country','Flower'])['Region'].transform(fillna_Region))
df_stack["Animal"] = df_stack["Animal"].fillna(
    df_stack.groupby(['Country','Flower'])['Animal'].transform(fillna_Region))

# Fill whatever is still missing from neighbouring rows, ignoring the groups
df_stack = df_stack.ffill(axis=0)
df_stack = df_stack.bfill(axis=0)

print("-------After imputation------")
print(df_stack)
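The same group-mode idea can also be written without try/except by checking whether the group has any observed value first. The sketch below is a variant under my own assumptions, not the code above: fill_with_group_mode is a hypothetical helper, and unlike the version above it leaves values with no usable group as NaN instead of borrowing from neighbouring rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
    'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
    'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
    'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
    'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket'],
})

def fill_with_group_mode(s):
    # Hypothetical helper: fill with the group's mode when one exists
    if s.notna().any():
        return s.fillna(s.mode().iloc[0])
    # Groups with no observed value are returned unchanged
    return s

for col in ['Region', 'Animal']:
    df[col] = df.groupby(['Country', 'Flower'])[col].transform(fill_with_group_mode)

print(df)
```

Whether to run a global ffill/bfill afterwards is a design choice: it removes every NaN, but for the MEX row it would copy a Region from an unrelated country.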