Replacing NaN values in a dataframe with pandas

I want to create a function that takes a dataframe and replaces NaN in categorical columns with the mode, and NaN in numerical columns with the mean of that column. If a categorical column has more than one mode, it should use the first mode.

I have managed to do it with the following code:

def exercise4(df):
    df1 = df.select_dtypes(np.number)
    df2 = df.select_dtypes(exclude = 'float')
    mode = df2.mode()
    df3 = df1.fillna(df.mean())
    df4 = df2.fillna(mode.iloc[0,:])
    new_df = [df3,df4]
    df5 = pd.concat(new_df,axis=1)
    new_cols = list(df.columns)
    df6 = df5[new_cols]
    return df6

But I am sure there is a far easier way to do this?

You can use:

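import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdec'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': list('bbcdeb'),
})
# Set rows 1 and 3 of columns B, C, A and E to NaN (column D stays complete)
df.iloc[[1, 3], [1, 2, 0, 4]] = np.nan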

print (df)
     A    B    C  D    E
0    a  4.0  7.0  1    b
1  NaN  NaN  NaN  3  NaN
2    c  4.0  9.0  5    c
3  NaN  NaN  NaN  7  NaN
4    e  5.0  2.0  1    e
5    c  4.0  3.0  0    b

The idea is to use DataFrame.select_dtypes to pick the non-numeric columns, compute their modes with DataFrame.mode, and select the first row by position with DataFrame.iloc. Then compute the means of the numeric columns. Since pandas 2.0, DataFrame.mean no longer skips non-numeric columns by default, so pass numeric_only=True. Series.append was also removed in pandas 2.0, so use pd.concat to join both results into one Series of replacement values and pass it to DataFrame.fillna:

modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean(numeric_only=True)
both = pd.concat([modes, means])
print (both)
A          c
E          b
B       4.25
C       5.25
D    2.83333
dtype: object

df.fillna(both, inplace=True)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b
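
The question asks for the first mode when a column has several. Here is a minimal sketch of how mode() behaves on ties, using a made-up Series `s` that is not part of the original post. pandas returns all tied modes in sorted order, so .iloc[0] picks the smallest one, not the one that appears first in the data.

```python
import pandas as pd

# Hypothetical column with a two-way tie: 'b' and 'a' both occur twice.
s = pd.Series(['b', 'a', 'b', 'a', None])

modes = s.mode()            # all tied modes, sorted; missing values are ignored
first_mode = modes.iloc[0]  # the "first" mode is the smallest one

filled = s.fillna(first_mode)
print(list(modes))
print(filled.tolist())
```

If "first" should mean first in order of appearance, a different rule would be needed. The answers here all rely on the sorted order.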

Wrapped in a function and called with DataFrame.pipe:

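def exercise4(df):
    # First mode of every non-numeric column
    modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
    # numeric_only=True is needed in pandas >= 2.0 to skip non-numeric columns
    means = df.mean(numeric_only=True)
    # Series.append was removed in pandas 2.0; pd.concat joins the two Series
    both = pd.concat([modes, means])
    df.fillna(both, inplace=True)
    return df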

df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b
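
The function above fills the caller's dataframe in place. If that side effect is not wanted, a variant that returns a filled copy might look like the sketch below. The name `fill_mean_mode` and the small `raw` frame are made up for illustration, and the code assumes pandas 2.0 or later conventions (numeric_only=True, pd.concat).

```python
import numpy as np
import pandas as pd

def fill_mean_mode(df):
    """Return a filled copy; the caller's frame is left untouched."""
    modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
    means = df.mean(numeric_only=True)
    # One Series of replacement values, indexed by column name.
    replacements = pd.concat([modes, means])
    # Without inplace=True, fillna returns a new DataFrame.
    return df.fillna(replacements)

# Hypothetical input: one text column and one numeric column.
raw = pd.DataFrame({
    'A': ['a', None, 'c', 'c'],
    'B': [4.0, np.nan, 4.0, 7.0],
})
clean = raw.pipe(fill_mean_mode)
print(clean)
```

Here `raw` still contains its NaN values afterwards, which can make the function easier to reuse in a longer chain of pipe calls.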

Another idea is to use DataFrame.apply with the result_type='expand' parameter, testing each column's dtype with pandas.api.types.is_numeric_dtype:

from pandas.api.types import is_numeric_dtype

f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
df.fillna(df.apply(f, result_type='expand'), inplace=True)
print (df)
   A     B     C  D  E
0  a  4.00  7.00  1  b
1  c  4.25  5.25  3  b
2  c  4.00  9.00  5  c
3  c  4.25  5.25  7  b
4  e  5.00  2.00  1  e
5  c  4.00  3.00  0  b

Wrapped in a function:

from pandas.api.types import is_numeric_dtype

def exercise4(df):
    f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
    df.fillna(df.apply(f, result_type='expand'), inplace=True)
    return df

df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)

Actually, you already have all the ingredients. Some of your steps can be chained, though, which makes others obsolete.

Looking at these two lines, for example:

mode = df2.mode()
df4 = df2.fillna(mode.iloc[0,:])

You could replace them with df4 = df2.fillna(df2.mode().iloc[0,:]). Then, instead of repeatedly assigning new (sub)dataframes to variables, altering them, and concatenating them, you can make these changes inplace, meaning they are applied directly to the dataframe in question. Lastly, exclude='float' might work in your particular (example) case, but what if the dataframe contains more datatypes? A string column, maybe?

My suggestion:

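def mean_mode(df):
    # select_dtypes returns a copy, so assign the filled columns back to df
    for col in df.select_dtypes(np.number):
        df[col] = df[col].fillna(df[col].mean())
    # 'category' matches only columns of category dtype, not plain strings
    for col in df.select_dtypes('category'):
        df[col] = df[col].fillna(df[col].mode()[0])
    return df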
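
A caveat when combining select_dtypes with inplace=True: the selection is a new DataFrame, so filling it in place leaves df unchanged. Here is a minimal sketch with a made-up one-column frame. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [4.0, np.nan, 5.0]})

# Filling the selection in place changes only the temporary copy.
numeric_part = df.select_dtypes(np.number)
numeric_part.fillna(numeric_part.mean(), inplace=True)
still_missing = int(df['B'].isna().sum())   # df itself is unchanged

# Assigning the filled column back does update df.
df['B'] = df['B'].fillna(df['B'].mean())
now_missing = int(df['B'].isna().sum())

print(still_missing, now_missing)
```

Assigning the result back to the column, or calling fillna on df itself, avoids the problem.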

You can use the _get_numeric_data() method to get the numeric columns (and, by elimination, the categorical ones). Note that the leading underscore marks it as private pandas API; select_dtypes(include='number') is the public equivalent:

numerical_col = df._get_numeric_data().columns

At this point you only need one line of code, using apply to run a function over the columns:

fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)
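
The same one-liner can be written against the public API. The sketch below swaps in select_dtypes(include='number') for the private helper and runs on a small made-up frame. Both the swap and the sample data are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one text column and one numeric column.
df = pd.DataFrame({
    'A': ['a', None, 'c', 'c'],
    'B': [4.0, np.nan, 4.0, 7.0],
})

# Public counterpart of df._get_numeric_data().columns
numerical_col = df.select_dtypes(include='number').columns

# Mean for numeric columns, first mode for everything else.
fixed_df = df.apply(
    lambda col: col.fillna(col.mean()) if col.name in numerical_col
    else col.fillna(col.mode()[0]),
    axis=0,
)
print(fixed_df)
```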

You can work as follows (this assumes every non-numeric column uses the category dtype):

df = df.apply(lambda x: x.fillna(x.mode()[0]) if isinstance(x.dtype, pd.CategoricalDtype) else x.fillna(x.mean()))
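
A runnable sketch of this dtype-based apply, written with isinstance(x.dtype, pd.CategoricalDtype) as the category test. It assumes text columns are converted with astype('category') first, since plain string columns would otherwise fall through to the mean() branch. The sample frame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one text column and one numeric column.
df = pd.DataFrame({
    'A': ['a', None, 'c', 'c'],
    'B': [4.0, np.nan, 4.0, 7.0],
})
# Assumption: text columns are converted to the category dtype up front.
df['A'] = df['A'].astype('category')

# First mode for category columns, mean for the rest.
filled = df.apply(
    lambda x: x.fillna(x.mode()[0])
    if isinstance(x.dtype, pd.CategoricalDtype)
    else x.fillna(x.mean())
)
print(filled)
```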
