Set NaNs across multiple pandas DataFrames

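I have a number of similar DataFrames and want to standardize the NaNs across all of them. For example, if df1.loc[0, 'a'] is NaN, then the same index location in every other DataFrame should also be set to NaN.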

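I know I could combine the DataFrames into one big multi-indexed DataFrame, but sometimes I find it easier to work with a set of DataFrames that share the same structure.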

Here is an example:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), columns=['a', 'b', 'c'])

df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan

print(df1)
print(' ')
print(df2)
print(' ')
print(df3)

Output:

     a   b   c
0  0.0   1   2
1  3.0   4   5
2  6.0   7   8
3  NaN  10  11

   a     b   c
0  0   1.0   2
1  3   NaN   5
2  6   7.0   8
3  9  10.0  11

   a   b     c
0  0   1   NaN
1  3   4   5.0
2  6   7   8.0
3  9  10  11.0

However, I would like df1, df2 and df3 to have NaNs in the same locations:

print(df1)
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

Using the answer provided by piRSquared, I was able to extend it to DataFrames of different sizes. Here is the function:

def set_nans_over_every_df(df_list):
    # Find unique index and column values
    complete_index = sorted(set([idx for df in df_list for idx in df.index]))
    complete_columns = sorted(set([idx for df in df_list for idx in df.columns]))

    # Ensure that every df has the same indexes and columns
    df_list = [df.reindex(index=complete_index, columns=complete_columns) for df in df_list]

    # Find the nans in each df and set nans in every other df at the same location     
    mask = np.isnan(np.stack([df.values for df in df_list])).any(0)
    df_list = [df.mask(mask) for df in df_list]

    return df_list

And an example using DataFrames of different sizes:

df1 = pd.DataFrame(np.reshape(np.arange(15), (5,3)), index=[0,1,2,3,4], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.reshape(np.arange(12), (4,3)), index=[0,1,2,3], columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.reshape(np.arange(16), (4,4)), index=[0,1,2,3], columns=['a', 'b', 'c', 'd'])

df1.loc[3,'a'] = np.nan
df2.loc[1,'b'] = np.nan
df3.loc[0,'c'] = np.nan

df1, df2, df3 = set_nans_over_every_df([df1, df2, df3])

print(df1)

     a     b     c   d
0  0.0   1.0   NaN NaN
1  3.0   NaN   5.0 NaN
2  6.0   7.0   8.0 NaN
3  NaN  10.0  11.0 NaN
4  NaN   NaN   NaN NaN

You can create a mask and then apply it to all of the DataFrames:

mask = df1.notnull() & df2.notnull() & df3.notnull()
print (mask)
       a      b      c
0   True   True  False
1   True  False   True
2   True   True   True
3  False   True   True
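
As a minimal sketch of how such a mask might be applied: DataFrame.where keeps the values where the mask is True and puts NaN everywhere else. The frames below are built as floats (an assumption made here so that assigning NaN does not require a dtype change), and the reassignment back to df1, df2, df3 is illustrative rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# Float frames, so NaN can be stored without changing the column dtype
df1 = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=['a', 'b', 'c'])
df2 = df1.copy()
df3 = df1.copy()

df1.loc[3, 'a'] = np.nan
df2.loc[1, 'b'] = np.nan
df3.loc[0, 'c'] = np.nan

# True only where every frame has a value
mask = df1.notnull() & df2.notnull() & df3.notnull()

# where() keeps values where mask is True and sets NaN elsewhere
df1, df2, df3 = [df.where(mask) for df in (df1, df2, df3)]
```

Because where returns a new DataFrame, the results have to be assigned back if the originals should be replaced.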

You can also use reduce to build the mask dynamically:

import functools

masks = [df1.notnull(),df2.notnull(),df3.notnull()]
mask = functools.reduce(lambda x,y: x & y, masks)
print (mask)
       a      b      c
0   True   True  False
1   True  False   True
2   True   True   True
3  False   True   True

print (df1[mask])
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

print (df2[mask])
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

print (df3[mask])

     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

I set up a mask in numpy and then use that mask in the pd.DataFrame.mask method:

mask = np.isnan(np.stack([d.values for d in [df1, df2, df3]])).any(0)

print(df1.mask(mask))

     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

print(df2.mask(mask))

     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

print(df3.mask(mask))

     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
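
One caveat worth noting: np.isnan only accepts numeric arrays, so the stacked version above raises a TypeError if any frame holds object (for example string) columns. A possible variant, sketched here under the assumption that all frames share the same shape and labels, builds the boolean arrays with DataFrame.isna instead. The two small frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames mixing a numeric column with a string column
df1 = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', 'y']})
df2 = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', None]})
frames = [df1, df2]

# isna() works on object columns, where np.isnan would raise a TypeError
mask = np.stack([df.isna().to_numpy() for df in frames]).any(axis=0)

# mask() replaces the values at the True positions with a missing value
aligned = [df.mask(mask) for df in frames]
```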

Assuming all of your DataFrames have the same shape and the same index:

In [196]: df2[df1.isnull()] = df3[df1.isnull()] = np.nan

In [197]: df1[df3.isnull()] = df2[df3.isnull()] = np.nan

In [198]: df1[df2.isnull()] = df3[df2.isnull()] = np.nan

In [199]: df1
Out[199]:
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

In [200]: df2
Out[200]:
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0

In [201]: df3
Out[201]:
     a     b     c
0  0.0   1.0   NaN
1  3.0   NaN   5.0
2  6.0   7.0   8.0
3  NaN  10.0  11.0
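
The chained assignments above grow quickly as more frames are added. A rough generalization, assuming the frames are kept in a list (the name frames is introduced here for illustration) and share the same shape and index, could loop over the NaN masks instead:

```python
import numpy as np
import pandas as pd

base = np.arange(12, dtype=float).reshape(4, 3)
frames = [pd.DataFrame(base.copy(), columns=['a', 'b', 'c']) for _ in range(3)]

frames[0].loc[3, 'a'] = np.nan
frames[1].loc[1, 'b'] = np.nan
frames[2].loc[0, 'c'] = np.nan

# Record each frame's NaN positions before anything is modified
null_masks = [df.isnull() for df in frames]

# Copy every frame's NaN positions into every frame, in place
for null_mask in null_masks:
    for df in frames:
        df[null_mask] = np.nan
```

Like the original, this modifies the frames in place rather than returning new ones.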

A simple approach is to add the DataFrames together and multiply the result by 0, then add that DataFrame to each of the other DataFrames individually.

df_zero = (df1 + df2 + df3) * 0
df1 + df_zero
df2 + df_zero
df3 + df_zero
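
This works because NaN propagates through arithmetic: the sum is NaN wherever any frame is NaN, and NaN * 0 is still NaN, so df_zero is 0 where every frame has a value and NaN elsewhere. The lines above compute the results without storing them. A small sketch that assigns them back, assuming numeric frames with identical labels, might look like this:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=['a', 'b', 'c'])
df2 = df1.copy()
df3 = df1.copy()

df1.loc[3, 'a'] = np.nan
df2.loc[1, 'b'] = np.nan
df3.loc[0, 'c'] = np.nan

# 0 where all frames have a value, NaN where any frame is NaN
df_zero = (df1 + df2 + df3) * 0

# Adding df_zero leaves values unchanged and injects the shared NaNs
df1, df2, df3 = [df + df_zero for df in (df1, df2, df3)]
```

Note that inf * 0 is also NaN, so any infinite value in one frame would become NaN in all of them. The approach also only applies to numeric columns.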
