
Drop rows with NaNs based on combination of different columns subsets

I would like to drop all rows with NaN values based on combinations of column subsets. Let's demonstrate this on a simple example:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 3, 4], [1, 2, 3, 4], [np.nan, np.nan, 3, 4], [1, np.nan, np.nan, 4], [1, 2, np.nan, np.nan]],
    columns=["a1", "a2", "b1", "b2"],
)

print(df)
#    a1   a2   b1   b2
# 0  1.0  NaN  3.0  4.0
# 1  1.0  2.0  3.0  4.0
# 2  NaN  NaN  3.0  4.0
# 3  1.0  NaN  NaN  4.0
# 4  1.0  2.0  NaN  NaN

And I would like to drop rows where all features in either of the sets {a1, a2} or {b1, b2} are NaN. So the output would be (rows 2 and 4 dropped):

   a1   a2   b1   b2
0  1.0  NaN  3.0  4.0
1  1.0  2.0  3.0  4.0
3  1.0  NaN  NaN  4.0

I would ideally need some combination of df.dropna(how="all", subset=["a1", "a2"]) and df.dropna(how="all", subset=["b1", "b2"]). In this simple case that would not be much of a problem, but what about having e.g. 10 different subsets? (And in my real scenario it's almost 50.)
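For reference, a minimal sketch (an assumption about what "combination" means here, not part of the original question) of simply chaining one dropna call per subset; with dozens of subsets this chain becomes unwieldy:

# chain one dropna per subset -- works, but does not scale to ~50 subsets
df_clean = (
    df.dropna(how="all", subset=["a1", "a2"])
      .dropna(how="all", subset=["b1", "b2"])
)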

Is there any smart solution using pandas, or some filter, to combine these subsets and build the right condition for the dropna method?

Motivation: Just to give you an idea why I need something like this: I have different sets of features (a, b, ...) combined in a single DataFrame, and I need to handle those feature sets separately. Some NaNs are OK, but if a row is entirely NaN for any one feature set, it means a wrong measurement and I want to drop that row for all other feature sets as well (just imagine that the index is the time of measurement; if a single set of features is incorrect, I do not want to keep the row even if the other sets of features are fine).

Approach

For each subset in the list of predefined subsets you can test the columns of that subset for the presence of any non-NaN value along axis=1 to create a boolean mask, then reduce all the boolean masks corresponding to the subsets with np.logical_and to obtain a resulting boolean mask, which can then be used to filter the rows of the dataframe.

subs = [{'a1', 'a2'}, {'b1', 'b2'}]
# one mask per subset: True where the row has at least one non-NaN value in that subset
mask = np.logical_and.reduce([df[list(s)].notna().any(axis=1) for s in subs])

Result

>>> df[mask]

    a1   a2   b1   b2
0  1.0  NaN  3.0  4.0
1  1.0  2.0  3.0  4.0
3  1.0  NaN  NaN  4.0
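An equivalent, pandas-only way to combine the per-subset masks (a sketch, not part of the original answer) is to concatenate them column-wise and require all of them to hold per row:

# build one boolean Series per subset and combine with DataFrame.all;
# this is equivalent to np.logical_and.reduce on the list of masks
mask = pd.concat([df[list(s)].notna().any(axis=1) for s in subs], axis=1).all(axis=1)
result = df[mask]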

First test all values for not being missing, then group by the first letter of the column names with GroupBy.any to test whether each group contains at least one non-NaN value, and finally filter with DataFrame.all to get all rows where every group's mask is True:

print (df.columns.str[0])
Index(['a', 'a', 'b', 'b'], dtype='object')


print (df.notna().groupby(df.columns.str[0], axis=1).any())
       a      b
0   True   True
1   True   True
2  False   True
3   True   True
4   True  False

df = df[df.notna().groupby(df.columns.str[0], axis=1).any().all(axis=1)]
print (df)
    a1   a2   b1   b2
0  1.0  NaN  3.0  4.0
1  1.0  2.0  3.0  4.0
3  1.0  NaN  NaN  4.0
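Note (an addition, not part of the original answer): groupby(..., axis=1) is deprecated in recent pandas versions. Assuming the same column-naming scheme, an equivalent formulation works on the transposed notna frame:

# group the transposed notna() frame by the first letter of the column names,
# then require every feature group to have at least one non-NaN per original row
keep = df.notna().T.groupby(df.columns.str[0]).any().all(axis=0)
df = df[keep]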
