Drop rows with NaNs based on combinations of different column subsets
I would like to drop all rows with NaN values based on combinations of column subsets. Let's demonstrate this with a simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 3, 4], [1, 2, 3, 4], [np.nan, np.nan, 3, 4], [1, np.nan, np.nan, 4], [1, 2, np.nan, np.nan]],
    columns=["a1", "a2", "b1", "b2"],
)
print(df)
#     a1   a2   b1   b2
# 0  1.0  NaN  3.0  4.0
# 1  1.0  2.0  3.0  4.0
# 2  NaN  NaN  3.0  4.0
# 3  1.0  NaN  NaN  4.0
# 4  1.0  2.0  NaN  NaN
And I would like to drop rows where all features in either set {a1, a2} or {b1, b2} are NaN. So the output would be (rows 2 and 4 dropped):
    a1   a2   b1   b2
0  1.0  NaN  3.0  4.0
1  1.0  2.0  3.0  4.0
3  1.0  NaN  NaN  4.0
I would ideally need some combination of df.dropna(how="all", subset=["a1", "a2"]) and df.dropna(how="all", subset=["b1", "b2"]). In this simple case that would not be much of a problem, but what about having e.g. 10 different subsets? (And in my real scenario it's almost 50.)
Is there any smart solution using pandas, or any filter, to combine these subsets and build the right condition for the dropna method?
Motivation: Just to give you an idea why I need something like this: I have different sets of features (a, b, ...) that are combined in a single DataFrame, and I need to handle those feature sets separately. Some NaNs are fine, but if a row is all NaN for any feature set, it means a wrong measurement, and I want to drop that row for all other feature sets as well (just imagine that the index is the time of measurement; if a single set of features is incorrect, I do not want to keep the row even if the other sets of features are fine).
For each subset in the list of predefined subsets, you can test the columns of that subset for the presence of any non-NaN value along axis=1 to create a boolean mask. Then you can reduce all the boolean masks, one per subset, with np.logical_and to create a resulting boolean mask, which can then be used to filter the rows of the dataframe.
subs = [["a1", "a2"], ["b1", "b2"]]
mask = np.logical_and.reduce([df[s].notna().any(axis=1) for s in subs])
>>> df[mask]
    a1   a2   b1   b2
0  1.0  NaN  3.0  4.0
1  1.0  2.0  3.0  4.0
3  1.0  NaN  NaN  4.0
First test all values for non-missing values, then group by the first letter of the column names with GroupBy.any to test whether each group has any non-NaN value, and finally filter with DataFrame.all to get all rows matching the masks:
print (df.columns.str[0])
Index(['a', 'a', 'b', 'b'], dtype='object')
print (df.notna().groupby(df.columns.str[0], axis=1).any())
       a      b
0   True   True
1   True   True
2  False   True
3   True   True
4   True  False
df = df[df.notna().groupby(df.columns.str[0], axis=1).any().all(axis=1)]
print (df)
    a1   a2   b1   b2
0  1.0  NaN  3.0  4.0
1  1.0  2.0  3.0  4.0
3  1.0  NaN  NaN  4.0