Remove runs of fewer than K consecutive NaNs from a pandas DataFrame

Question

I am working with time series data. I am having trouble removing runs of consecutive NaNs whose length is less than or equal to a threshold from a DataFrame column. I tried looking at some related questions, such as:

Identifying consecutive NaN's with pandas: identifies where consecutive NaNs occur and how many there are.

Pandas: run length of NaN holes: outputs a run-length encoding of the NaNs.

There are many more along these lines, but none of them explains how to remove the NaNs once they have been identified.

I found one similar solution, but it is in R: How to remove more than 2 consecutive NA's in a column?

I want a solution in Python.

So here is the example:

Here is my dataframe column:

If k = 3, my output should be:

How can I go about removing runs of consecutive NaNs whose length is less than or equal to some threshold (k)?

Answer 1

There are a few ways, but this is how I've done it:

Determine groups of consecutive NaN and non-NaN values using a neat cumsum trick
Use groupby + transform to determine the size of each group
Identify groups of NaNs that are within the threshold
Filter them out with boolean indexing.

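k = 3
# True where the value is NaN
i = df.a.isnull()
# Label each run of equal flags, get each run's size, and flag NaN runs of length <= k
m = ~(df.groupby(i.ne(i.shift()).cumsum().values).a.transform('size').le(k) & i)

df[m]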

a
0   36.45
1   35.45
5   37.21
6   35.63
7   36.45
8   34.65
9   31.45
12  36.71
13  35.55
14    NaN
15    NaN
16    NaN
17    NaN
18  37.71

You can add a final df = df[m].reset_index(drop=True) step if you want a monotonically increasing integer index. Note that reset_index returns a new DataFrame, so its result has to be assigned.
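The one-line mask can be easier to follow when each step gets its own name. The following is a minimal sketch of the same idea, not a replacement for the answer's code. It assumes the input column can be reconstructed from the output shown in this answer: rows 2-4 and 10-11 are absent from the result, so they are taken to have been NaN. The names is_nan, run_id, run_len and keep are illustrative.

```python
import numpy as np
import pandas as pd

# Sample column reconstructed from the output in this answer (an assumption):
# rows missing from that output are taken to be NaN.
nan = np.nan
df = pd.DataFrame({"a": [36.45, 35.45, nan, nan, nan, 37.21, 35.63, 36.45,
                         34.65, 31.45, nan, nan, 36.71, 35.55, nan, nan,
                         nan, nan, 37.71]})
k = 3

# Step 1: flag NaN rows.
is_nan = df["a"].isna()

# Step 2: a new run starts wherever the flag differs from the previous row;
# the cumulative sum of those change points gives each run its own label.
run_id = is_nan.ne(is_nan.shift()).cumsum()

# Step 3: length of the run that each row belongs to.
run_len = df["a"].groupby(run_id).transform("size")

# Step 4: keep every row except NaN rows sitting in a run of length <= k.
keep = ~(is_nan & run_len.le(k))

result = df[keep]
print(result)
```

With this reconstructed input, the NaN runs of length 3 and 2 are dropped and the run of length 4 is kept, so the original index labels are preserved with gaps.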

Answer 2

You can create an indicator column to count the consecutive NaNs.

import pandas as pd

k = 3
# Each group is one non-NaN value followed by the run of NaNs after it
(
    df.groupby(pd.notna(df.a).cumsum())
    .apply(lambda x: x.dropna() if pd.isna(x.a).sum() <= k else x)
    .reset_index(drop=True)
)

Out[375]: 
        a
0   36.45
1   35.45
2   37.21
3   35.63
4   36.45
5   34.65
6   31.45
7   36.71
8   35.55
9     NaN
10    NaN
11    NaN
12    NaN
13  37.71
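As a quick sanity check, the boolean-mask approach and the groupby + apply approach can be run side by side. The sketch below uses a small toy column made up for illustration (it is not the asker's data), and drop_short_nan_runs is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def drop_short_nan_runs(df, col, k):
    """Hypothetical helper: drop NaN runs in `col` whose length is <= k."""
    is_nan = df[col].isna()
    run_id = is_nan.ne(is_nan.shift()).cumsum()
    run_len = df[col].groupby(run_id).transform("size")
    return df[~(is_nan & run_len.le(k))]

# Toy data (illustrative only): one NaN run of length 2, one of length 4.
nan = np.nan
df = pd.DataFrame({"a": [1.0, nan, nan, 2.0, nan, nan, nan, nan, 3.0]})
k = 3

# Mask-based version, with the index renumbered for comparison.
by_mask = drop_short_nan_runs(df, "a", k).reset_index(drop=True)

# groupby + apply version: each group holds one valid value and the
# NaNs that follow it, so the NaN count is the length of that run.
by_apply = (
    df.groupby(pd.notna(df.a).cumsum())
    .apply(lambda x: x.dropna() if pd.isna(x.a).sum() <= k else x)
    .reset_index(drop=True)
)

print(by_mask["a"].tolist())
print(by_apply["a"].tolist())
```

Both versions drop the run of two NaNs and keep the run of four on this single-column input. One difference to keep in mind: DataFrame.dropna() with default arguments drops rows that have a NaN in any column, so on a frame with several columns the apply version may remove more rows than the mask built from column a alone.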


Answer 1: score 3, accepted, 2018-02-15 05:23:16

Answer 2: score 0, 2018-02-15 05:44:04
