
Identifying consecutive NaNs with Pandas

I am reading in a bunch of CSV files (measurement data for water levels over time) to do various analyses and visualizations on them.

Due to various reasons beyond my control, these time series often have missing data, so I do two things:

I count them in total with

Rlength = len(RainD)   # Counts everything, including NaN
Rcount = RainD.count() # Counts only valid numbers
NaN_Number = Rlength - Rcount

and discard the dataset if I have more missing data than a certain threshold:

Percent_Data = Rlength/100
Five_Percent = Percent_Data*5
if NaN_Number > Five_Percent:
    ...
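Taken together, the two steps above amount to a simple missing-data ratio check. A minimal sketch, using a made-up series as a stand-in for the real `RainD` data (which isn't shown in the question):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the RainD data: 1 NaN out of 20 values
RainD = pd.Series([1.0, np.nan] + [2.0] * 18)

Rlength = len(RainD)    # counts everything, including NaN
Rcount = RainD.count()  # counts only valid numbers
NaN_Number = Rlength - Rcount

# same threshold test as above
Percent_Data = Rlength / 100
Five_Percent = Percent_Data * 5
discard = NaN_Number > Five_Percent  # equivalently: RainD.isna().mean() > 0.05
```

With 1 NaN in 20 values the series sits exactly at the 5 % threshold, so it is kept.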

If the number of NaNs is sufficiently small, I would like to fill the gaps with

RainD.level = RainD.level.fillna(method='pad', limit=2)

And now for the issue: it's monthly data, so if I have more than two consecutive NaNs, I also want to discard the data, since that would mean I'd be "guessing" a whole season, or even more.

The documentation for fillna doesn't really mention what happens when there are more consecutive NaNs than my specified limit=2 , but when I look at RainD.describe() before and after ...fillna... and compare it with the base CSV, it's clear that it fills the first two NaNs and then leaves the rest as they are, instead of erroring out.
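That behaviour is easy to confirm on a toy series (shown here with `.ffill(limit=2)`, the modern equivalent of `fillna(method='pad', limit=2)`):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# pads at most two NaNs per run and silently leaves the rest as NaN
filled = s.ffill(limit=2)
```

The first two NaNs of the run are forward-filled with 1.0, while the third stays NaN; no error is raised.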

So, long story short:

How do I identify a number of consecutive NaNs with Pandas, without some complicated and time-consuming non-Pandas loop?

You can use multiple boolean conditions to test whether the current value and the previous value are NaN :

In [3]:

df = pd.DataFrame({'a':[1,3,np.NaN, np.NaN, 4, np.NaN, 6,7,8]})
df
Out[3]:
    a
0   1
1   3
2 NaN
3 NaN
4   4
5 NaN
6   6
7   7
8   8
In [6]:

df[(df.a.isnull()) & (df.a.shift().isnull())]
Out[6]:
    a
3 NaN
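Chaining further `shift` calls extends the same idea to longer runs. A sketch reproducing the transcript above, which flags the second of two consecutive NaNs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, np.nan, np.nan, 4, np.nan, 6, 7, 8]})

# True where this value and the one before it are both NaN
two_in_a_row = df.a.isnull() & df.a.shift().isnull()
# add `& df.a.shift(2).isnull()` to require three in a row, and so on
```

Only row 3 matches, since the lone NaN at row 5 has a valid predecessor.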

If you wanted to find where runs of more than 2 consecutive NaNs occur, you could do the following:

In [38]:

df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
df
Out[38]:
     a
0    1
1    2
2  NaN
3  NaN
4  NaN
5    6
6    7
7    8
8    9
9   10
10 NaN
11 NaN
12  13
13  14

In [41]:

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
Out[41]:
a
1    0
2    3
3    0
4    0
5    0
6    0
7    2
8    0
9    0
Name: a, dtype: int32
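For the original question's rule — discard the dataset when any gap is longer than two months — the run lengths computed above can be tested directly. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10,
                         np.nan, np.nan, 13, 14]})

# length of each NaN run (groups with no NaNs contribute 0)
runs = df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()

# discard if any run of consecutive NaNs is longer than two
too_gappy = (runs > 2).any()
```

Here the longest run has length 3, so the dataset would be discarded.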

If you wish to map this back to the original index, or get a running count of consecutive NaNs, use Ed's answer with cumsum instead of sum . This is particularly useful for visualising NaN groups in time series:

df = pd.DataFrame({'a':[
    1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14
]})

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).cumsum()


0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     0
8     0
9     0
10    1
11    2
12    0
13    0
Name: a, dtype: int64

For example,

pd.concat([
        df,
        (
            df.a.isnull().astype(int)
            .groupby(df.a.notnull().astype(int).cumsum())
            .cumsum().to_frame('consec_count')
        )
    ],
    axis=1
)

    a       consec_count
0   1.0     0
1   2.0     0
2   NaN     1
3   NaN     2
4   NaN     3
5   6.0     0
6   7.0     0
7   8.0     0
8   9.0     0
9   10.0    0
10  NaN     1
11  NaN     2
12  13.0    0
13  14.0    0

If you just want to find the lengths of the consecutive NaNs ...

# usual imports
import pandas as pd
import numpy as np

# fake data
data = pd.Series([np.nan,1,1,1,1,1,np.nan,np.nan,np.nan,1,1,np.nan,np.nan])

# code 
na_groups = data.notna().cumsum()[data.isna()]
lengths_consecutive_na = na_groups.groupby(na_groups).agg(len)
longest_na_gap = lengths_consecutive_na.max()
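As a self-contained check (with a consistent variable name throughout): run on the fake data above, the gaps have lengths 1, 3, and 2, so the longest gap is 3.

```python
import numpy as np
import pandas as pd

data = pd.Series([np.nan, 1, 1, 1, 1, 1,
                  np.nan, np.nan, np.nan, 1, 1, np.nan, np.nan])

# label each NaN with the count of valid values seen so far,
# so NaNs in the same gap share one label
na_groups = data.notna().cumsum()[data.isna()]
lengths_consecutive_na = na_groups.groupby(na_groups).agg(len)
longest_na_gap = lengths_consecutive_na.max()
```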
