How to convert data to NaN based on consecutive and non-consecutive NaNs?

I have this df:

       CODE      DATE     TMAX  
0      000130 1991-01-01  32.6  
1      000130 1991-01-02  31.2  
2      000130 1991-01-03  32.0   
3      000130 1991-01-04  32.2  
4      000130 1991-01-05  30.5  
...      ...     ...       ... 
10865  000130 2020-12-31   NaN   
10866  000132 1991-01-01  35.2   
10867  000132 1991-01-02  34.6   
10868  000132 1991-01-03  35.8   
10869  000132 1991-01-04  34.8   
10870  000132 1991-01-05  34.8  
10871  000132 1991-01-06  34.8   
10872  000132 1991-01-07  34.8   
10873  000132 1991-01-08  34.8
...      ...     ...       ...  

I want to convert a month of TMAX data to NaN only if there are 5 or more consecutive NaN values in the month, or 11 or more non-consecutive NaN values in the month. Only one of the two conditions needs to be met for the month to be converted to NaN.

Example:

       CODE      DATE     TMAX  
0      000130 1991-02-01  NaN  
1      000130 1991-02-02  NaN  
2      000130 1991-02-03  NaN   
3      000130 1991-02-04  NaN  
4      000130 1991-02-05  NaN  
5      000130 1991-02-06  33.8   
6      000132 1991-02-07  35.2   
7      000132 1991-02-08  NaN   
8      000132 1991-02-09  NaN   
9      000132 1991-02-10  NaN   
10     000132 1991-02-11  NaN  
11     000132 1991-02-12  NaN   
12     000132 1991-02-13  NaN   
13     000132 1991-02-14  34.8
...    ...    ...         ...

Expected output:

       CODE      DATE     TMAX  
0      000130 1991-02-01  NaN  
1      000130 1991-02-02  NaN  
2      000130 1991-02-03  NaN   
3      000130 1991-02-04  NaN  
4      000130 1991-02-05  NaN  
5      000130 1991-02-06  NaN   
6      000132 1991-02-07  NaN   
7      000132 1991-02-08  NaN   
8      000132 1991-02-09  NaN   
9      000132 1991-02-10  NaN   
10     000132 1991-02-11  NaN  
11     000132 1991-02-12  NaN   
12     000132 1991-02-13  NaN   
13     000132 1991-02-14  NaN
...    ...    ...         ...

So I wrote this code:

s = df['TMAX'].isnull().groupby([df['CODE'], df['DATE'].astype('datetime64[M]')]).transform('sum')
df['TMAX'] = df['TMAX'].mask(s.ge(11))

But this code only converts a month of TMAX data to NaN when there are 11 or more non-consecutive NaNs in the month. I need both conditions. Could you help me?

Thanks in advance.

In my opinion, your code does not count consecutive values; it counts all NaNs per CODE and per month.

Counting consecutive values is more complicated:

print (df)
    CODE        DATE  TMAX
0    130  1991-02-01   NaN <- 5 consecutive NaN for 130 in 1991-02
1    130  1991-02-02   NaN
2    130  1991-02-03   NaN
3    130  1991-02-04   NaN
4    130  1991-02-05   NaN
5    130  1991-02-06  33.8
6    132  1991-02-07  35.2 <- fewer than 5 consecutive NaN for 132 in 1991-02
7    132  1991-02-08   NaN
8    132  1991-02-09   NaN
9    132  1991-02-10   NaN
10   132  1991-02-11   NaN
11   132  1991-02-12  34.8
12   132  1991-02-13   NaN
13   132  1991-02-14  34.8
14   133  1991-02-01   2.0 <- 12 consecutive non-NaN for 133 in 1991-02
15   133  1991-02-02   2.0
16   133  1991-02-03   2.0
17   133  1991-02-04   2.0
18   133  1991-02-05   2.0
19   133  1991-02-06  33.8
20   133  1991-02-07  35.2
21   133  1991-02-08   2.0
22   133  1991-02-09   2.0
23   133  1991-02-10   2.0
24   133  1991-02-11   2.0
25   133  1991-02-12   1.0
26   133  1991-02-13   NaN
27   133  1991-02-14  34.8

import pandas as pd

df['DATE'] = pd.to_datetime(df['DATE'])

notna = df['TMAX'].notna()

# label each run of consecutive NaNs (a) and of consecutive non-NaNs (b)
a = notna.cumsum().mask(notna)
b = (~notna).cumsum().mask(~notna)

y = df.DATE.dt.year
mth = df.DATE.dt.month

# length of each run within CODE, year and month
s1 = a.groupby([a, df['CODE'], y, mth]).transform('size')
s2 = b.groupby([b, df['CODE'], y, mth]).transform('size')

# flag the whole month if any of its rows belongs to a long enough run
mask = ((s1.ge(5) | s2.ge(11))
        .groupby([df['CODE'], y, mth])
        .transform('any'))

df['TMAX'] = df['TMAX'].mask(mask)

print (df)
    CODE       DATE  TMAX
0    130 1991-02-01   NaN
1    130 1991-02-02   NaN
2    130 1991-02-03   NaN
3    130 1991-02-04   NaN
4    130 1991-02-05   NaN
5    130 1991-02-06   NaN
6    132 1991-02-07  35.2 <- fewer than 5 consecutive NaN: no change
7    132 1991-02-08   NaN
8    132 1991-02-09   NaN
9    132 1991-02-10   NaN
10   132 1991-02-11   NaN
11   132 1991-02-12  34.8
12   132 1991-02-13   NaN
13   132 1991-02-14  34.8
14   133 1991-02-01   NaN
15   133 1991-02-02   NaN
16   133 1991-02-03   NaN
17   133 1991-02-04   NaN
18   133 1991-02-05   NaN
19   133 1991-02-06   NaN
20   133 1991-02-07   NaN
21   133 1991-02-08   NaN
22   133 1991-02-09   NaN
23   133 1991-02-10   NaN
24   133 1991-02-11   NaN
25   133 1991-02-12   NaN
26   133 1991-02-13   NaN
27   133 1991-02-14   NaN
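
The `cumsum().mask()` idiom used above can be hard to read at first, so here is a small sketch of what it produces on a toy Series. The Series and the variable names are made up for illustration; only the idiom itself comes from the answer. Note that the first label marks runs of NaNs, while the second marks runs of non-NaN values.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): two NaNs, two values, one NaN, one value.
s = pd.Series([np.nan, np.nan, 1.0, 2.0, np.nan, 3.0])
m = s.notna()

# m.cumsum() only increases on non-NaN rows, so it stays constant across a
# run of NaNs; masking the non-NaN rows leaves one label per NaN run.
nan_run_id = m.cumsum().mask(m)

# Same idea with the roles swapped: one label per run of non-NaN values.
value_run_id = (~m).cumsum().mask(~m)

# Length of the run each row belongs to (NaN where the row is not in such a run,
# because groupby drops NaN keys by default).
nan_run_len = nan_run_id.groupby(nan_run_id).transform('size')
value_run_len = value_run_id.groupby(value_run_id).transform('size')

print(pd.DataFrame({'s': s,
                    'nan_run_len': nan_run_len,
                    'value_run_len': value_run_len}))
```

Adding `CODE`, year and month to the grouping keys, as the answer does, simply cuts each run at station and month boundaries.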

Try the following, using groupby:

import pandas as pd

key = [df['CODE'], pd.to_datetime(df['DATE']).dt.to_period('M')]
isna = df['TMAX'].isna()
s = isna.groupby(key).transform('sum')  # NaNs per CODE and month
n = isna.groupby(key + [isna.ne(isna.shift()).cumsum()]).transform('sum')  # NaN run length
# mask the whole month when either threshold is reached
df['TMAX'] = df['TMAX'].mask((s.ge(11) | n.ge(5)).groupby(key).transform('any'))
print(df)

Output:

    CODE        DATE  TMAX
0    130  1991-02-01   NaN
1    130  1991-02-02   NaN
2    130  1991-02-03   NaN
3    130  1991-02-04   NaN
4    130  1991-02-05   NaN
5    130  1991-02-06   NaN
6    132  1991-02-07   NaN
7    132  1991-02-08   NaN
8    132  1991-02-09   NaN
9    132  1991-02-10   NaN
10   132  1991-02-11   NaN
11   132  1991-02-12   NaN
12   132  1991-02-13   NaN
13   132  1991-02-14   NaN
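
None of the examples on this page includes a month with 11 scattered NaNs, so the following is a minimal, self-contained sketch that exercises both rules as the question states them (a run of 5 or more NaNs, or 11 or more NaNs in the month). The helper name `mask_sparse_months`, its parameters and the three synthetic stations are assumptions made for illustration. It builds the month key with `dt.to_period('M')`, since `astype('datetime64[M]')` may not be accepted by newer pandas releases.

```python
import numpy as np
import pandas as pd

def mask_sparse_months(df, min_run=5, min_total=11):
    """Return a copy of df where TMAX is NaN for every (CODE, month) that has
    at least min_run consecutive NaNs or at least min_total NaNs overall."""
    out = df.copy()
    out['DATE'] = pd.to_datetime(out['DATE'])
    isna = out['TMAX'].isna()
    keys = [out['CODE'], out['DATE'].dt.to_period('M')]

    # Total number of NaNs in each (CODE, month).
    total = isna.groupby(keys).transform('sum')
    # A new run starts whenever the NaN flag changes; summing the flag inside
    # each run gives the run length for NaN runs and 0 for runs of values.
    run_id = isna.ne(isna.shift()).cumsum()
    run_len = isna.groupby(keys + [run_id]).transform('sum')

    # Flag the whole month if any of its rows meets either threshold.
    bad = (total.ge(min_total) | run_len.ge(min_run)).groupby(keys).transform('any')
    out['TMAX'] = out['TMAX'].mask(bad)
    return out

# Hypothetical test data: one month for three made-up stations.
dates = pd.date_range('1991-01-01', '1991-01-31')

def station(code, nan_days):
    tmax = np.full(len(dates), 30.0)
    tmax[np.array(nan_days) - 1] = np.nan
    return pd.DataFrame({'CODE': code, 'DATE': dates, 'TMAX': tmax})

demo = pd.concat([
    station('A', [1, 2, 3, 4, 5]),         # one run of 5 NaNs
    station('B', list(range(1, 22, 2))),   # 11 NaNs, never adjacent
    station('C', [1, 2, 3, 4, 10, 20]),    # run of 4, 6 NaNs in total
], ignore_index=True)

result = mask_sparse_months(demo)
counts = result.groupby('CODE')['TMAX'].count()  # non-NaN values left per station
print(counts)
```

Stations A and B each trip one rule and lose the whole month, while station C trips neither and keeps its 25 observed values.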
