How to convert data to NaN based on consecutive and non-consecutive NaNs?
I have this df:
CODE DATE TMAX
0 000130 1991-01-01 32.6
1 000130 1991-01-02 31.2
2 000130 1991-01-03 32.0
3 000130 1991-01-04 32.2
4 000130 1991-01-05 30.5
... ... ... ...
10865 000130 2020-12-31 NaN
10866 000132 1991-01-01 35.2
10867 000132 1991-01-02 34.6
10868 000132 1991-01-03 35.8
10869 000132 1991-01-04 34.8
10870 000132 1991-01-05 34.8
10871 000132 1991-01-06 34.8
10872 000132 1991-01-07 34.8
10873 000132 1991-01-08 34.8
... ... ... ...
I want to convert a month of TMAX data to NaN only if there are 5 or more consecutive NaN values in the month, or 11 or more NaN values in the month whether consecutive or not. Only one of the two conditions needs to be met to convert the month to NaN.
Example:
CODE DATE TMAX
0 000130 1991-02-01 NaN
1 000130 1991-02-02 NaN
2 000130 1991-02-03 NaN
3 000130 1991-02-04 NaN
4 000130 1991-02-05 NaN
5 000130 1991-02-06 33.8
6 000132 1991-02-07 35.2
7 000132 1991-02-08 NaN
8 000132 1991-02-09 NaN
9 000132 1991-02-10 NaN
10 000132 1991-02-11 NaN
11 000132 1991-02-12 NaN
12 000132 1991-02-13 NaN
13 000132 1991-02-14 34.8
... ... ... ...
Expected output:
CODE DATE TMAX
0 000130 1991-02-01 NaN
1 000130 1991-02-02 NaN
2 000130 1991-02-03 NaN
3 000130 1991-02-04 NaN
4 000130 1991-02-05 NaN
5 000130 1991-02-06 NaN
6 000132 1991-02-07 NaN
7 000132 1991-02-08 NaN
8 000132 1991-02-09 NaN
9 000132 1991-02-10 NaN
10 000132 1991-02-11 NaN
11 000132 1991-02-12 NaN
12 000132 1991-02-13 NaN
13 000132 1991-02-14 NaN
... ... ... ...
So I wrote this code:
s = df['TMAX'].isnull().groupby([df['CODE'], df['DATE'].astype('datetime64[M]')]).transform('sum')
df['TMAX'] = df['TMAX'].mask(s.ge(11))
But this code only converts a month of TMAX data to NaN when there are 11 or more NaNs in the month, consecutive or not. I need both conditions. Could you help me?
Thanks in advance.
In my opinion, your code does not count consecutive values; it counts all NaNs per CODE and per month.
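To make that concrete, here is a minimal sketch on made-up data. The station labels, the values, and the `to_period('M')` month key are assumptions for illustration and are not taken from the question. It suggests that the grouped sum of `isnull()` gives the same count whether the NaNs are adjacent or scattered:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two stations, each with 3 NaNs in the same month
df = pd.DataFrame({
    'CODE': ['A'] * 6 + ['B'] * 6,
    'DATE': list(pd.date_range('1991-02-01', periods=6)) * 2,
    'TMAX': [np.nan, np.nan, np.nan, 30.0, 31.0, 32.0,   # adjacent NaNs
             np.nan, 30.0, np.nan, 31.0, np.nan, 32.0],  # scattered NaNs
})

month = df['DATE'].dt.to_period('M')
# NaN count per CODE and month, broadcast back to every row
nan_count = df['TMAX'].isnull().groupby([df['CODE'], month]).transform('sum')
print(nan_count.groupby(df['CODE']).first())
```

Both stations report a count of 3, so this count alone cannot tell a run of consecutive NaNs from isolated ones.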
Handling consecutive values is more complicated:
print (df)
CODE DATE TMAX
0 130 1991-02-01 NaN < 5 consecutive NaN per 130 per 1991-02
1 130 1991-02-02 NaN
2 130 1991-02-03 NaN
3 130 1991-02-04 NaN
4 130 1991-02-05 NaN
5 130 1991-02-06 33.8
6 132 1991-02-07 35.2 < fewer than 5 consecutive NaN per 132 per 1991-02
7 132 1991-02-08 NaN
8 132 1991-02-09 NaN
9 132 1991-02-10 NaN
10 132 1991-02-11 NaN
11 132 1991-02-12 34.8
12 132 1991-02-13 NaN
13 132 1991-02-14 34.8
14 133 1991-02-01 2.0 < 12 consecutive non NaN per 133 per 1991-02
15 133 1991-02-02 2.0
16 133 1991-02-03 2.0
17 133 1991-02-04 2.0
18 133 1991-02-05 2.0
19 133 1991-02-06 33.8
20 133 1991-02-07 35.2
21 133 1991-02-08 2.0
22 133 1991-02-09 2.0
23 133 1991-02-10 2.0
24 133 1991-02-11 2.0
25 133 1991-02-12 1.0
26 133 1991-02-13 NaN
27 133 1991-02-14 34.8
import pandas as pd

df['DATE'] = pd.to_datetime(df['DATE'])
notna = df['TMAX'].notna()
# label runs of consecutive values: a for NaN runs, b for non-NaN runs
a = notna.cumsum().mask(notna)
b = (~notna).cumsum().mask(~notna)
y = df['DATE'].dt.year
mo = df['DATE'].dt.month
# size of each run, per CODE and month
s1 = a.groupby([a, df['CODE'], y, mo]).transform('size')
s2 = b.groupby([b, df['CODE'], y, mo]).transform('size')
# flag the whole month if at least one run matches:
# 5+ consecutive NaN (s1) or 11+ consecutive non-NaN (s2)
mask = ((s1.ge(5) | s2.ge(11))
        .groupby([df['CODE'], y, mo])
        .transform('any'))
df['TMAX'] = df['TMAX'].mask(mask)
print (df)
CODE DATE TMAX
0 130 1991-02-01 NaN
1 130 1991-02-02 NaN
2 130 1991-02-03 NaN
3 130 1991-02-04 NaN
4 130 1991-02-05 NaN
5 130 1991-02-06 NaN
6 132 1991-02-07 35.2 <- non consecutive - no change
7 132 1991-02-08 NaN
8 132 1991-02-09 NaN
9 132 1991-02-10 NaN
10 132 1991-02-11 NaN
11 132 1991-02-12 34.8
12 132 1991-02-13 NaN
13 132 1991-02-14 34.8
14 133 1991-02-01 NaN
15 133 1991-02-02 NaN
16 133 1991-02-03 NaN
17 133 1991-02-04 NaN
18 133 1991-02-05 NaN
19 133 1991-02-06 NaN
20 133 1991-02-07 NaN
21 133 1991-02-08 NaN
22 133 1991-02-09 NaN
23 133 1991-02-10 NaN
24 133 1991-02-11 NaN
25 133 1991-02-12 NaN
26 133 1991-02-13 NaN
27 133 1991-02-14 NaN
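Note that `s2.ge(11)` above tests for runs of 11 or more consecutive non-NaN values, which is why station 133 is blanked. If the second condition is instead read as the question states it (11 or more NaNs in the month, consecutive or not), a sketch along the following lines may be closer to what is wanted. The helper name `mask_bad_months`, its parameters, and the simplified toy data are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

def mask_bad_months(df, min_run=5, min_total=11):
    """Hypothetical helper: blank TMAX for every (CODE, month) that has at
    least `min_run` consecutive NaNs or at least `min_total` NaNs overall."""
    out = df.copy()
    out['DATE'] = pd.to_datetime(out['DATE'])
    keys = [out['CODE'], out['DATE'].dt.to_period('M')]
    isna = out['TMAX'].isna()

    # Total NaNs per CODE and month, consecutive or not
    total = isna.groupby(keys).transform('sum')

    # A new run starts whenever the NaN flag flips; grouping by the run id
    # together with CODE and month keeps runs from spilling across groups
    run_id = isna.ne(isna.shift()).cumsum()
    run_len = isna.groupby(keys + [run_id]).transform('size')
    long_nan_run = isna & run_len.ge(min_run)

    # One matching row is enough to flag the whole month
    bad = (total.ge(min_total) | long_nan_run).groupby(keys).transform('any')
    out['TMAX'] = out['TMAX'].mask(bad)
    return out

# Simplified version of the sample data above (assumed values)
df = pd.DataFrame({
    'CODE': [130] * 6 + [132] * 8 + [133] * 14,
    'DATE': (list(pd.date_range('1991-02-01', periods=6))
             + list(pd.date_range('1991-02-07', periods=8))
             + list(pd.date_range('1991-02-01', periods=14))),
    'TMAX': ([np.nan] * 5 + [33.8]
             + [35.2] + [np.nan] * 4 + [34.8, np.nan, 34.8]
             + [2.0] * 12 + [np.nan, 34.8]),
})
result = mask_bad_months(df)
# Non-NaN values remaining per station
print(result.groupby('CODE')['TMAX'].count())
```

Under this reading, station 130 is blanked (5 consecutive NaNs), while stations 132 and 133 are left unchanged because neither has a run of 5 NaNs or 11 NaNs in total.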
Try the following, using groupby:
month = pd.to_datetime(df['DATE']).dt.to_period('M')
isna = df['TMAX'].isna()
keys = [df['CODE'], month]
# s: NaN count per CODE and month; n: length of the NaN/non-NaN run each row belongs to
s = isna.groupby(keys).transform('sum')
n = isna.groupby(keys + [isna.ne(isna.shift()).cumsum()]).transform('size')
# blank the whole month if it has 11+ NaNs or a run of 5+ consecutive NaNs
df['TMAX'] = df['TMAX'].mask((s.ge(11) | (isna & n.ge(5))).groupby(keys).transform('any'))
print(df)
Output:
CODE DATE TMAX
0 130 1991-02-01 NaN
1 130 1991-02-02 NaN
2 130 1991-02-03 NaN
3 130 1991-02-04 NaN
4 130 1991-02-05 NaN
5 130 1991-02-06 NaN
6 132 1991-02-07 NaN
7 132 1991-02-08 NaN
8 132 1991-02-09 NaN
9 132 1991-02-10 NaN
10 132 1991-02-11 NaN
11 132 1991-02-12 NaN
12 132 1991-02-13 NaN
13 132 1991-02-14 NaN
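The run-labelling step is the least obvious part of this approach, so here is a small sketch of the idea on a made-up Series (the values are assumptions for illustration). A cumulative sum over "the flag changed" marks gives each run of consecutive equal flags its own id, and the group size is then the run length:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, np.nan, np.nan, 2.0])
isna = s.isna()

# The run id increases by one each time the NaN flag flips
run_id = isna.ne(isna.shift()).cumsum()
# Length of the run each element belongs to
run_len = isna.groupby(run_id).transform('size')
# Longest run of consecutive NaNs
longest_nan_run = int(run_len[isna].max())

print(run_id.tolist())    # [1, 1, 2, 3, 3, 3, 4]
print(run_len.tolist())   # [2, 2, 1, 3, 3, 3, 1]
print(longest_nan_run)    # 3
```

Restricting `run_len` to the rows where `isna` is true matters: without it, a long run of valid readings would be counted the same way as a long run of NaNs.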