Counting consecutive values with a certain number of NaN acceptable

There are several great answers for counting consecutive values that meet a condition, but I can't seem to find one that also permits a certain number of NaN.

For example, take the following DataFrame:

Date           Val1
1900-01-01     NaN
1900-01-02     10
1900-01-03     11
1900-01-04     13
1900-01-05     NaN
1900-01-06     NaN
1900-01-07     17
1900-01-08     2
1900-01-09     NaN
1900-01-10     NaN
1900-01-11     2
1900-01-12     5
1900-01-13     6
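
For anyone who wants to reproduce this, the frame above can be built roughly as follows. Storing Date as a datetime column is an assumption; the question does not say whether the dates are strings or timestamps.

```python
import numpy as np
import pandas as pd

# Sample data from the question; datetime dtype for Date is an assumption
df = pd.DataFrame({
    "Date": pd.date_range("1900-01-01", periods=13, freq="D"),
    "Val1": [np.nan, 10, 11, 13, np.nan, np.nan, 17, 2,
             np.nan, np.nan, 2, 5, 6],
})
print(df)
```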

Ideally, I want to count runs of values that meet a condition while allowing a certain number of NaNs inside the run. I can get the counts and run lengths for values, but how can I allow a certain number of NaNs to be counted in the run?

In the above DataFrame, if we permitted two NaNs and wanted values of 10 or above, the run would start at 1900-01-01 and end at 1900-01-07, producing:

Date           Run length
1900-01-01     7

Note that the run length is 7 because the first NaN is counted in the run.

I've tried creating two separate columns, one counting the length of the runs of valid values and one counting the length of the runs of NaNs, but I'm unsure how to proceed. I know I can do it with pandas and I must be close, but I'm totally lost near the finish line!

Find where 'Val1' is not null. Use that to locate the consecutive groups of NaNs, but first mask the original DataFrame so that only NaN rows are counted.

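# True where Val1 holds a real value
m = df['Val1'].notnull()
# Count the NaN rows in each block delimited by valid values; flag blocks with at most 2
s1 = df.where(~m).groupby(m.cumsum())['Date'].transform('count').le(2)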
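
To make the grouping step easier to follow, here is a small sketch that exposes the intermediate values on the sample data. It counts the NaNs per block by summing `~m` instead of masking the frame with `where`; that is a swapped-in formulation that should give the same counts, not the code above verbatim. The names `block_id` and `nan_count` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("1900-01-01", periods=13, freq="D"),
    "Val1": [np.nan, 10, 11, 13, np.nan, np.nan, 17, 2,
             np.nan, np.nan, 2, 5, 6],
})

m = df["Val1"].notnull()

# Every valid value opens a new block; NaNs inherit the id of the value before them
block_id = m.cumsum()

# Number of NaN rows in each block, broadcast back to every row of that block
nan_count = (~m).astype(int).groupby(block_id).transform("sum")

# Blocks holding at most 2 NaNs are acceptable gaps
s1 = nan_count.le(2)

print(pd.DataFrame({"block_id": block_id, "nan_count": nan_count, "nan_ok": s1 & ~m}))
```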

Together these two masks can be used to indicate True for runs of 2 or fewer consecutive NaNs:

(s1 & ~m)

0      True
1     False
2     False
3     False
4      True
5      True
6     False
7     False
8      True
9      True
10    False
11    False
12    False
dtype: bool

Combine that with your condition of >=10:

gps = (s1 & ~m) | df['Val1'].ge(10)

Use this Series to group. Use where + dropna to get rid of all of the groups formed by rows that do not meet the condition.

res = (df.where(gps).dropna(subset=['Date'])
         .groupby((~gps).cumsum())
         .agg(['first', 'count']))

#         Date        Val1      
#        first count first count
#0  1900-01-01     7  10.0     4
#1  1900-01-09     2   NaN     0
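
If it helps to see why `(~gps).cumsum()` works as a group key, the sketch below prints the mask and the resulting labels for the sample data. Every row that fails the condition bumps the counter, so consecutive passing rows share one label. The NaN mask is computed here by summing `~m` per block, which is an equivalent shortcut rather than the exact expression above, and `run_id` is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("1900-01-01", periods=13, freq="D"),
    "Val1": [np.nan, 10, 11, 13, np.nan, np.nan, 17, 2,
             np.nan, np.nan, 2, 5, 6],
})

m = df["Val1"].notnull()
s1 = (~m).astype(int).groupby(m.cumsum()).transform("sum").le(2)

# Rows that may belong to a run: short NaN gaps or values >= 10
gps = (s1 & ~m) | df["Val1"].ge(10)

# Each failing row increments the label, so passing rows in between share one id
run_id = (~gps).cumsum()

print(pd.DataFrame({"Val1": df["Val1"], "gps": gps, "run_id": run_id}))
```

Rows 8 and 9 carry the same label as the failing row 7, but row 7 is dropped before grouping, which is why the second group in the output above contains only the two NaN rows.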

Finally, let's remove those groups that consist only of consecutive NaNs:

res = res.loc[res[('Val1', 'count')].ne(0), 'Date']

#        first  count
#0  1900-01-01      7
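
For reuse, the steps could be wrapped into a single function along these lines. This is only a sketch: the function name `runs_with_nan_gaps` and its parameters (`col`, `date_col`, `threshold`, `max_nans`) are hypothetical, and it assumes the date column has no missing entries. Like the masks above, it treats `max_nans` as a limit on each consecutive NaN gap, not on the total number of NaNs in a run.

```python
import numpy as np
import pandas as pd

def runs_with_nan_gaps(df, col="Val1", date_col="Date", threshold=10, max_nans=2):
    """Hypothetical helper: runs of col >= threshold, tolerating short NaN gaps."""
    m = df[col].notnull()
    # NaN rows per block delimited by valid values
    nan_count = (~m).astype(int).groupby(m.cumsum()).transform("sum")
    # A row may join a run if it is a tolerated NaN or a value meeting the threshold
    ok = (nan_count.le(max_nans) & ~m) | df[col].ge(threshold)
    run_id = (~ok).cumsum().rename("run_id")
    out = df[ok].groupby(run_id[ok]).agg(
        start=(date_col, "first"),
        end=(date_col, "last"),
        run_length=(date_col, "count"),
        n_valid=(col, "count"),
    )
    # Drop runs made up solely of NaNs
    return out[out["n_valid"].gt(0)].drop(columns="n_valid").reset_index(drop=True)

df = pd.DataFrame({
    "Date": pd.date_range("1900-01-01", periods=13, freq="D"),
    "Val1": [np.nan, 10, 11, 13, np.nan, np.nan, 17, 2,
             np.nan, np.nan, 2, 5, 6],
})
res = runs_with_nan_gaps(df)
print(res)
```

With the sample data this should leave a single run starting on 1900-01-01 with length 7, matching the result above.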
