
Find consecutive values in rows in pandas Dataframe based on condition

I was looking at this question: How can I find 5 consecutive rows in pandas Dataframe where a value of a certain column is at least 0.5, which is similar to the one I have in mind. I would like to find, say, at least 3 consecutive rows where a value is less than 0.5 (but neither negative nor NaN), while considering the entire dataframe and not just one column as in the question linked above. Here is a sample dataframe:

import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=10, freq="M")

df = pd.DataFrame(
    {
        'A': [0, 0.4, 0.5, 0.3, 0, 0, 0, 0, 0, 0],
        'B': [0, 0.6, 0.8, 0, 0.3, 0.3, 0.9, 0.7, 0, 0],
        'C': [0, 0, 0.5, 0.4, 0.4, 0.2, 0, 0, 0, 0],
        'D': [0.4, 0, 0.6, 0.5, 0.7, 0.2, 0, 0.9, 0.8, 0],
        'E': [0.4, 0.3, 0.2, 0.7, 0.7, 0.8, 0, 0, 0, 0],
        'F': [0, 0, 0.6, 0.7, 0.8, 0.3, 0.4, 0.1, 0, 0]
    },
    index=idx
)

df = df.replace({0:np.nan})

df

Columns B and D don't satisfy the criteria, so they should be removed from the output.

I'd prefer not to use a for loop or the like, since it is a 2000-column df, so I tried the following:

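def consecutive_values_in_range(s, lower, upper):
    # between() is inclusive on both ends by default; NaN yields False
    return s.between(left=lower, right=upper)

# named lower/upper to avoid shadowing the built-in min and max
lower, upper = 0, 0.5

# apply() returns a new boolean frame; it is not assigned, so df is unchanged
df.apply(lambda col: consecutive_values_in_range(col, lower, upper), axis=0)

print(df)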

But I didn't obtain what I was looking for, which would be something like this:

              A    C    E    F
2018-01-31  NaN  NaN  0.4  NaN
2018-02-28  0.4  NaN  0.3  NaN
2018-03-31  0.5  0.5  0.2  0.6
2018-04-30  0.3  0.4  0.7  0.7
2018-05-31  NaN  0.4  0.7  0.8
2018-06-30  NaN  0.2  0.8  0.3
2018-07-31  NaN  NaN  NaN  0.4
2018-08-31  NaN  NaN  NaN  0.1
2018-09-30  NaN  NaN  NaN  NaN
2018-10-31  NaN  NaN  NaN  NaN

Any suggestions? Thanks in advance.

lower, upper = 0, 0.5
n = 3
df.loc[:, ((df <= upper) & (df >= lower)).rolling(n).sum().eq(n).any()]
  • get an is_between mask over df
  • get the rolling sum of this mask per column, with a window size of 3
  • since True == 1 and False == 0, a sum of 3 at any point implies 3 consecutive True's, i.e., three consecutive values with 0 <= val <= 0.5 in that column
  • so check equality against 3 and see if there is any such point in a column
  • lastly, index the columns with the resulting True/False mask

to get

              A    C    E    F
2018-01-31  NaN  NaN  0.4  NaN
2018-02-28  0.4  NaN  0.3  NaN
2018-03-31  0.5  0.5  0.2  0.6
2018-04-30  0.3  0.4  0.7  0.7
2018-05-31  NaN  0.4  0.7  0.8
2018-06-30  NaN  0.2  0.8  0.3
2018-07-31  NaN  NaN  NaN  0.4
2018-08-31  NaN  NaN  NaN  0.1
2018-09-30  NaN  NaN  NaN  NaN
2018-10-31  NaN  NaN  NaN  NaN
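
The steps listed above can also be spelled out with intermediate variables, which may make the one-liner easier to inspect. This is only a sketch under a few assumptions: the names in_range, run_count, has_run and strict are made up for illustration, a default integer index replaces the dates (they play no role in the column selection), and the mask is cast to int before rolling purely to make the counting explicit.

```python
import numpy as np
import pandas as pd

# Same values as the question's frame; zeros become NaN as in the question.
df = pd.DataFrame(
    {
        "A": [0, 0.4, 0.5, 0.3, 0, 0, 0, 0, 0, 0],
        "B": [0, 0.6, 0.8, 0, 0.3, 0.3, 0.9, 0.7, 0, 0],
        "C": [0, 0, 0.5, 0.4, 0.4, 0.2, 0, 0, 0, 0],
        "D": [0.4, 0, 0.6, 0.5, 0.7, 0.2, 0, 0.9, 0.8, 0],
        "E": [0.4, 0.3, 0.2, 0.7, 0.7, 0.8, 0, 0, 0, 0],
        "F": [0, 0, 0.6, 0.7, 0.8, 0.3, 0.4, 0.1, 0, 0],
    }
).replace({0: np.nan})

lower, upper, n = 0, 0.5, 3

# Step 1: True where lower <= value <= upper; comparisons with NaN give False.
in_range = (df >= lower) & (df <= upper)

# Step 2: per-column count of True values in each window of n consecutive rows.
run_count = in_range.astype(int).rolling(n).sum()

# Step 3: a count equal to n means all n rows of that window were in range.
has_run = run_count.eq(n).any()

# Step 4: keep only the columns that contain at least one such run.
result = df.loc[:, has_run]
kept_columns = list(result.columns)

# Variant (assumption): exclude the upper bound, i.e. strictly below 0.5.
strict = (df >= lower) & (df < upper)
strict_columns = list(df.columns[strict.astype(int).rolling(n).sum().eq(n).any()])
```

With the inclusive bounds, kept_columns is ['A', 'C', 'E', 'F'], matching the output shown above. Column A only qualifies through its 0.4, 0.5, 0.3 run, so with the strict variant strict_columns is ['C', 'E', 'F']; which of the two is wanted depends on whether 0.5 itself should count.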
