pandas DataFrame - how to find consecutive rows that meet some conditions?

Question

I'm trying to write a program that finds consecutive rows that meet some conditions. For example, given a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame([1,1,2,-13,-4,-5,6,17,8,9,-10,-11,-12,-13,-14,15], 
            index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
            columns=['value'])

>>> df
    value
0       1
1       1
2       2
3     -13
4      -4
5      -5
6       6
7      17
8       8
9       9
10    -10
11    -11
12    -12
13    -13
14    -14
15     15

I want it to return a dataframe that shows the rows meeting the conditions below:

1) the order has to be (positive rows) followed by (negative rows), not the other way around.

2) each positive or negative group of rows has to have at least 3 rows

3) positive and negative groups have to be adjacent to each other

          posIdx     negIdx   posLength  negLength
0              2          3           3          3    # (1,1,2) (-13,-4,-5)
1              9         10           4          5    # (6,17,8,9) (-10,-11,-12,-13,-14)

Are there any simple ways to do this using Python or pandas commands?

Answer 1

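I create helper columns so the solution is easy to verify. Zero is treated as positive here, because only values below 0 are labelled 'neg'. The printed results in this answer appear to come from a run in which index 14 held 14 (positive), so the second negative run has 4 rows rather than 5: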

import numpy as np

# label each row as negative or positive (zero counts as 'pos')
df['sign'] = np.where(df['value'] < 0, 'neg','pos')
# number the consecutive runs of equal sign
df['g'] = df['sign'].ne(df['sign'].shift()).cumsum()

# keep only the runs with more than 2 rows
df = df[df['g'].map(df['g'].value_counts()).gt(2)]

# keep only the runs that take part in a `pos` -> `neg` transition
m1 = df['sign'].eq('pos') & df['sign'].shift(-1).eq('neg')
m2 = df['sign'].eq('neg') & df['sign'].shift().eq('pos')
groups = df.loc[m1 | m2, 'g']
df = df[df['g'].isin(groups)].copy()

# number each pos/neg pair: the counter increases where a new `pos` run starts
df['pairs'] = (df['sign'].ne(df['sign'].shift()) & df['sign'].eq('pos')).cumsum()
print (df)
    value sign  g  pairs
0       1  pos  1      1
1       1  pos  1      1
2       2  pos  1      1
3     -13  neg  2      1
4      -4  neg  2      1
5      -5  neg  2      1
6       6  pos  3      2
7      17  pos  3      2
8       8  pos  3      2
9       9  pos  3      2
10    -10  neg  4      2
11    -11  neg  4      2
12    -12  neg  4      2
13    -13  neg  4      2
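
The run-numbering idiom behind the `g` column can be looked at in isolation. The following is only a minimal sketch on a shortened, made-up series (the names `values`, `is_neg`, `run_id` and `run_len` are illustrative and not part of the answer): a new run starts wherever the sign differs from the previous row, and a cumulative sum of those "start" flags numbers the runs.

```python
import pandas as pd

# Shortened illustrative data, not the full frame from the question.
values = pd.Series([1, 1, 2, -13, -4, -5, 6])

# True for negative rows; zero would count as non-negative.
is_neg = values.lt(0)

# A row starts a new run when its sign differs from the previous row's.
# The first row always differs from the missing predecessor, so it starts run 1.
run_id = is_neg.ne(is_neg.shift()).cumsum()

# Length of the run each row belongs to, broadcast back onto the rows.
run_len = run_id.map(run_id.value_counts())

print(run_id.tolist())   # [1, 1, 1, 2, 2, 2, 3]
print(run_len.tolist())  # [3, 3, 3, 3, 3, 3, 1]
```

Filtering with `run_len.gt(2)` would then drop the trailing single-row run, which is what the `value_counts` step in the answer does.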

Finally, aggregate each group with GroupBy.first (first index of the group) and GroupBy.size (row count) using named aggregation (pandas 0.25+), unstack and sort the columns, flatten the MultiIndex, and then correct Idx_pos so that it points at the last positive row by subtracting 1 from Idx_neg:

df1 = (df.reset_index()
         .groupby(['pairs','g', 'sign'])
         .agg(Idx=('index','first'),  Length=('sign','size'))
         .reset_index(level=1, drop=True)
         .unstack()
         .sort_index(axis=1, level=[0,1], ascending=[True, False])
         )
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1['Idx_pos'] = df1['Idx_neg'] - 1
print (df1)
       Idx_pos  Idx_neg  Length_pos  Length_neg
pairs                                          
1            2        3           3           3
2            9       10           4           4
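
For comparison, the same idea can be packed into one function that summarises every run first and then walks over neighbouring runs. This is a hedged sketch rather than the answer's method: `find_pos_neg_pairs` and `min_len` are hypothetical names, the function reports row positions (which equal the index labels only for a default integer index), and it checks adjacency on the original runs instead of on the filtered frame.

```python
import pandas as pd

def find_pos_neg_pairs(values, min_len=3):
    """Hypothetical helper: adjacent (positive run, negative run) pairs."""
    s = pd.Series(list(values))
    is_neg = s.lt(0)                      # zero counts as positive
    run_id = is_neg.ne(is_neg.shift()).cumsum()
    position = pd.Series(range(len(s)))   # row positions 0..n-1

    # One entry per run: its sign, first position and length.
    neg = is_neg.groupby(run_id).first().tolist()
    start = position.groupby(run_id).first().tolist()
    length = run_id.groupby(run_id).size().tolist()

    rows = []
    for i in range(len(neg) - 1):
        # Keep a pair only if a long-enough positive run is
        # immediately followed by a long-enough negative run.
        if (not neg[i] and neg[i + 1]
                and length[i] >= min_len and length[i + 1] >= min_len):
            rows.append({
                "posIdx": start[i + 1] - 1,   # last row of the positive run
                "negIdx": start[i + 1],       # first row of the negative run
                "posLength": length[i],
                "negLength": length[i + 1],
            })
    return pd.DataFrame(
        rows, columns=["posIdx", "negIdx", "posLength", "negLength"]
    )

data = [1, 1, 2, -13, -4, -5, 6, 17, 8, 9, -10, -11, -12, -13, -14, 15]
result = find_pos_neg_pairs(data)
print(result)
```

With -14 at position 14, as in the frame printed in the question, this yields the two rows (2, 3, 3, 3) and (9, 10, 4, 5).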

Answer 2

This is just an alternative; I did not benchmark its speed.

First, create a 'sign' column indicating whether a number is positive or negative.

Second, create a 'check' column to mark the rows where the sign changes. A value of -1 marks a change from positive to negative; +1 marks the reverse.

Next, get the indices where check is -1 (neg_ids) and +1 (pos_ids). I use functions from more-itertools to interleave neg_ids and pos_ids. The aim is to get the chunks of rows that are entirely positive or entirely negative.
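
To make the 'sign' and 'check' columns concrete, here is a small sketch on a shortened, made-up frame (the data is illustrative; the column names follow the answer). One detail worth noticing is that `fillna(0)` treats the missing predecessor of the first row as negative, so row 0 is flagged with +1 only because the data starts with a positive value.

```python
import numpy as np
import pandas as pd

# Shortened illustrative data, not the full frame from the question.
df = pd.DataFrame({"value": [1, 1, 2, -13, -4, -5, 6]})

# 1 for non-negative values, 0 for negative values.
df["sign"] = np.where(df["value"].lt(0), 0, 1)

# Difference to the previous row's sign:
# +1 where a positive run starts, -1 where a negative run starts, 0 otherwise.
df["check"] = df["sign"].sub(df["sign"].shift().fillna(0))

neg_ids = df.index[df["check"].eq(-1)].tolist()
pos_ids = df.index[df["check"].eq(1)].tolist()

print(df["check"].tolist())  # [1.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0]
print(pos_ids, neg_ids)      # [0, 6] [3]
```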

The next phase is a for loop that applies iloc to each tuple in the outcome variable and checks whether all the values in the 'value' column are positive or negative. Depending on the sign, the results are appended to keys in the defaultdict K. Note that posIdx is the last row of a wholly positive chunk, while negIdx is the first row of a negative chunk. iloc[start:end] selects rows start through end-1, so posIdx is end-1, while negIdx is simply start.
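
The half-open behaviour of iloc is what makes `end-1` the last positive row. The sketch below illustrates it on a shortened, made-up frame; the `boundaries` list is written out by hand here and stands in for the run starts that the answer derives from pos_ids and neg_ids, and `summary` is an illustrative name.

```python
import pandas as pd

# Shortened illustrative data, not the full frame from the question.
df = pd.DataFrame({"value": [1, 1, 2, -13, -4, -5, 6]})

# Hand-written run starts (assumed for illustration).
boundaries = [0, 3, 6]
# Pair neighbouring starts into (start, end) chunks: [(0, 3), (3, 6)].
chunks = list(zip(boundaries, boundaries[1:]))

summary = []
for start, end in chunks:
    # iloc[start:end] covers rows start..end-1; row `end` is excluded.
    chunk = df.iloc[start:end, 0]
    if chunk.ge(0).all():
        summary.append(("pos", end - 1, len(chunk)))   # last row of the chunk
    elif chunk.lt(0).all():
        summary.append(("neg", start, len(chunk)))     # first row of the chunk

print(summary)  # [('pos', 2, 3), ('neg', 3, 3)]
```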

The last phase is to read the data into a dataframe:

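import numpy as np
import pandas as pd
from more_itertools import interleave_longest, windowed

df = pd.DataFrame([1,1,2,-13,-4,-5,6,17,8,9,-10,-11,-12,-13,-14,15], 
        index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
        columns=['value'])

# 1 for non-negative values, 0 for negative values
df['sign'] = np.where(df.value.lt(0),0,1)
# +1 where a positive run starts, -1 where a negative run starts
df['check'] = df.sign.sub(df.sign.shift().fillna(0))

neg_ids = df.loc[df.check==-1].index.tolist()
pos_ids = df.loc[df.check==1].index.tolist()

# merge the run starts in order, then pair neighbours into (start, end) chunks
outcome = list(interleave_longest(pos_ids,neg_ids))
outcome = list(windowed(outcome,2))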

print(outcome)

[(0, 3), (3, 6), (6, 10), (10, 15)]

from collections import defaultdict

K = defaultdict(list)

for start, end in outcome:
    # rows start..end-1 of the 'value' column (iloc excludes `end`)
    checker = df.iloc[start:end,0]
    if checker.ge(0).all() and checker.shape[0]>2:
        K['posIdx'].append(end-1)
        K['posLength'].append(checker.shape[0])
    elif checker.lt(0).all() and checker.shape[0]>2:
        K['negIdx'].append(start)
        K['negLength'].append(checker.shape[0])

pd.DataFrame(K)

   posIdx   posLength   negIdx  negLength
0     2        3          3         3
1     9        4          10        5

2 answers

Answer 1: score 5, accepted, 2020-02-21 08:51:29

Answer 2: score 0, 2020-02-21 10:47:53
