pandas dataframe - How to find consecutive rows that meet some conditions?

I'm trying to write a program that finds consecutive rows that meet some conditions. For example, given a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame([1,1,2,-13,-4,-5,6,17,8,9,-10,-11,-12,-13,-14,15], 
            index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
            columns=['value'])

>>> df
    value
0       1
1       1
2       2
3     -13
4      -4
5      -5
6       6
7      17
8       8
9       9
10    -10
11    -11
12    -12
13    -13
14    -14
15     15

I want it to return a dataframe listing the rows that meet the conditions below:

1) the order has to be (positive rows) followed by (negative rows), not the other way around.

2) each positive or negative group of rows has to have at least 3 rows.

3) the positive and negative groups have to be adjacent to each other.

   posIdx  negIdx  posLength  negLength
0       2       3          3          3    # (1,1,2) (-13,-4,-5)
1       9      10          4          5    # (6,17,8,9) (-10,-11,-12,-13,-14)

Are there any simple ways to do this using Python or pandas commands?

I create helper columns so that the solution is easy to verify:

import numpy as np

# label each row as negative or positive
df['sign'] = np.where(df['value'] < 0, 'neg','pos')
# number the consecutive runs of equal sign
df['g'] = df['sign'].ne(df['sign'].shift()).cumsum()

# keep only the groups with at least 3 rows
df = df[df['g'].map(df['g'].value_counts()).gt(2)]

# keep only the groups that form a `pos`-`neg` pair, in that order
m1 = df['sign'].eq('pos') & df['sign'].shift(-1).eq('neg')
m2 = df['sign'].eq('neg') & df['sign'].shift().eq('pos')
groups = df.loc[m1 | m2, 'g']
df = df[df['g'].isin(groups)].copy()

# number each pos-neg pair; a new pair starts where a positive run starts
df['pairs'] = (df['sign'].ne(df['sign'].shift()) & df['sign'].eq('pos')).cumsum()
print (df)
    value sign  g  pairs
0       1  pos  1      1
1       1  pos  1      1
2       2  pos  1      1
3     -13  neg  2      1
4      -4  neg  2      1
5      -5  neg  2      1
6       6  pos  3      2
7      17  pos  3      2
8       8  pos  3      2
9       9  pos  3      2
10    -10  neg  4      2
11    -11  neg  4      2
12    -12  neg  4      2
13    -13  neg  4      2
14    -14  neg  4      2

Finally, aggregate each group: take the first index with GroupBy.first and the row count with GroupBy.size via named aggregation (pandas 0.25+), sort the columns, flatten the MultiIndex, and then set Idx_pos to the last row of the positive group, which is Idx_neg minus 1:

df1 = (df.reset_index()
         .groupby(['pairs','g', 'sign'])
         .agg(Idx=('index','first'),  Length=('sign','size'))
         .reset_index(level=1, drop=True)
         .unstack()
         .sort_index(axis=1, level=[0,1], ascending=[True, False])
         )
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1['Idx_pos'] = df1['Idx_neg'] - 1
print (df1)
       Idx_pos  Idx_neg  Length_pos  Length_neg
pairs                                          
1            2        3           3           3
2            9       10           4           5
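
The same table can also be built by summarising each run first and then pairing every run with the one that follows it. The sketch below is a variant, not a restatement of the code above: `find_pos_neg_pairs` and its parameters `col` and `min_len` are hypothetical names, zero is treated as positive (as in the `np.where` call above), and adjacency is checked on the original runs before any short run is discarded.

```python
import numpy as np
import pandas as pd


def find_pos_neg_pairs(df, col="value", min_len=3):
    """Hypothetical helper: one output row per adjacent (pos run, neg run)."""
    # Label rows and number the consecutive runs of equal sign
    sign = pd.Series(np.where(df[col] < 0, "neg", "pos"))
    run_id = sign.ne(sign.shift()).cumsum()

    # One row per run: its sign, first/last index label and length
    frame = pd.DataFrame({"idx": df.index.to_numpy(), "sign": sign, "run": run_id})
    runs = frame.groupby("run").agg(
        sign=("sign", "first"),
        start=("idx", "first"),
        end=("idx", "last"),
        length=("idx", "size"),
    )

    # The run that immediately follows each run (NaN for the last run)
    nxt = runs.shift(-1)

    keep = (
        runs["sign"].eq("pos")
        & nxt["sign"].eq("neg")
        & runs["length"].ge(min_len)
        & nxt["length"].ge(min_len)
    )
    out = pd.DataFrame({
        "posIdx": runs.loc[keep, "end"],       # last row of the positive run
        "negIdx": nxt.loc[keep, "start"],      # first row of the negative run
        "posLength": runs.loc[keep, "length"],
        "negLength": nxt.loc[keep, "length"],
    })
    return out.astype(int).reset_index(drop=True)


data = pd.DataFrame(
    {"value": [1, 1, 2, -13, -4, -5, 6, 17, 8, 9, -10, -11, -12, -13, -14, 15]}
)
result = find_pos_neg_pairs(data)
print(result)
#    posIdx  negIdx  posLength  negLength
# 0       2       3          3          3
# 1       9      10          4          5
```

Because the lengths are tested on neighbouring runs of the unfiltered data, a short run sitting between two long ones prevents a match instead of being skipped over.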

This is just an alternative, and I did not benchmark its speed:

First, create a 'sign' column indicating whether a number is positive or negative.

Second, create a 'check' column to indicate the row at which the sign changes from positive to negative or from negative to positive. A value of -1 means a change from positive to negative; +1 means the reverse.
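
A minimal sketch of these two columns on a short, made-up series (the values are illustrative only) shows where -1 and +1 land:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, -3, -4, 5])

# 1 for non-negative values, 0 for negative ones
sign = pd.Series(np.where(s.lt(0), 0, 1), index=s.index)

# Difference from the previous row's sign; the first row is compared with 0
check = sign.sub(sign.shift().fillna(0))

print(check.tolist())  # [1.0, 0.0, -1.0, 0.0, 1.0]
```

With `fillna(0)`, the first row is flagged with +1 only when it is positive; a series that starts with a negative value gets 0 in its first row, so that leading run is not marked.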

Next, get the indices where check is -1 (neg_ids) and +1 (pos_ids).
I use functions from more-itertools to interleave neg_ids and pos_ids. The aim is to get the chunks of rows that are entirely positive or entirely negative.
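
If more-itertools is not available, the same chunk boundaries can be sketched with the standard library alone. This swaps interleaving for sorting, which assumes only that run starts alternate in sign; the two id lists are the ones this data produces.

```python
pos_ids = [0, 6, 15]   # rows where a positive run starts (check == 1)
neg_ids = [3, 10]      # rows where a negative run starts (check == -1)

# Run starts alternate in sign, so sorting the combined list gives the
# same order as interleaving the two lists.
boundaries = sorted(pos_ids + neg_ids)

# Pair each boundary with the next one: (start, end) of every chunk
chunks = list(zip(boundaries, boundaries[1:]))

print(chunks)  # [(0, 3), (3, 6), (6, 10), (10, 15)]
```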

The next phase is a for loop that applies iloc to each tuple in the outcome variable and checks whether all the values in the 'value' column are positive or all are negative. Depending on the sign, we append the results to keys in a dictionary 'K'. Note that posIdx is the last row of a wholly positive chunk, while negIdx is the first row of a negative chunk. An iloc slice runs from start up to, but not including, end, so posIdx is end-1, while negIdx is simply start, with no addition or subtraction needed.
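
The half-open slicing behind the end-1 adjustment can be checked on the first few rows of the data; this is only an illustration of `iloc`, not part of the solution:

```python
import pandas as pd

df_demo = pd.DataFrame({"value": [1, 1, 2, -13, -4, -5]})

start, end = 0, 3
chunk = df_demo.iloc[start:end, 0]   # rows 0, 1, 2; row 3 is excluded

print(chunk.tolist())                # [1, 1, 2]
print(chunk.index[-1] == end - 1)    # True
```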

The last phase is to load the data into a dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame([1,1,2,-13,-4,-5,6,17,8,9,-10,-11,-12,-13,-14,15], 
        index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
        columns=['value'])

df['sign'] = np.where(df.value.lt(0),0,1)
df['check'] = df.sign.sub(df.sign.shift().fillna(0))

neg_ids = df.loc[df.check==-1].index.tolist()
pos_ids = df.loc[df.check==1].index.tolist()

from more_itertools import interleave_longest, windowed
outcome = list(interleave_longest(pos_ids,neg_ids))
outcome = list(windowed(outcome,2))

print(outcome)

[(0, 3), (3, 6), (6, 10), (10, 15)]

from collections import defaultdict

K = defaultdict(list)

for start, end in outcome:
    checker = df.iloc[start:end,0]
    if checker.ge(0).all() and checker.shape[0]>2:
        K['posIdx'].append(end-1)
        K['posLength'].append(checker.shape[0])
    elif checker.lt(0).all() and checker.shape[0]>2:
        K['negIdx'].append(start)
        K['negLength'].append(checker.shape[0])

pd.DataFrame(K)

   posIdx   posLength   negIdx  negLength
0     2        3          3         3
1     9        4          10        5
