简体   繁体   中英

pandas dataframe - How to find consecutive rows that meet some conditions?

I'm trying to make a program that finds consecutive rows that meet some conditions. For example, if there's a dataframe that looks like this:

df = pd.DataFrame([1,1,2,-13,-4,-5,6,17,8,9,-10,-11,-12,-13,14,15], 
            index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
            columns=['value'])

>>> df
    value
0       1
1       1
2       2
3     -13
4      -4
5      -5
6       6
7      17
8       8
9       9
10    -10
11    -11
12    -12
13    -13
14    -14
15     15

I want it to return a dataframe that shows rows that meet the conditions below:

1) the order has to be (positive rows) and (negative rows) , not the other way around.

2) each positive or negative group of rows has to have at least 3 rows

3) positive and negatives groups have to be adjacent to each other

          posIdx,   negIdx,  posLength,  negLength
0              2          3           3          3    # (1,1,2) (-13,-4,-5)
1              9         10           4          5    # (6,17,8,9) (-10,-11,-12,-13,-14)

Are there any simple ways to do this using python or pandas commands?

I create helper columns for easy verify solution:

#column for negative and positive
df['sign'] = np.where(df['value'] < 0, 'neg','pos')
#consecutive groups
df['g'] = df['sign'].ne(df['sign'].shift()).cumsum()

#removed groups with length more like 2
df = df[df['g'].map(df['g'].value_counts()).gt(2)]

#tested if order `pos-neg` of groups, if not removed groups
m1 = df['sign'].eq('pos') & df['sign'].shift(-1).eq('neg')
m2 = df['sign'].eq('neg') & df['sign'].shift().eq('pos')
groups = df.loc[m1 | m2, 'g']
df = df[df['g'].isin(groups)].copy()

df['pairs'] = (df['sign'].ne(df['sign'].shift()) & df['sign'].eq('pos')).cumsum()
print (df)
    value sign  g  pairs
0       1  pos  1      1
1       1  pos  1      1
2       2  pos  1      1
3     -13  neg  2      1
4      -4  neg  2      1
5      -5  neg  2      1
6       6  pos  3      2
7      17  pos  3      2
8       8  pos  3      2
9       9  pos  3      2
10    -10  neg  4      2
11    -11  neg  4      2
12    -12  neg  4      2
13    -13  neg  4      2

Last aggregate GroupBy.first for all groups and counts by GroupBy.size and named aggregation (pandas 0.25+), sorting columns and flatten MultiIndex, last correct Idx_pos for subtract 1 :

df1 = (df.reset_index()
         .groupby(['pairs','g', 'sign'])
         .agg(Idx=('index','first'),  Length=('sign','size'))
         .reset_index(level=1, drop=True)
         .unstack()
         .sort_index(axis=1, level=[0,1], ascending=[True, False])
         )
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1['Idx_pos'] = df1['Idx_neg'] - 1
print (df1)
       Idx_pos  Idx_neg  Length_pos  Length_neg
pairs                                          
1            2        3           3           3
2            9       10           4           4

This is just an alternative, and I did not benchmark this speed:

First, create a 'sign' column, indicating if a number is positive or negative.

Second, create a 'check' column as well, to indicate at what row, the change from positive to negative, or negative to positive occurred. If it is a -1, it implies a change from +ve to -ve; the reverse implies +1.

Next step, get the indices, where check is -1(neg_ids) and +1(pos_ids)
I use functions from more-itertools to intersperse the neg_ids and pos_ids. The aim is to get those chunks of rows that are altogether positive or negative.

Next phase is to run a for loop that uses the iloc function for each tuple created in the outcome variable, and find out if all the values in the 'value' column is positive or negative. Depending on the sign, we assign the results to keys in a 'K' dictionary. Note that posIdx will be the last row in that chunk (for wholly positive values), while for negIdx it will be the first row in the negative chunk. iloc does a start: end-1, so posIdx will be a end-1, while for negIdx, start does not need any addition or subtraction.

Last phase is to read the data into a dataframe

df = pd.DataFrame([1,1,2,-13,-4,-5,6,17,8,9,-10,-11,-12,-13,-14,15], 
        index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
        columns=['value'])

df['sign'] = np.where(df.value.lt(0),0,1)
df['check'] = df.sign.sub(df.sign.shift().fillna(0))

neg_ids = df.loc[df.check==-1].index.tolist()
pos_ids = df.loc[df.check==1].index.tolist()

from more_itertools import interleave_longest, windowed
outcome = list(interleave_longest(pos_ids,neg_ids))
outcome = list(windowed(outcome,2))

print(outcome)

[(0, 3), (3, 6), (6, 10), (10, 15)]

from collections import defaultdict

K = defaultdict(list)

for start, end in outcome:
    checker = df.iloc[start:end,0]
    if checker.ge(0).all() and checker.shape[0]>2:
        K['posIdx'].append(end-1)
        K['posLength'].append(checker.shape[0])
    elif checker.lt(0).all() and checker.shape[0]>2:
       K['negIdx'].append(start)
       K['negLength'].append(checker.shape[0])

pd.DataFrame(K)

   posIdx   posLength   negIdx  negLength
0     2        3          3         3
1     9        4          10        5

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM