How can I delete a sequence of rows based on a condition?

Question

I have the following dataframe:

    id outcome
0    3      no
1    3      no
2    3      no
3    3     yes
4    3      no
5    5      no
6    5      no
7    5     yes
8    5      no
9    5     yes
10   6      no
11   6      no
12   6     yes
13   6      no
14   6      no

I want to remove the no outcomes at the start of each sequence, before the first yes, and keep all other no outcomes, so the output dataframe looks like this:

    id outcome
3    3     yes
4    3      no
7    5     yes
8    5      no
9    5     yes
12   6     yes
13   6      no
14   6      no

At the moment I have tried this:

import pandas as pd

df = pd.DataFrame(data={
       'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6],
       'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'no']
     })


df = df[df.groupby('id').outcome.transform(lambda x: x.ne('no'))]

However, this simply removes all no outcomes.

I know I then need to take the index of these rows and remove them from the dataframe. Any suggestions?

Answer 1

Use groupby with cumsum to count the yes values seen so far within each id, so that every 'no' at the start of a group is marked with a 0:

df['no_group'] = df.groupby('id')['outcome'].transform(lambda x: x.eq('yes').cumsum())

Now, the number of 'no's to remove is:

num_no_to_remove = (df['no_group'] == 0).sum()

And the wanted dataframe can be obtained by filtering:

df.loc[df['no_group'] > 0].drop(columns=['no_group'])

Result:

    id  outcome
3    3      yes
4    3       no
7    5      yes
8    5       no
9    5      yes
12   6      yes
13   6       no
14   6       no
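
As a self-contained check of the cumsum approach, the sketch below rebuilds the 15-row dataframe shown in the question and computes the same running count without a helper column. The variable names (seen_yes, result, leading_no_per_id) are illustrative and not part of the original answer.

```python
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame({
    'id': [3] * 5 + [5] * 5 + [6] * 5,
    'outcome': ['no', 'no', 'no', 'yes', 'no',
                'no', 'no', 'yes', 'no', 'yes',
                'no', 'no', 'yes', 'no', 'no'],
})

# Running count of 'yes' values within each id; 0 means no 'yes' seen yet.
seen_yes = df['outcome'].eq('yes').astype(int).groupby(df['id']).cumsum()

# Keep every row from the first 'yes' of its id onward.
result = df[seen_yes > 0]

# Number of leading 'no' rows dropped, per id (ids in sorted order: 3, 5, 6).
leading_no_per_id = seen_yes.eq(0).groupby(df['id']).sum()

print(result.index.tolist())       # [3, 4, 7, 8, 9, 12, 13, 14]
print(leading_no_per_id.tolist())  # [3, 2, 2]
```

One consequence of this approach: an id that contains no yes at all keeps a count of 0 throughout, so all of its rows are dropped.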

Answer 2

To keep all the yes values and only the no values that come after the last yes of each group, this code will do the trick:

import numpy as np

df = df[(df.replace({'no': np.nan, 'yes': 1}).groupby('id')['outcome'].bfill() != 1) | (df['outcome'] == 'yes')]

Output:

>>> df
    id outcome
3    3     yes
4    3      no
5    3      no
8    5     yes
9    5     yes
12   6     yes

(In the original df, I added a second no at the end of group 3 to make sure it works for multiple no's at the end.)

Essentially, the code does the following:

1. Replaces yes values with an arbitrary value (1 in this case).
2. Replaces no values with NaN (which is important, because bfill only fills NaN values).
3. Groups the rows by their id.
4. Within each group, backfills: every NaN row that comes before a non-NaN row takes the value of the next non-NaN row. Since the yes's are 1 and the no's are NaN, every no except those after the last yes of the group is replaced with the arbitrary number (1).
5. Creates a mask that selects the no values that are still NaN, i.e. the last no values of each group.
6. Creates a second mask that selects all yes values.
7. Combines the two masks to return all yes values plus all no values that are at the end of a group.
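
The same logic can be sketched with named intermediates. This is a minimal illustration rather than the answer's exact code: it uses Series.map instead of replace, and the names marker, filled, trailing_no and is_yes are made up here. The sample data is a shortened version of the question's table. Note that under this logic a no that sits between two yes values (index 8 below) is dropped, because only the no values after the last yes of a group survive the backfill.

```python
import numpy as np
import pandas as pd

# First two id groups from the question's table.
df = pd.DataFrame({
    'id': [3] * 5 + [5] * 5,
    'outcome': ['no', 'no', 'no', 'yes', 'no',
                'no', 'no', 'yes', 'no', 'yes'],
})

# Mark 'yes' as 1.0 and 'no' as NaN.
marker = df['outcome'].map({'yes': 1.0, 'no': np.nan})

# Backfill within each id: a NaN takes the next non-NaN value in its group.
filled = marker.groupby(df['id']).bfill()

# Rows still NaN are 'no' values with no later 'yes' in their group.
trailing_no = filled.isna()

# Rows that are 'yes' themselves.
is_yes = df['outcome'].eq('yes')

# Keep trailing 'no' rows and all 'yes' rows.
result = df[trailing_no | is_yes]
print(result.index.tolist())  # [3, 4, 7, 9]
```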

As for counting the no's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently.

2 answers

solution1
2 ACCEPTED 2021-11-17 02:07:59

solution2
1 2021-11-17 01:52:45
