I have the following dataframe:
id outcome
0 3 no
1 3 no
2 3 no
3 3 yes
4 3 no
5 5 no
6 5 no
7 5 yes
8 5 no
9 5 yes
10 6 no
11 6 no
12 6 yes
13 6 no
14 6 no
I want to remove the no
outcomes at the start of a sequence before a yes
, and keep all other no
outcomes, so the output dataframe looks like this:
id outcome
3 3 yes
4 3 no
7 5 yes
8 5 no
9 5 yes
12 6 yes
13 6 no
14 6 no
At the moment I have tried this:
df = pd.DataFrame(data={
'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 6, 6, 6, 6, 6],
'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'no']
})
df = df[df.groupby('id').outcome.transform(lambda x: x.ne('no'))]
However, this simply removes all no
outcomes.
I know I then need to take the index of these rows and remove them from the dataframe. Any suggestions?
Use groupby
with cumsum
to mark all 'no' at the start with a 0:
df['no_group'] = df.groupby('id')['outcome'].apply(lambda x: x.eq('yes').cumsum())
Now, the number of 'no's to remove is:
num_no_to_remove = (df['no_group'] == 0).sum()
And the wanted dataframe can be obtained by filtering:
df.loc[df['no_group'] > 0].drop(columns=['no_group'])
Result:
id outcome
3 3 yes
4 3 no
7 5 yes
8 5 no
9 5 yes
12 6 yes
13 6 no
14 6 no
For keeping only last no
values of each group and all the yes
values, this code will do the trick:
df = df[(df.replace({'no': np.nan, 'yes': 1}).groupby('id')['outcome'].bfill() != 1) | (df['outcome'] == 'yes')]
Output:
>>> df
id outcome
3 3 yes
4 3 no
5 3 no
8 5 yes
9 5 yes
12 6 yes
(In the original df
, I added a second no
the end of group 3
to make sure it works for multiple no
's at the end).
Essentially what the code does is it
yes
values with an arbitrary value ( 1
) in this caseno
values with NaN (which is important!) yes
's are 1
and the no
's are NaN, this will cause everything except the last no
's of the group to be replaced with the arbitrary number ( 1
)no
values of each groupyes
valuesyes
values, and all no
values that are at the end of a group. For the question regarding the count of no
's at the beginning, I think you should ask a new question for that, because it's a different problem that has to be solved differently.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.