I have a Pandas DataFrame that looks similar to:
import pandas as pd
df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove'],
                   ],
                  columns=['id', 'time', 'equipment'])
id time equipment
0 a 2018-09-30 00:03:00 that is a glove
1 b 2018-09-30 00:04:00 this is a glove
2 b 2018-09-30 00:09:00 she has ball
3 a 2018-09-30 00:05:00 they have a ball
4 a 2018-09-30 00:01:00 she has a shoe
5 c 2018-09-30 00:04:00 I have a baseball
6 a 2018-09-30 00:02:00 this is a hat
7 a 2018-09-30 00:06:00 he has no helmet
8 b 2018-09-30 00:11:00 he has no shoe
9 c 2018-09-30 00:02:00 we have a hat
10 a 2018-09-30 00:04:00 we have a baseball
11 c 2018-09-30 00:06:00 they have no glove
What I'd like to do is `groupby` the `id` and, within each group, sort by the `time` and then return every row up to and including the row that has the word "ball". So far, I can group and sort:
df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
5 a 2018-09-30 00:06:00 he has no helmet
6 b 2018-09-30 00:04:00 this is a glove
7 b 2018-09-30 00:09:00 she has ball
8 b 2018-09-30 00:11:00 he has no shoe
9 c 2018-09-30 00:02:00 we have a hat
10 c 2018-09-30 00:04:00 I have a baseball
11 c 2018-09-30 00:06:00 they have no glove
However, I want the output to look like:
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
6 b 2018-09-30 00:04:00 this is a glove
7 b 2018-09-30 00:09:00 she has ball
Notice that the group `c` has no rows being returned since it has no rows with the word "ball". Group `c` has the word "baseball" but that is not the match that we are looking for. Similarly, notice that group `a` doesn't stop at the "baseball" row since we are stopping at the row with "ball". What is the most efficient way to accomplish this both from a speed perspective as well as a memory perspective?
Continuing with what you have done:
# sort by time within each group, as in the question
new_df = df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)
# flag rows containing the standalone word "ball" (not "baseball")
new_df["mask"] = new_df["equipment"].str.contains(r"\bball\b", regex=True)
# per group, keep rows up to and including the first flagged row;
# groups without any flagged row contribute nothing
result = (new_df.groupby("id")
          .apply(lambda x: x.iloc[:x["mask"].to_numpy().argmax() + 1 if x["mask"].any() else 0])
          .reset_index(drop=True)
          .drop("mask", axis=1))
print(result)
Output:
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
5 b 2018-09-30 00:04:00 this is a glove
6 b 2018-09-30 00:09:00 she has ball
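The `\bball\b` pattern in the code above is what keeps "baseball" from counting as a match. As a small illustration (the three phrases are copied from the question's `equipment` column; the variable name `phrases` is just for this sketch), compare a plain substring search with the word-boundary version:

```python
import pandas as pd

# Three phrases taken from the question's `equipment` column.
phrases = pd.Series(['she has ball', 'we have a baseball', 'they have a ball'])

# Plain substring search: "baseball" matches as well.
print(phrases.str.contains('ball', regex=False).tolist())
# [True, True, True]

# Word-boundary search: only the standalone word "ball" matches.
print(phrases.str.contains(r'\bball\b', regex=True).tolist())
# [True, False, True]
```

`\b` matches the position between a word character and a non-word character (or the start/end of the string), so the "ball" inside "baseball" is rejected because it is preceded by the letter "e".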
Here's my approach:
# the expected output is sorted by id and time,
# so start by sorting the whole frame
df = df.sort_values(['id', 'time'])
# mark the rows containing the word `ball`
has_ball = df.equipment.str.contains(r'\bball\b')
# cumulative number of rows with `ball` within each group
s = has_ball.groupby(df['id']).cumsum()
# True for every row of a group that has at least one `ball` row
valid_groups = has_ball.groupby(df['id']).transform('max')
print(df[valid_groups &
         (s.eq(0) |              # rows before the first `ball`
          (s.eq(1) & has_ball))  # the first row containing `ball`
         ])
Output:
id time equipment
4 a 2018-09-30 00:01:00 she has a shoe
6 a 2018-09-30 00:02:00 this is a hat
0 a 2018-09-30 00:03:00 that is a glove
10 a 2018-09-30 00:04:00 we have a baseball
3 a 2018-09-30 00:05:00 they have a ball
1 b 2018-09-30 00:04:00 this is a glove
2 b 2018-09-30 00:09:00 she has ball
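The same idea can probably be wrapped in a reusable function. The following is only a sketch under a few assumptions: the helper name `rows_through_first_match` and its `pattern` parameter are made up for illustration, the column names are those from the question, and `time` is sorted as a string, which works here only because the timestamps are zero-padded ISO-style text. Instead of testing `s.eq(0) | (s.eq(1) & has_ball)`, it counts the matches seen strictly before each row, which should select the same rows.

```python
import pandas as pd

def rows_through_first_match(df, pattern=r'\bball\b'):
    """Hypothetical helper: per `id`, keep rows up to and including the first match."""
    out = df.sort_values(['id', 'time'])
    # 1 where `equipment` contains the pattern, 0 otherwise
    hit = out['equipment'].str.contains(pattern, regex=True).astype(int)
    # number of matching rows strictly before the current row, within each group
    seen_before = hit.groupby(out['id']).cumsum() - hit
    # True for every row of a group that has at least one match
    group_has_hit = hit.groupby(out['id']).transform('max').eq(1)
    return out[group_has_hit & seen_before.eq(0)]

# The sample data from the question.
df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove']],
                  columns=['id', 'time', 'equipment'])

result = rows_through_first_match(df)
print(result)
```

Because everything is done with one sort and two grouped vectorized operations, no Python-level `apply` over the groups is needed. The original row labels are kept; call `reset_index(drop=True)` on the result if a fresh 0..n-1 index is preferred.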