Pandas DataFrame: Groupby Column, Sort By DateTime, and Truncate Group by Condition
I have a Pandas DataFrame that looks similar to:
import pandas as pd
df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
['b', '2018-09-30 00:04:00', 'this is a glove'],
['b', '2018-09-30 00:09:00', 'she has ball'],
['a', '2018-09-30 00:05:00', 'they have a ball'],
['a', '2018-09-30 00:01:00', 'she has a shoe'],
['c', '2018-09-30 00:04:00', 'I have a baseball'],
['a', '2018-09-30 00:02:00', 'this is a hat'],
['a', '2018-09-30 00:06:00', 'he has no helmet'],
['b', '2018-09-30 00:11:00', 'he has no shoe'],
['c', '2018-09-30 00:02:00', 'we have a hat'],
['a', '2018-09-30 00:04:00', 'we have a baseball'],
['c', '2018-09-30 00:06:00', 'they have no glove'],
],
columns=['id', 'time', 'equipment'])
id time equipment
0 a 2018-09-30 00:03:00 that is a glove
1 b 2018-09-30 00:04:00 this is a glove
2 b 2018-09-30 00:09:00 she has ball
3 a 2018-09-30 00:05:00 they have a ball
4 a 2018-09-30 00:01:00 she has a shoe
5 c 2018-09-30 00:04:00 I have a baseball
6 a 2018-09-30 00:02:00 this is a hat
7 a 2018-09-30 00:06:00 he has no helmet
8 b 2018-09-30 00:11:00 he has no shoe
9 c 2018-09-30 00:02:00 we have a hat
10 a 2018-09-30 00:04:00 we have a baseball
11 c 2018-09-30 00:06:00 they have no glove
What I'd like to do is group by id and, within each group, sort by time and then return every row up to and including the row that has the word "ball".
So far, I can group and sort:
df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
5 a 2018-09-30 00:06:00 he has no helmet
6 b 2018-09-30 00:04:00 this is a glove
7 b 2018-09-30 00:09:00 she has ball
8 b 2018-09-30 00:11:00 he has no shoe
9 c 2018-09-30 00:02:00 we have a hat
10 c 2018-09-30 00:04:00 I have a baseball
11 c 2018-09-30 00:06:00 they have no glove
However, I want the output to look like:
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
6 b 2018-09-30 00:04:00 this is a glove
7 b 2018-09-30 00:09:00 she has ball
Notice that group c has no rows returned, since it has no rows with the word "ball". Group c has the word "baseball", but that is not the match we are looking for. Similarly, notice that group a doesn't stop at the "baseball" row, since we are stopping at the row with "ball".
What is the most efficient way to accomplish this, in terms of both speed and memory?
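The difference between matching "ball" and "baseball" can be expressed with a regex word boundary. The snippet below is a minimal sketch; the `equipment` Series and the variable names are made up here for illustration and are not part of the original question.

```python
import pandas as pd

# A few sample strings taken from the equipment column above
equipment = pd.Series([
    'we have a baseball',
    'they have a ball',
    'she has ball',
    'he has no helmet',
])

# \b marks a word boundary, so "ball" inside "baseball" does not match
whole_word = equipment.str.contains(r'\bball\b')

# A plain substring search also matches "baseball"
substring = equipment.str.contains('ball')

print(whole_word.tolist())  # [False, True, True, False]
print(substring.tolist())   # [True, True, True, False]
```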
Continuing with what you have done:
new_df = df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)
new_df["mask"] = new_df.groupby("id").apply(lambda x: x["equipment"].str.contains(r"\bball\b",regex=True)).reset_index(drop=True)
result = (new_df.groupby("id").apply(lambda x : x.iloc[:x.reset_index(drop=True)["mask"].
idxmax()+1 if x["equipment"].str.contains(r"\bball\b",regex=True).any() else 0])
.reset_index(drop=True).drop("mask",axis=1))
print (result)
#
id time equipment
0 a 2018-09-30 00:01:00 she has a shoe
1 a 2018-09-30 00:02:00 this is a hat
2 a 2018-09-30 00:03:00 that is a glove
3 a 2018-09-30 00:04:00 we have a baseball
4 a 2018-09-30 00:05:00 they have a ball
5 b 2018-09-30 00:04:00 this is a glove
6 b 2018-09-30 00:09:00 she has ball
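The same per-group truncation can also be written as an explicit loop, which may be easier to read than the nested apply calls. This is only a sketch and not the answer's own code: the names `sorted_df`, `parts` and `result` are chosen here for illustration, and it assumes at least one group contains a match (otherwise `pd.concat` raises on an empty list).

```python
import pandas as pd

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove']],
                  columns=['id', 'time', 'equipment'])

# Sort once; the timestamps share one fixed format, so string order is time order
sorted_df = df.sort_values(['id', 'time'])

parts = []
for _, group in sorted_df.groupby('id'):
    # Boolean array: True where the whole word "ball" appears
    mask = group['equipment'].str.contains(r'\bball\b').to_numpy()
    if mask.any():
        # argmax gives the position of the first True; keep rows through it
        parts.append(group.iloc[:mask.argmax() + 1])

# Assumption: parts is non-empty, i.e. some group has a match
result = pd.concat(parts, ignore_index=True)
print(result)
```

Groups without a match are simply skipped, so group c contributes no rows.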
Here's my approach:
# as the final expected output is sorted by id and time
# we start by doing so to the whole data
df = df.sort_values(['id','time'])
# mark the rows containing the word `ball`
has_ball = (df.equipment.str.contains(r'\bball\b') )
# running count of rows with `ball` within each group
s = has_ball.groupby(df['id']).cumsum()
# True for every row of a group that has at least one row with `ball`
valid_groups = has_ball.groupby(df['id']).transform('max')
print(df[valid_groups &
(s.eq(0) | # rows before the first `ball` row
(s.eq(1) & has_ball) # the first `ball` row itself
)
]
)
Output:
id time equipment
4 a 2018-09-30 00:01:00 she has a shoe
6 a 2018-09-30 00:02:00 this is a hat
0 a 2018-09-30 00:03:00 that is a glove
10 a 2018-09-30 00:04:00 we have a baseball
3 a 2018-09-30 00:05:00 they have a ball
1 b 2018-09-30 00:04:00 this is a glove
2 b 2018-09-30 00:09:00 she has ball
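For reuse, the vectorized approach can be wrapped in a function. The sketch below is one possible packaging and not the answerer's code: the function name `rows_through_first_match` and its `pattern` parameter are hypothetical. It keeps a row when no match occurs strictly before it in its group, which is equivalent to the `s.eq(0) | (s.eq(1) & has_ball)` condition above.

```python
import pandas as pd

def rows_through_first_match(df, pattern=r'\bball\b'):
    """Hypothetical helper: per id, keep rows up to and including the first match."""
    ordered = df.sort_values(['id', 'time'])
    ids = ordered['id']
    has_match = ordered['equipment'].str.contains(pattern)
    # Number of matches strictly before each row within its group
    prior = has_match.groupby(ids).cumsum() - has_match.astype(int)
    # True for every row of a group that contains at least one match
    group_has_match = has_match.groupby(ids).transform('max')
    return ordered[group_has_match & prior.eq(0)]

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove']],
                  columns=['id', 'time', 'equipment'])

result = rows_through_first_match(df)
print(result)
```

The original row labels are preserved, so the result can be matched back to the input; call `reset_index(drop=True)` on it if a fresh 0..n-1 index is preferred.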