Pandas DataFrame: Groupby Column, Sort By DateTime, and Truncate Group by Condition

I have a Pandas DataFrame that looks similar to:

import pandas as pd

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove'],
                   ], 
                  columns=['id', 'time', 'equipment'])


   id                 time           equipment
0   a  2018-09-30 00:03:00     that is a glove
1   b  2018-09-30 00:04:00     this is a glove
2   b  2018-09-30 00:09:00        she has ball
3   a  2018-09-30 00:05:00    they have a ball
4   a  2018-09-30 00:01:00      she has a shoe
5   c  2018-09-30 00:04:00   I have a baseball
6   a  2018-09-30 00:02:00       this is a hat
7   a  2018-09-30 00:06:00    he has no helmet
8   b  2018-09-30 00:11:00      he has no shoe
9   c  2018-09-30 00:02:00       we have a hat
10  a  2018-09-30 00:04:00  we have a baseball
11  c  2018-09-30 00:06:00  they have no glove

What I'd like to do is group by id and, within each group, sort by time and then return every row up to and including the row that contains the word "ball". So far, I can group and sort:

df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)


   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
5   a  2018-09-30 00:06:00    he has no helmet
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball
8   b  2018-09-30 00:11:00      he has no shoe
9   c  2018-09-30 00:02:00       we have a hat
10  c  2018-09-30 00:04:00   I have a baseball
11  c  2018-09-30 00:06:00  they have no glove

However, I want the output to look like:

   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball

Notice that group c has no rows returned, since it has no rows with the word "ball". Group c has the word "baseball", but that is not the match we are looking for. Similarly, notice that group a doesn't stop at the "baseball" row, since we stop at the row with "ball". What is the most efficient way to accomplish this, from both a speed perspective and a memory perspective?
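The matching rule described above ("ball" as a whole word, not as part of "baseball") can be illustrated with a small sketch. It assumes a regular-expression word boundary (`\b`) is an acceptable way to express "the word ball"; the variable names are made up for illustration.

```python
import pandas as pd

# Three of the phrases from the sample data above.
equipment = pd.Series(["we have a baseball", "they have a ball", "she has ball"])

# A plain substring search also flags "baseball".
substring_hits = equipment.str.contains("ball", regex=False)

# \b marks a word boundary, so only the standalone word "ball" matches.
word_hits = equipment.str.contains(r"\bball\b", regex=True)

print(substring_hits.tolist())  # [True, True, True]
print(word_hits.tolist())       # [False, True, True]
```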

Continuing with what you have done:

new_df = df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)

new_df["mask"] = new_df.groupby("id").apply(lambda x: x["equipment"].str.contains(r"\bball\b",regex=True)).reset_index(drop=True)

result = (new_df.groupby("id").apply(lambda x : x.iloc[:x.reset_index(drop=True)["mask"].
                                     idxmax()+1 if x["equipment"].str.contains(r"\bball\b",regex=True).any() else 0])
          .reset_index(drop=True).drop("mask",axis=1))

print (result)

#
  id                 time           equipment
0  a  2018-09-30 00:01:00      she has a shoe
1  a  2018-09-30 00:02:00       this is a hat
2  a  2018-09-30 00:03:00     that is a glove
3  a  2018-09-30 00:04:00  we have a baseball
4  a  2018-09-30 00:05:00    they have a ball
5  b  2018-09-30 00:04:00     this is a glove
6  b  2018-09-30 00:09:00        she has ball

Here's my approach:

# as the final expected output is sorted by id and time
# we start by doing so to the whole data
df = df.sort_values(['id','time'])

# mark the rows containing the word `ball`
has_ball = (df.equipment.str.contains(r'\bball\b') )

# cumulative number of rows with `ball` in the group
s = has_ball.groupby(df['id']).cumsum()

# keep only groups that contain at least one row with `ball`
valid_groups = has_ball.groupby(df['id']).transform('max')

print(df[valid_groups &
         (s.eq(0) |              # rows before the first `ball` row
          (s.eq(1) & has_ball))  # the first `ball` row itself
        ])

Output:

   id                time           equipment
4   a 2018-09-30 00:01:00      she has a shoe
6   a 2018-09-30 00:02:00       this is a hat
0   a 2018-09-30 00:03:00     that is a glove
10  a 2018-09-30 00:04:00  we have a baseball
3   a 2018-09-30 00:05:00    they have a ball
1   b 2018-09-30 00:04:00     this is a glove
2   b 2018-09-30 00:09:00        she has ball
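For reuse, the same cumulative-count idea could be wrapped in a function. The following is only a sketch under stated assumptions: `truncate_at_first_match` is a hypothetical helper name, the column names `id`, `time`, and `equipment` are taken from the question, and the sample frame is a subset of the question's rows. Instead of testing for counts of 0 and 1 separately, it counts matches seen strictly before each row.

```python
import pandas as pd

def truncate_at_first_match(df, pattern=r"\bball\b"):
    """Hypothetical helper: keep each id's rows up to and including its first match."""
    out = df.sort_values(["id", "time"])
    hit = out["equipment"].str.contains(pattern, regex=True)
    # Number of matches seen strictly before the current row, counted per id.
    seen_before = hit.groupby(out["id"]).cumsum() - hit.astype(int)
    # True for every row of an id that has at least one match.
    has_match = hit.groupby(out["id"]).transform("any")
    return out[has_match & seen_before.eq(0)]

# A subset of the rows from the question, for illustration only.
sample = pd.DataFrame(
    [["a", "2018-09-30 00:05:00", "they have a ball"],
     ["a", "2018-09-30 00:04:00", "we have a baseball"],
     ["a", "2018-09-30 00:06:00", "he has no helmet"],
     ["b", "2018-09-30 00:09:00", "she has ball"],
     ["b", "2018-09-30 00:04:00", "this is a glove"],
     ["c", "2018-09-30 00:04:00", "I have a baseball"]],
    columns=["id", "time", "equipment"],
)

result = truncate_at_first_match(sample)
print(result)
```

On this subset, group a keeps its "baseball" and "ball" rows, group b keeps both of its rows, and group c is dropped entirely. The original row labels are preserved; call `reset_index(drop=True)` on the result if a fresh index is wanted.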
