Pandas DataFrame: Groupby Column, Sort By DateTime, and Truncate Group by Condition

I have a Pandas DataFrame that looks similar to:

import pandas as pd

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove'],
                   ], 
                  columns=['id', 'time', 'equipment'])


   id                 time           equipment
0   a  2018-09-30 00:03:00     that is a glove
1   b  2018-09-30 00:04:00     this is a glove
2   b  2018-09-30 00:09:00        she has ball
3   a  2018-09-30 00:05:00    they have a ball
4   a  2018-09-30 00:01:00      she has a shoe
5   c  2018-09-30 00:04:00   I have a baseball
6   a  2018-09-30 00:02:00       this is a hat
7   a  2018-09-30 00:06:00    he has no helmet
8   b  2018-09-30 00:11:00      he has no shoe
9   c  2018-09-30 00:02:00       we have a hat
10  a  2018-09-30 00:04:00  we have a baseball
11  c  2018-09-30 00:06:00  they have no glove

What I'd like to do is group by id and, within each group, sort by time and then return every row up to and including the first row that contains the word "ball". So far, I can group and sort:

df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)


   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
5   a  2018-09-30 00:06:00    he has no helmet
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball
8   b  2018-09-30 00:11:00      he has no shoe
9   c  2018-09-30 00:02:00       we have a hat
10  c  2018-09-30 00:04:00   I have a baseball
11  c  2018-09-30 00:06:00  they have no glove

However, I want the output to look like:

   id                 time           equipment
0   a  2018-09-30 00:01:00      she has a shoe
1   a  2018-09-30 00:02:00       this is a hat
2   a  2018-09-30 00:03:00     that is a glove
3   a  2018-09-30 00:04:00  we have a baseball
4   a  2018-09-30 00:05:00    they have a ball
6   b  2018-09-30 00:04:00     this is a glove
7   b  2018-09-30 00:09:00        she has ball

Notice that group c returns no rows, since none of its rows contains the word "ball". Group c does contain the word "baseball", but that is not the match we are looking for. Similarly, group a doesn't stop at the "baseball" row, since we stop only at the row with "ball". What is the most efficient way to accomplish this, in terms of both speed and memory?
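The "ball" versus "baseball" requirement maps naturally onto a regex word boundary. The following is a minimal sketch of that distinction only; the `phrases` Series and the two result names are made up for illustration and reuse a few strings from the data above.

```python
import pandas as pd

# A few phrases taken from the sample data (illustrative subset).
phrases = pd.Series(['we have a baseball',
                     'they have a ball',
                     'she has ball',
                     'I have a baseball'])

# A plain substring search also matches "baseball".
substring_hits = phrases.str.contains('ball', regex=False)

# \b marks a word boundary, so only the standalone word "ball" matches.
word_hits = phrases.str.contains(r'\bball\b', regex=True)

print(substring_hits.tolist())  # [True, True, True, True]
print(word_hits.tolist())       # [False, True, True, False]
```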

Continuing with what you have done:

# sort rows by time within each id (same as in the question)
new_df = df.groupby('id').apply(lambda x: x.sort_values(['time'], ascending=True)).reset_index(drop=True)

# flag rows containing the standalone word "ball"; \b keeps "baseball" from matching
new_df["mask"] = new_df["equipment"].str.contains(r"\bball\b", regex=True)

# per group, keep rows up to and including the first flagged row;
# keep no rows at all if the group has no flagged row
result = (new_df.groupby("id")
          .apply(lambda x: x.iloc[:x.reset_index(drop=True)["mask"].idxmax() + 1
                                  if x["mask"].any() else 0])
          .reset_index(drop=True)
          .drop("mask", axis=1))

print(result)

#
  id                 time           equipment
0  a  2018-09-30 00:01:00      she has a shoe
1  a  2018-09-30 00:02:00       this is a hat
2  a  2018-09-30 00:03:00     that is a glove
3  a  2018-09-30 00:04:00  we have a baseball
4  a  2018-09-30 00:05:00    they have a ball
5  b  2018-09-30 00:04:00     this is a glove
6  b  2018-09-30 00:09:00        she has ball

Here's my approach:

# as the final expected output is sorted by id and time
# we start by doing so to the whole data
df = df.sort_values(['id','time'])

# mark the rows containing the standalone word `ball` (`\b` excludes `baseball`)
has_ball = df['equipment'].str.contains(r'\bball\b')

# cumulative number of rows with `ball` in the group
s = has_ball.groupby(df['id']).cumsum()

# True for every row of a group that has at least one row with `ball`
valid_groups = has_ball.groupby(df['id']).transform('max')

print(df[valid_groups &
         (s.eq(0) |               # rows before the first `ball`
          (s.eq(1) & has_ball))   # the first row containing `ball`
        ])

Output:

   id                time           equipment
4   a 2018-09-30 00:01:00      she has a shoe
6   a 2018-09-30 00:02:00       this is a hat
0   a 2018-09-30 00:03:00     that is a glove
10  a 2018-09-30 00:04:00  we have a baseball
3   a 2018-09-30 00:05:00    they have a ball
1   b 2018-09-30 00:04:00     this is a glove
2   b 2018-09-30 00:09:00        she has ball
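For reuse, the same cumulative-sum idea can be wrapped in a function. This is only a sketch under a few assumptions: the helper name `truncate_at_first_match` and its parameters are hypothetical, the `time` strings are parsed with `pd.to_datetime` so that the sort is chronological, and the sample data is repeated so the block runs on its own.

```python
import pandas as pd

def truncate_at_first_match(df, pattern, group_col="id",
                            time_col="time", text_col="equipment"):
    """Per group, keep rows up to and including the first row matching pattern.

    Hypothetical helper for illustration; groups without a match are dropped.
    """
    # Parse timestamps so sorting is chronological, then sort by group and time.
    out = df.assign(**{time_col: pd.to_datetime(df[time_col])})
    out = out.sort_values([group_col, time_col])

    # Boolean flag per row: does the text match the pattern?
    hit = out[text_col].str.contains(pattern, regex=True)

    # Running count of matches inside each group, in time order.
    running = hit.astype(int).groupby(out[group_col]).cumsum()

    # True for every row of a group that has at least one match.
    has_any = hit.groupby(out[group_col]).transform("max")

    # Rows before the first match have running == 0; the first match itself
    # is the only flagged row with running == 1.
    keep = has_any & (running.eq(0) | (running.eq(1) & hit))
    return out[keep]

df = pd.DataFrame([['a', '2018-09-30 00:03:00', 'that is a glove'],
                   ['b', '2018-09-30 00:04:00', 'this is a glove'],
                   ['b', '2018-09-30 00:09:00', 'she has ball'],
                   ['a', '2018-09-30 00:05:00', 'they have a ball'],
                   ['a', '2018-09-30 00:01:00', 'she has a shoe'],
                   ['c', '2018-09-30 00:04:00', 'I have a baseball'],
                   ['a', '2018-09-30 00:02:00', 'this is a hat'],
                   ['a', '2018-09-30 00:06:00', 'he has no helmet'],
                   ['b', '2018-09-30 00:11:00', 'he has no shoe'],
                   ['c', '2018-09-30 00:02:00', 'we have a hat'],
                   ['a', '2018-09-30 00:04:00', 'we have a baseball'],
                   ['c', '2018-09-30 00:06:00', 'they have no glove']],
                  columns=['id', 'time', 'equipment'])

result = truncate_at_first_match(df, r'\bball\b')
print(result)
```

Everything here is a vectorized column operation, so no Python-level function is called once per group. The original index labels are kept; append `.reset_index(drop=True)` if a fresh 0..n-1 index is preferred.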


 