I am attempting to extract data from a Google Spreadsheet that is formatted to look like a calendar in order to reformat the data to be batch-uploaded to an information management system we use at work. The final CSV has to have very specific formatting, and I am one step away from a final product.
My current DataFrame looks something like this:
description                     event_type  start_date  end_date
Training                        *Required   6/06/2020
New Staff on duty               *Required   6/12/2020
Orientation                     *Required   6/12/2020
Group 1 Closed Session          *Required   6/12/2020
Group 1 Closed Session          *Required   6/13/2020
Group 1 Closed Session          *Required   6/14/2020
Group 1 Closed Session          *Required   6/15/2020
Group 1 Closed Session          *Required   6/16/2020
All Staff on duty               *Required   6/19/2020
Group 1 Closed Session          *Required   6/19/2020
Group 1 Closed Session          *Required   6/20/2020
Group 1 Closed Session          *Required   6/21/2020
Group 1 Closed Session          *Required   6/22/2020
Consumer outreach orientation   *Required   6/25/2020
Some event on just another day  *Required   6/25/2020
All Staff Meeting               *Required   6/28/2020
(The above is only the important slice of the full dataset. I've also changed the content of the data, so I apologize if the descriptions aren't very realistic.)
Rather than have "Group 1 Closed Session" listed multiple times on several consecutive days, I need to span those dates with a single row--with the first day in the "start_date" column and last date in the "end_date" column. I also need to do that for each group of "Group 1 Closed Sessions", as they span two different date sets.
This example is what I am trying to achieve:
description                     event_type  start_date  end_date
Training                        *Required   6/06/2020
New Staff on duty               *Required   6/12/2020
Orientation                     *Required   6/12/2020
Group 1 Closed Session          *Required   6/12/2020   6/16/2020
All Staff on duty               *Required   6/19/2020
Group 1 Closed Session          *Required   6/19/2020   6/22/2020
Consumer outreach orientation   *Required   6/25/2020
Some event on just another day  *Required   6/25/2020
All Staff Meeting               *Required   6/28/2020
Also, not all of the consecutively-listed events will have the same description, so I was hoping to find a solution where that does not matter.
Any thoughts or leads? I appreciate any help on this.
Try:
df.groupby((df['description'] != df['description'].shift()).cumsum()).first()
Output:
             description                     event_type  start_date  end_date
description
1            Training                        *Required   6/06/2020
2            New Staff on duty               *Required   6/12/2020
3            Orientation                     *Required   6/12/2020
4            Group 1 Closed Session          *Required   6/12/2020
5            All Staff on duty               *Required   6/19/2020
6            Group 1 Closed Session          *Required   6/19/2020
7            Consumer outreach orientation   *Required   6/25/2020
8            Some event on just another day  *Required   6/25/2020
9            All Staff Meeting               *Required   6/28/2020
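To see why this grouping works, it can help to look at the grouping key on its own. The following is only a minimal sketch on a made-up Series (the letters stand in for the real `description` values and are not from the question's data):

```python
import pandas as pd

# Hypothetical stand-in for df['description']
desc = pd.Series(["A", "B", "B", "B", "C", "B", "B"])

# True wherever a value differs from the row before it
# (the first row is compared with NaN, so it is always True)
is_new_run = desc != desc.shift()

# The running total of those flags gives each consecutive run its own label
run_id = is_new_run.cumsum()

print(run_id.tolist())  # [1, 2, 2, 2, 3, 4, 4]
```

Note that the two separate runs of "B" get different labels (2 and 4), which is what keeps the two "Group 1 Closed Session" blocks apart instead of merging them the way a plain groupby on the description would.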
You can reuse the same groupby from Scott Boston's answer to also get the last row of each group, then join it back onto the first rows to get both the start and end date?
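# group consecutive rows that share the same description
g = df.groupby((df['description'] != df['description'].shift()).cumsum())
# first row of each run supplies description, event_type and the start date
first_df = g.first()
first_df.index = first_df.index.set_names(['id'])
# last start date of each run becomes the end date
# (renaming with a dict inside .agg() raises SpecificationError on pandas >= 1.0)
last_df = g['startdate'].last().rename('end date')
last_df.index = last_df.index.set_names(['id'])
first_df.merge(last_df, left_index=True, right_index=True)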
    description                     event_type  startdate   end date
id
1   Training                        *Required   2020-06-06  2020-06-06
2   New Staff on duty               *Required   2020-06-12  2020-06-12
3   Orientation                     *Required   2020-06-12  2020-06-12
4   Group 1 Closed Session          *Required   2020-06-12  2020-06-16
5   All Staff on duty               *Required   2020-06-19  2020-06-19
6   Group 1 Closed Session          *Required   2020-06-19  2020-06-22
7   Consumer outreach orientation   *Required   2020-06-25  2020-06-25
8   Some event on just another day  *Required   2020-06-25  2020-06-25
9   All Staff Meeting               *Required   2020-06-28  2020-06-28
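For reference, the first and last values can also be collected in a single named aggregation. The sketch below is one possible variant, not the method from either answer: it uses the question's column names (`start_date`, `end_date`), a shortened copy of the sample rows, and two extras that are assumptions on my part. It starts a new run when the date is not exactly one day after the previous row, and it leaves `end_date` empty for single-day events, as in the desired output.

```python
import pandas as pd

# Shortened copy of the question's sample rows
df = pd.DataFrame({
    "description": [
        "Training", "Orientation",
        "Group 1 Closed Session", "Group 1 Closed Session", "Group 1 Closed Session",
        "All Staff on duty",
        "Group 1 Closed Session", "Group 1 Closed Session",
    ],
    "event_type": "*Required",
    "start_date": [
        "6/06/2020", "6/12/2020",
        "6/12/2020", "6/13/2020", "6/14/2020",
        "6/19/2020",
        "6/19/2020", "6/20/2020",
    ],
})
df["start_date"] = pd.to_datetime(df["start_date"], format="%m/%d/%Y")

# A new run starts when the description changes OR the date is not
# exactly one day after the previous row (the date check is an assumption)
new_run = (df["description"] != df["description"].shift()) | (
    df["start_date"].diff() != pd.Timedelta(days=1)
)
run_id = new_run.cumsum()

# One pass: first values for the descriptive columns, first/last date per run
out = (
    df.groupby(run_id)
    .agg(
        description=("description", "first"),
        event_type=("event_type", "first"),
        start_date=("start_date", "first"),
        end_date=("start_date", "last"),
    )
    .reset_index(drop=True)
)

# Blank out end_date (NaT) where the event lasts a single day
out["end_date"] = out["end_date"].where(out["end_date"] != out["start_date"])

print(out)
```

If single-day events should repeat the start date in `end_date` instead, the final `where` line can simply be dropped. When writing the result out, `DataFrame.to_csv` takes a `date_format` argument that may help match the upload format.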