
Combining Pandas Data Frame Rows and Preserving Data in Separate Columns

I am attempting to extract data from a Google Spreadsheet that is formatted to look like a calendar in order to reformat the data to be batch-uploaded to an information management system we use at work. The final CSV has to have very specific formatting, and I am one step away from a final product.

My current Data Frame looks something like this:

description                     event_type start_date end_date
Training                        *Required  6/06/2020         
New Staff on duty               *Required  6/12/2020         
Orientation                     *Required  6/12/2020         
Group 1 Closed Session          *Required  6/12/2020         
Group 1 Closed Session          *Required  6/13/2020         
Group 1 Closed Session          *Required  6/14/2020         
Group 1 Closed Session          *Required  6/15/2020         
Group 1 Closed Session          *Required  6/16/2020         
All Staff on duty               *Required  6/19/2020         
Group 1 Closed Session          *Required  6/19/2020         
Group 1 Closed Session          *Required  6/20/2020         
Group 1 Closed Session          *Required  6/21/2020         
Group 1 Closed Session          *Required  6/22/2020         
Consumer outreach orientation   *Required  6/25/2020         
Some event on just another day  *Required  6/25/2020         
All Staff Meeting               *Required  6/28/2020    

(The above is only the important slice of the full dataset. I've also changed the content of the data, so I apologize if the descriptions aren't very realistic.)

Rather than have "Group 1 Closed Session" listed once per day over several consecutive days, I need to span those dates with a single row, with the first day in the "start_date" column and the last day in the "end_date" column. I also need to do that separately for each run of "Group 1 Closed Session" rows, as they cover two different date ranges.

This example is what I am trying to achieve:

description                     event_type start_date end_date
Training                        *Required  6/06/2020         
New Staff on duty               *Required  6/12/2020         
Orientation                     *Required  6/12/2020         
Group 1 Closed Session          *Required  6/12/2020  6/16/2020        
All Staff on duty               *Required  6/19/2020         
Group 1 Closed Session          *Required  6/19/2020  6/22/2020               
Consumer outreach orientation   *Required  6/25/2020         
Some event on just another day  *Required  6/25/2020         
All Staff Meeting               *Required  6/28/2020

Also, not all of the consecutively-listed events will have the same description, so I was hoping to find a solution where that does not matter.

Any thoughts or leads? I appreciate any help on this.

Try:

df.groupby((df['description'] != df['description'].shift()).cumsum()).first()

Output:

                               description event_type start_date end_date
description                                                               
1                                  Training  *Required           6/06/2020
2                         New Staff on duty  *Required           6/12/2020
3                               Orientation  *Required           6/12/2020
4                    Group 1 Closed Session  *Required           6/12/2020
5                         All Staff on duty  *Required           6/19/2020
6                    Group 1 Closed Session  *Required           6/19/2020
7             Consumer outreach orientation  *Required           6/25/2020
8            Some event on just another day  *Required           6/25/2020
9                         All Staff Meeting  *Required           6/28/2020
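To see why that grouping key works, here is a minimal sketch on a toy Series (the sample values are invented for illustration and are not the asker's data). Comparing each description with the row above marks the start of every run, and the cumulative sum of those marks gives each run its own id:

```python
import pandas as pd

# Toy data, invented for illustration only.
desc = pd.Series(["Orientation", "Session", "Session", "Session",
                  "All Staff", "Session", "Session"])

# True wherever a row differs from the row above it. The first row is
# always True because shift() leaves a missing value there.
starts_new_run = desc != desc.shift()

# Cumulative sum of the booleans: the counter goes up by one at each new run.
run_id = starts_new_run.cumsum()

print(run_id.tolist())  # prints [1, 2, 2, 2, 3, 4, 4]
```

Note that the two "Session" runs get different ids (2 and 4) because another row sits between them, which is what keeps the two date ranges separate.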

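You can reuse Scott Boston's groupby from the answer above: take the first row of each group, take the last start date of each group as the end date, and merge the two on the group index. (My test frame names the date column 'startdate' and the new column 'end date'.)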

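# Give every run of consecutive rows with the same description its own group id.
g = df.groupby((df['description'] != df['description'].shift()).cumsum())
first_df = g.first()  # first row of each run
first_df.index = first_df.index.set_names(['id'])
# Passing a renaming dict to agg on a single column fails on pandas >= 1.0,
# so take the last value and rename the resulting Series instead.
last_df = g['startdate'].last().rename('end date')  # last date of each run
last_df.index = last_df.index.set_names(['id'])
first_df.merge(last_df, left_index=True, right_index=True)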


Output:

id  description                     event_type  startdate   end date
1   Training                        *Required   2020-06-06  2020-06-06
2   New Staff on duty               *Required   2020-06-12  2020-06-12
3   Orientation                     *Required   2020-06-12  2020-06-12
4   Group 1 Closed Session          *Required   2020-06-12  2020-06-16
5   All Staff on duty               *Required   2020-06-19  2020-06-19
6   Group 1 Closed Session          *Required   2020-06-19  2020-06-22
7   Consumer outreach orientation   *Required   2020-06-25  2020-06-25
8   Some event on just another day  *Required   2020-06-25  2020-06-25
9   All Staff Meeting               *Required   2020-06-28  2020-06-28
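The result above fills in an end date for every row, whereas the question asks for it to stay blank on single-day events. One possible way to get there in a single pass is sketched below. It assumes the column names from the question (description, event_type, start_date, end_date), dates stored as M/DD/YYYY strings, and pandas 0.25 or later for named aggregation. The helper name `collapse_runs` is hypothetical, and the rule that a date gap also ends a run is an added assumption, not something either answer states.

```python
import pandas as pd

def collapse_runs(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse consecutive same-description rows into one row per run."""
    dates = pd.to_datetime(df["start_date"], format="%m/%d/%Y")
    new_desc = df["description"] != df["description"].shift()
    # Assumption: a row that is not exactly one day after the previous row
    # also starts a new run, even if the description is unchanged.
    gap = dates.diff() != pd.Timedelta(days=1)
    run_id = (new_desc | gap).cumsum()

    out = df.groupby(run_id, sort=False).agg(
        description=("description", "first"),
        event_type=("event_type", "first"),
        start_date=("start_date", "first"),
        end_date=("start_date", "last"),
    )
    # Single-day runs keep an empty end_date, as in the desired output.
    out.loc[out["start_date"] == out["end_date"], "end_date"] = ""
    return out.reset_index(drop=True)

# Small sample in the question's layout (values shortened for illustration).
df = pd.DataFrame({
    "description": ["Orientation", "Group 1 Closed Session", "Group 1 Closed Session",
                    "All Staff on duty", "Group 1 Closed Session", "Group 1 Closed Session"],
    "event_type": ["*Required"] * 6,
    "start_date": ["6/12/2020", "6/12/2020", "6/13/2020",
                   "6/19/2020", "6/19/2020", "6/20/2020"],
    "end_date": [""] * 6,
})
result = collapse_runs(df)
print(result.to_string(index=False))
```

Because the grouping only looks at whether neighbouring rows match, it does not depend on any particular description text, so other multi-day events should collapse the same way.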
