
Transfer parameters of two dfs into a new one with pandas

I have two dataframes that both refer to the same events (identified by id). The first, df, is a time series that records the course of each event at a fixed resolution over a few months (only an excerpt is shown here). The second, df_event, summarizes the parameters of each event.

Simplified data: df (the original df has many more rows)

import pandas as pd

df = pd.DataFrame({'id':[1,1,1,2,2,2,2],
               'date':['2020-01-01 12:00:00','2020-01-01 12:00:00','2020-01-01 12:00:00','2020-01-05 15:00:00','2020-01-05 15:00:00',
                      '2020-01-05 15:00:00','2020-01-05 15:00:00'],
               'numb':[1,5,8,0,4,11,25]},
             index=pd.date_range(start = "2020-01-01 12:00", periods = 7, freq = '1H'))

df['date'] = pd.to_datetime(df['date'])

Output:

                    id                 date numb
2020-01-01 12:00:00 1   2020-01-01 12:00:00 1
2020-01-01 13:00:00 1   2020-01-01 12:00:00 5
2020-01-01 14:00:00 1   2020-01-01 12:00:00 8
2020-01-01 15:00:00 2   2020-01-05 15:00:00 0
2020-01-01 16:00:00 2   2020-01-05 15:00:00 4
2020-01-01 17:00:00 2   2020-01-05 15:00:00 11
2020-01-01 18:00:00 2   2020-01-05 15:00:00 25

df_event:

df_event = pd.DataFrame({'id':[1,2,3,4,5],
                         'date':['2020-01-01 12:00:00','2020-01-01 15:00:00','2020-01-08 07:00:00','2020-01-15 13:00:00','2020-01-22 12:00:00'],
                         'numb_total':[8,25,11,14,8],
                         'timedelta': [55,60,45,15,30]})

df_event = df_event.set_index('id')
df_event['date'] = pd.to_datetime(df_event['date'])
df_event['timedelta'] = pd.to_timedelta(df_event['timedelta'], unit='min')  # 'T' is a deprecated alias for minutes

Output:

                   date numb_total  timedelta
id          
1   2020-01-01 12:00:00          8   00:55:00
2   2020-01-01 15:00:00         25   01:00:00
3   2020-01-08 07:00:00         11   00:45:00
4   2020-01-15 13:00:00         14   00:15:00
5   2020-01-22 12:00:00          8   00:30:00

Now I want to link the two dfs so that I get a day/week profile. The resulting df should be sorted by day and hour and show the average values of numb and timedelta for each time slot.

The week profile should show the average numb and timedelta (from df_event) for each moment, where a moment is a weekday plus a time of day. The minimum and maximum at each moment would also be interesting.

For example, create a new df (df_week) like this:

df['day'] = df['date'].dt.day_name()
df['time'] = df['date'].dt.time   
df_event = df.groupby(['day', 'time'])...

and then add the data from df_event to get something like this:

                       timedelta  numb_total
day             time    
Monday      00:00:00    00:00:00          0
Monday      01:00:00    00:00:00          0 
...
Wednesday   11:00:00    00:00:00          0
Wednesday   12:00:00    00:55:00          8
...
Sunday      14:00:00    00:00:00          0
Sunday      15:00:00    01:00:00         25
Sunday      16:00:00    00:00:00          0
...
Sunday      23:00:00    00:00:00          0
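
The target layout above has one row for every weekday and hour, with 0 where no event falls. A minimal sketch of how such a skeleton might be built, assuming hourly slots and the df_event data exactly as posted (so event 2 lands on Wednesday 15:00, not Sunday). The names `day_order`, `full_index` and `profile` are illustrative and not from the original post:

```python
import pandas as pd

# Event summary as posted in the question; timedelta is kept in minutes for now.
df_event = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'date': pd.to_datetime(['2020-01-01 12:00:00', '2020-01-01 15:00:00',
                            '2020-01-08 07:00:00', '2020-01-15 13:00:00',
                            '2020-01-22 12:00:00']),
    'numb_total': [8, 25, 11, 14, 8],
    'timedelta': [55, 60, 45, 15, 30],
}).set_index('id')

# Weekday name and time of day (as a string) identify one slot of the week.
df_event['day'] = df_event['date'].dt.day_name()
df_event['time'] = df_event['date'].dt.strftime('%H:%M:%S')

# Average per slot; only slots that contain at least one event appear here.
profile = df_event.groupby(['day', 'time'])[['timedelta', 'numb_total']].mean()

# Full grid of 7 days x 24 hours, in calendar order (assumed hourly resolution).
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
hours = [f'{h:02d}:00:00' for h in range(24)]
full_index = pd.MultiIndex.from_product([day_order, hours], names=['day', 'time'])

# Slots without events are filled with 0, then minutes become Timedelta values.
profile = profile.reindex(full_index, fill_value=0)
profile['timedelta'] = pd.to_timedelta(profile['timedelta'], unit='min')

print(profile.loc['Wednesday'].iloc[11:16])
```

Reindexing against an explicit product of days and hours also fixes the row order, which a plain groupby would otherwise sort alphabetically by day name.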

#What is the relationship between the index and the date column in df? Both hold datetimes. Which of them relates to the date in df_event?

Happy to review after you clarify.

#Generate a key column in each dataframe by extracting the hour. Merge the two dataframes on key and date. Drop the columns that are not required

df2=pd.merge(df.assign(key=df.index.hour),df_event.assign(key=df_event.set_index('date')\
.index.hour),on=['key','date'],how='right').dropna().drop_duplicates(keep='last')[['date','numb_total','timedelta']]


#Extract time and  day_name 


df2['time']=df2.date.dt.strftime('%H:%M:%S')
df2['day']=df2.date.dt.day_name()



                 date  numb_total timedelta      time        day
0 2020-01-01 12:00:00           8  00:55:00  12:00:00  Wednesday

IIUC, first aggregate both DataFrames and then join them:

df_event = df_event.set_index('id')
df_event['date'] = pd.to_datetime(df_event['date'])

df_event['day'] = df_event['date'].dt.day_name()
df_event['time'] = df_event['date'].dt.time   
df_event1 = df_event.groupby(['day', 'time'])[['timedelta', 'numb_total']].mean()
print (df_event1)
                    timedelta  numb_total
day       time                           
Wednesday 07:00:00       45.0        11.0
          12:00:00       42.5         8.0
          13:00:00       15.0        14.0
          15:00:00       60.0        25.0
          
df['day'] = df['date'].dt.day_name()
df['time'] = df['date'].dt.time   
df_event2 = df.groupby(['day', 'time'])['numb'].mean()
print (df_event2)
day        time    
Sunday     15:00:00    10.000000
Wednesday  12:00:00     4.666667
Name: numb, dtype: float64

df = df_event1.join(df_event2, how='inner' )
df['timedelta'] = pd.to_timedelta(df['timedelta'], unit='min')  # 'T' is a deprecated alias for minutes
print (df)
                         timedelta  numb_total      numb
day       time                                          
Wednesday 12:00:00 0 days 00:42:30         8.0  4.666667
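
The question also asks for the minimum and maximum at each moment. A hedged sketch of how the same grouping could report all three statistics with `agg`, assuming the df_event data as posted and timedelta kept in minutes (the names `stats` and `row` are illustrative and not from the original answer):

```python
from datetime import time

import pandas as pd

# Event summary as posted in the question; timedelta is in minutes.
df_event = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'date': pd.to_datetime(['2020-01-01 12:00:00', '2020-01-01 15:00:00',
                            '2020-01-08 07:00:00', '2020-01-15 13:00:00',
                            '2020-01-22 12:00:00']),
    'numb_total': [8, 25, 11, 14, 8],
    'timedelta': [55, 60, 45, 15, 30],
}).set_index('id')

df_event['day'] = df_event['date'].dt.day_name()
df_event['time'] = df_event['date'].dt.time

# One row per (day, time); columns become a two-level index such as
# ('timedelta', 'mean'), ('timedelta', 'min'), ('timedelta', 'max'), ...
stats = (df_event.groupby(['day', 'time'])[['timedelta', 'numb_total']]
                 .agg(['mean', 'min', 'max']))

# Two events (ids 1 and 5) share the Wednesday 12:00 slot.
row = stats.loc[('Wednesday', time(12, 0)), :]
print(row)
```

Only the Wednesday 12:00 slot holds more than one event in this sample, so it is the only row where mean, min and max differ.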

