
Pandas - make new column with mean values of the part of another column

I have a big DataFrame with a full datetime index and 2 columns holding a temperature reading for every minute (I don't know how to write code for a dataframe with a time index, sorry):

df = pd.DataFrame(np.array([[210, 211], [212, 215], [212, 215], [214, 214]]),
                columns=['t1', 't2'])
                        t1   t2   
2015-01-01 00:00:00     210  211       
2015-01-01 00:01:00     212  215       
2015-01-01 00:02:00     212  215
... 
2015-01-01 01:05:00     240  232
2015-01-01 01:06:00     206  209
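For reference, a frame of this shape with a real time index could be built roughly as below. This is only a sketch: the values are random placeholders rather than the real temperatures, and the two-day length and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Assumed sample: two days of per-minute rows starting at 2015-01-01 00:00:00.
idx = pd.date_range("2015-01-01 00:00:00", periods=2 * 24 * 60, freq="min")

# Placeholder readings in the 200-249 range (not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "t1": rng.integers(200, 250, size=len(idx)),
        "t2": rng.integers(200, 250, size=len(idx)),
    },
    index=idx,
)

print(df.head(3))
```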

I have to make two new columns, t1_mean and t2_mean, which contain:

  1. t1_mean - the mean over the first 30 minutes of an hour-long window that begins at minute 6 (from 2015-01-01 00:06:00 to 2015-01-01 00:35:00, for example)
  2. t2_mean - the mean over the last 30 minutes of that same window (from 2015-01-01 00:36:00 to 2015-01-01 01:05:00, for example). These values have to be in the last row of the window (2015-01-01 01:05:00, for example)
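To make the two windows concrete, the means for the first window can be worked out by hand with label slicing. This is a sketch on made-up data (t1 counts up, t2 counts down, both my own choice), and it assumes t1_mean is taken from column t1 and t2_mean from column t2.

```python
import numpy as np
import pandas as pd

# Made-up data: three hours of per-minute rows; t1 counts up, t2 counts down.
idx = pd.date_range("2015-01-01 00:00:00", periods=180, freq="min")
df = pd.DataFrame({"t1": np.arange(180), "t2": np.arange(180)[::-1]}, index=idx)

# Label slices on a DatetimeIndex include both end points, so each has 30 rows.
first_half = df.loc["2015-01-01 00:06:00":"2015-01-01 00:35:00", "t1"]
last_half = df.loc["2015-01-01 00:36:00":"2015-01-01 01:05:00", "t2"]

t1_mean = first_half.mean()
t2_mean = last_half.mean()

print(len(first_half), len(last_half))  # 30 30
print(t1_mean, t2_mean)                 # 20.5 128.5
```

Both numbers would then go into the row for 2015-01-01 01:05:00.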

It should look like this:

                        t1   t2   t1_mean  t2_mean
2015-01-01 00:00:00     210  211  NaN      NaN
2015-01-01 00:01:00     212  215  NaN      NaN
2015-01-01 00:02:00     212  215  NaN      NaN
...
2015-01-01 01:05:00     240  232  220      228
2015-01-01 01:06:00     206  209  NaN      NaN
...
2015-01-01 02:05:00     245  234  221      235
...

How can I solve this task?

Thanks in advance for replies

Well, this code assumes that you have a dataframe df with a datetime index named datetime_col and two columns, t1 and t2:

mean_1 = {}
mean_2 = {}

for hour in range(24):
    # Zero-padded hour strings: i is the window's starting hour, j the following hour
    # (hour 23 wraps around to '00').
    # If you have performance issues, this could be vectorised with numpy arrays.
    i = f'{hour:02d}'
    j = f'{(hour + 1) % 24:02d}'
    
    row_first = df.between_time(f'{i}:06:00',f'{i}:35:00').reset_index().resample('D', on='datetime_col').mean().reset_index()
    row_last = df.between_time(f'{i}:36:00',f'{j}:05:00').reset_index().resample('D', on='datetime_col').mean().reset_index()
    
    # Only continue if both time windows actually contain rows
    if len(row_first) != 0 and len(row_last) != 0:
        # By default, pandas mean returns a float with many decimal places,
        # so you can apply round() or int() if you want
        if j == '00':
            # The window ends after midnight, so the result belongs to the next day
            mean_1[str((row_first.datetime_col[0].date() + pd.DateOffset(1)).date()) +  f' {j}:05:00'] = [row_first.t1[0]] # [round(row_first.t1[0],1)]
            mean_2[str((row_last.datetime_col[0].date() + pd.DateOffset(1)).date()) +  f' {j}:05:00'] = [row_last.t2[0]] # [round(row_last.t2[0],1)]
        else:
            mean_1[str(row_first.datetime_col[0].date()) +  f' {j}:05:00'] = [row_first.t1[0]]  # [round(row_first.t1[0],1)]
            mean_2[str(row_last.datetime_col[0].date()) +  f' {j}:05:00'] = [row_last.t2[0]]   # [round(row_last.t2[0],1)]
            

df_mean1 = pd.DataFrame.from_dict(mean_1, orient='index', columns=['mean_1']).reset_index().rename(columns={'index':'datetime_col'})
df_mean2 = pd.DataFrame.from_dict(mean_2, orient='index', columns=['mean_2']).reset_index().rename(columns={'index':'datetime_col'})

df_mean1['datetime_col'] = pd.to_datetime(df_mean1['datetime_col'])
df_mean2['datetime_col'] = pd.to_datetime(df_mean2['datetime_col'])

df = df.merge(df_mean1, on = 'datetime_col', how='left')
df = df.merge(df_mean2, on = 'datetime_col', how='left')
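If the loop turns out to be slow, the same windows can also be computed without it. The following is only a sketch of an alternative, not the code above: the function name add_window_means and the offset_min parameter are my own, and it assumes a unique per-minute DatetimeIndex. It also assumes that only complete windows (all 60 rows present) should get a value. The idea is to shift every timestamp back by 6 minutes so that each window lines up with a clock hour, group by that hour, and write the means into the window's last row.

```python
import numpy as np
import pandas as pd

def add_window_means(df, offset_min=6):
    """Add t1_mean / t2_mean on the last row of each hour-long window starting at minute 6."""
    out = df.copy()

    # Shift timestamps back so a window hh:06 .. (hh+1):05 becomes hh:00 .. hh:59.
    shifted = out.index - pd.Timedelta(minutes=offset_min)
    window = shifted.floor("h")          # one label per window
    first_half = shifted.minute < 30     # True for hh:06 .. hh:35 in original time

    # Mean of t1 over the first half and of t2 over the second half of each window.
    t1_mean = out["t1"].where(first_half).groupby(window).mean()
    t2_mean = out["t2"].where(~first_half).groupby(window).mean()

    # Assumption: only windows with all 60 rows present receive a value.
    size = out["t1"].groupby(window).size()
    complete = size.reindex(window).to_numpy() == 60
    is_last = (shifted.minute == 59) & complete   # original minute 05, end of window

    out["t1_mean"] = np.where(is_last, t1_mean.reindex(window).to_numpy(), np.nan)
    out["t2_mean"] = np.where(is_last, t2_mean.reindex(window).to_numpy(), np.nan)
    return out

# Made-up demo data: three hours of per-minute rows.
demo_idx = pd.date_range("2015-01-01 00:00:00", periods=180, freq="min")
demo = pd.DataFrame({"t1": np.arange(180), "t2": 2 * np.arange(180)}, index=demo_idx)
result = add_window_means(demo)
print(result.dropna())
```

On the demo data only 01:05 and 02:05 get values. The rows before 00:06 and after 02:05 belong to incomplete windows and stay NaN.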

Processing flow:

  1. Add minute and hour columns derived from the datetime index.
  2. Shift the hour column down by 6 rows.
  3. Add an aggregation flag.
  4. Calculate the averages.
  5. Merge with the original DF. P.S. Two flags times two columns gives four averages, so there will be four new columns.

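df1 = df.copy()
# 1. Extract the minute and build an hourly label of the form 'YYYY-MM-DD HH:05:00'
df1['minute'] = df.index.minute
df1['hour'] = df.index.strftime('%Y-%m-%d %H:05:00')
# 2. Shift the label down 6 rows so that rows hh:06 to (hh+1):05 share the label hh:05
#    (this relies on the data having one row per minute with no gaps)
df1['hour'] = df1['hour'].shift(6)
# 3. Flag 0 marks minutes 06-35, flag 1 marks the remaining minutes of the window
df1['flg'] = df1['minute'].apply(lambda x: 0 if 6 <= x <= 35 else 1 )
# 4. Average t1 and t2 per window and flag, then move the flag into the column names
df1 = df1.groupby(['hour','flg'])[['t1','t2']].mean()
df1 = df1.unstack(level=1)
df1.columns = [f'{a}_{b}' for a,b in df1.columns]
df1.reset_index(col_level=1,inplace=True)
df1['hour'] = pd.to_datetime(df1['hour'])
# 5. Merge the averages back onto the original rows by timestamp
df.reset_index(inplace=True)
new_df = df.merge(df1, left_on=df['index'], right_on=df1['hour'], how='outer')
new_df.drop(['key_0','hour'], inplace=True ,axis=1)
new_df.head(10)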
    index                t1   t2   t1_0   t1_1   t2_0        t2_1
0   2015-01-01 00:00:00  220  212  NaN    NaN    NaN         NaN
1   2015-01-01 00:01:00  244  223  NaN    NaN    NaN         NaN
2   2015-01-01 00:02:00  246  241  NaN    NaN    NaN         NaN
3   2015-01-01 00:03:00  242  241  NaN    NaN    NaN         NaN
4   2015-01-01 00:04:00  233  247  NaN    NaN    NaN         NaN
5   2015-01-01 00:05:00  239  208  222.9  224.4  227.733333  223.266667
6   2015-01-01 00:06:00  212  249  NaN    NaN    NaN         NaN
7   2015-01-01 00:07:00  201  237  NaN    NaN    NaN         NaN
8   2015-01-01 00:08:00  238  217  NaN    NaN    NaN         NaN
9   2015-01-01 00:09:00  218  244  NaN    NaN    NaN         NaN
