Pandas - Create a new column from the mean of part of another column

Question

I have a large DataFrame with a full datetime index and two columns of temperature readings, one row per minute (sorry, I don't know how to write code that builds a DataFrame with a time index):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[210, 211], [212, 215], [212, 215], [214, 214]]),
                columns=['t1', 't2'])
                        t1   t2   
2015-01-01 00:00:00     210  211       
2015-01-01 00:01:00     212  215       
2015-01-01 00:02:00     212  215
... 
2015-01-01 01:05:00     240  232
2015-01-01 01:06:00     206  209
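
Since the question mentions not knowing how to build a frame with a time index, here is a minimal sketch of one way to do it with pd.date_range. The random values, the seed, and the 126-row length are assumptions chosen only for illustration; they are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one reading per minute from 00:00 to 02:05 (126 rows).
index = pd.date_range("2015-01-01 00:00:00", periods=126, freq="min")

# Random integer "temperatures", seeded so the frame is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(200, 250, size=(len(index), 2)),
    columns=["t1", "t2"],
    index=index,
)

print(df.head(3))
```

Answer 1 below expects the index to be named datetime_col, which can be set with df.index.name = "datetime_col"; Answer 2 expects the index to be unnamed, as in this sketch.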

I have to create two new columns, t1_mean and t2_mean, which contain:

t1_mean - the mean over the first 30 minutes of an hour that starts at minute 6 (from 2015-01-01 00:06:00 to 2015-01-01 00:35:00, for example)
t2_mean - the mean over the last 30 minutes of an hour that starts at minute 6 (from 2015-01-01 00:36:00 to 2015-01-01 01:05:00, for example)

These values have to be placed in the last row of the hour that starts at minute 6 (2015-01-01 01:05:00, for example).

It should look like this:

                         t1   t2  t1_mean  t2_mean
2015-01-01 00:00:00     210  211      NaN      NaN
2015-01-01 00:01:00     212  215      NaN      NaN
2015-01-01 00:02:00     212  215      NaN      NaN
...
2015-01-01 01:05:00     240  232      220      228
2015-01-01 01:06:00     206  209      NaN      NaN
...
2015-01-01 02:05:00     245  234      221      235
...

How can I solve this task?

Thanks in advance for any replies.

Answer 1

Well, this code assumes that you have a DataFrame df with a datetime index named datetime_col and two columns, t1 and t2:

mean_1 = {}
mean_2 = {}

for hour in range(24):
    # If you have performance issues, this loop could be replaced with numpy arrays
    # Zero-padded hour strings; the hour after 23 wraps around to '00'
    i = f'{hour:02d}'
    j = f'{(hour + 1) % 24:02d}'
    
    row_first = df.between_time(f'{i}:06:00',f'{i}:35:00').reset_index().resample('D', on='datetime_col').mean().reset_index()
    row_last = df.between_time(f'{i}:36:00',f'{j}:05:00').reset_index().resample('D', on='datetime_col').mean().reset_index()
    
    # Only proceed when both windows actually contain rows
    if len(row_first) != 0 and len(row_last) != 0:
        # By default, pandas mean() returns a float with many decimal places;
        # apply round() or int() if you want a shorter value.
        # Note: [0] reads only the first daily bin, so only the first day is used.
        if j == '00':
            mean_1[str((row_first.datetime_col[0].date() + pd.DateOffset(1)).date()) +  f' {j}:05:00'] = [row_first.t1[0]] # [round(row_first.t1[0],1)]
            mean_2[str((row_last.datetime_col[0].date() + pd.DateOffset(1)).date()) +  f' {j}:05:00'] = [row_last.t2[0]] # [round(row_last.t2[0],1)]
        else:
            mean_1[str(row_first.datetime_col[0].date()) +  f' {j}:05:00'] = [row_first.t1[0]]  # [round(row_first.t1[0],1)]
            mean_2[str(row_last.datetime_col[0].date()) +  f' {j}:05:00'] = [row_last.t2[0]]   # [round(row_last.t2[0],1)]
            

df_mean1 = pd.DataFrame.from_dict(mean_1, orient='index', columns=['mean_1']).reset_index().rename(columns={'index':'datetime_col'})
df_mean2 = pd.DataFrame.from_dict(mean_2, orient='index', columns=['mean_2']).reset_index().rename(columns={'index':'datetime_col'})

df_mean1['datetime_col'] = pd.to_datetime(df_mean1['datetime_col'])
df_mean2['datetime_col'] = pd.to_datetime(df_mean2['datetime_col'])

df = df.merge(df_mean1, on = 'datetime_col', how='left')
df = df.merge(df_mean2, on = 'datetime_col', how='left')
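
For data that spans several days, the same result can probably be reached without looping over hours. The following is only a sketch of an alternative, not the method used above: it shifts the timestamps back by 6 minutes so that each window (hh:06 to hh+1:05) falls inside one clock hour, then groups on the window's last minute. The sample data, the helper name half_mean, and the rule that a half window must contain exactly 30 readings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one reading per minute from 00:00 to 02:10.
index = pd.date_range("2015-01-01 00:00:00", "2015-01-01 02:10:00", freq="min")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(200, 250, size=(len(index), 2)),
                  columns=["t1", "t2"], index=index)

# Shift back 6 minutes so a window hh:06 .. hh+1:05 maps onto clock hour hh.
shifted = df.index - pd.Timedelta(minutes=6)

work = df.assign(
    # Timestamp of the last row of the window each reading belongs to (hh+1:05).
    window_end=shifted.floor("h") + pd.Timedelta(hours=1, minutes=5),
    # True for hh:06 .. hh:35, False for hh:36 .. hh+1:05.
    first_half=shifted.minute < 30,
)


def half_mean(frame, column, size=30):
    """Mean of `column` per window, keeping only windows with `size` readings."""
    stats = frame.groupby("window_end")[column].agg(["mean", "count"])
    return stats.loc[stats["count"] == size, "mean"]


t1_mean = half_mean(work[work["first_half"]], "t1")
t2_mean = half_mean(work[~work["first_half"]], "t2")

# Index alignment puts each mean on the window's last row; other rows stay NaN.
result = df.assign(t1_mean=t1_mean, t2_mean=t2_mean)
print(result.loc["2015-01-01 01:03:00":"2015-01-01 01:07:00"])
```

Because the grouping key is a full timestamp rather than a time of day, windows that cross midnight and windows on later days are handled in the same pass. Dropping the count check would also report means for incomplete windows, such as the first six rows of the data.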

Answer 2

Processing flow:

Add minute and hour data derived from the date.
Shift the hour column by 6 rows.
Add an aggregation flag.
Calculate the averages.
Merge with the original DataFrame.

P.S. There are four possible averages (t1 and t2 for each half-hour window), so four columns are added.

df1 = df.copy()
df1['minute'] = df.index.minute
df1['hour'] = df.index.strftime('%Y-%m-%d %H:05:00')
df1['hour'] = df1['hour'].shift(6)
df1['flg'] = df1['minute'].apply(lambda x: 0 if 6 <= x <= 35 else 1 )
df1 = df1.groupby(['hour','flg'])[['t1','t2']].mean()
df1 = df1.unstack(level=1)
df1.columns = [f'{a}_{b}' for a,b in df1.columns]
df1.reset_index(col_level=1,inplace=True)
df1['hour'] = pd.to_datetime(df1['hour'])
df.reset_index(inplace=True)
new_df = df.merge(df1, left_on=df['index'], right_on=df1['hour'], how='outer')
new_df.drop(['key_0','hour'], inplace=True ,axis=1)
new_df.head(10)
                index   t1   t2   t1_0   t1_1        t2_0        t2_1
0 2015-01-01 00:00:00  220  212    NaN    NaN         NaN         NaN
1 2015-01-01 00:01:00  244  223    NaN    NaN         NaN         NaN
2 2015-01-01 00:02:00  246  241    NaN    NaN         NaN         NaN
3 2015-01-01 00:03:00  242  241    NaN    NaN         NaN         NaN
4 2015-01-01 00:04:00  233  247    NaN    NaN         NaN         NaN
5 2015-01-01 00:05:00  239  208  222.9  224.4  227.733333  223.266667
6 2015-01-01 00:06:00  212  249    NaN    NaN         NaN         NaN
7 2015-01-01 00:07:00  201  237    NaN    NaN         NaN         NaN
8 2015-01-01 00:08:00  238  217    NaN    NaN         NaN         NaN
9 2015-01-01 00:09:00  218  244    NaN    NaN         NaN         NaN
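
In the output above the averages sit on the hh:05 row just before each window, and all four flag combinations are kept. The question asks for only t1_mean and t2_mean, on the window's last row. The sketch below is one possible variation on the same flag-and-unstack idea that does this; the sample data and the names keys, window_end, and wanted are assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one reading per minute for three hours.
index = pd.date_range("2015-01-01 00:00:00", periods=180, freq="min")
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(200, 250, size=(len(index), 2)),
                  columns=["t1", "t2"], index=index)

keys = pd.DataFrame(index=df.index)
# Label every row with the last minute (hh+1:05) of the window it belongs to.
keys["window_end"] = ((df.index - pd.Timedelta(minutes=6)).floor("h")
                      + pd.Timedelta(hours=1, minutes=5))
# Same flag as above: 0 for minutes 6..35, 1 for the other half of the window.
keys["flg"] = np.where((df.index.minute >= 6) & (df.index.minute <= 35), 0, 1)

# One row per window, one column per (temperature column, flag) pair.
means = df.groupby([keys["window_end"], keys["flg"]])[["t1", "t2"]].mean()
means = means.unstack("flg")
means.columns = [f"{a}_{b}" for a, b in means.columns]

# Keep only the two requested averages and give them the requested names.
wanted = means[["t1_0", "t2_1"]].rename(
    columns={"t1_0": "t1_mean", "t2_1": "t2_mean"})

# Joining on the index puts the values on the last row of each window.
new_df = df.join(wanted)
print(new_df.loc["2015-01-01 01:04:00":"2015-01-01 01:06:00"])
```

Unlike a version that checks the number of readings, this sketch still reports a t2_mean for the incomplete window at the very start of the data (the rows from 00:00 to 00:05).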
