Pandas - make new column with mean values of part of another column

I have a big DataFrame with a full datetime as the index and two columns holding a temperature reading for every minute (sorry, I don't know how to write code that builds a DataFrame with a time index):

import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[210, 211], [212, 215], [212, 215], [214, 214]]),
                  columns=['t1', 't2'])
                        t1   t2   
2015-01-01 00:00:00     210  211       
2015-01-01 00:01:00     212  215       
2015-01-01 00:02:00     212  215
... 
2015-01-01 01:05:00     240  232
2015-01-01 01:06:00     206  209
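
For anyone who wants to reproduce the layout above, here is a minimal sketch of a frame with a per-minute datetime index. The random values, the seed, the 127-row length and the index name datetime_col are all assumptions made for illustration; they are not the asker's real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 127 one-minute readings starting at midnight
idx = pd.date_range("2015-01-01 00:00:00", periods=127, freq="min",
                    name="datetime_col")
rng = np.random.default_rng(0)  # seeded so the sample is reproducible
df = pd.DataFrame(
    {
        "t1": rng.integers(200, 250, size=len(idx)),
        "t2": rng.integers(200, 250, size=len(idx)),
    },
    index=idx,
)
print(df.head(3))
```

Because the index is a DatetimeIndex, time-based tools such as between_time, resample and label slicing by timestamp become available on df.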

I have to create two new columns, t1_mean and t2_mean, which contain:

  1. t1_mean - the mean over the first 30 minutes of an hour that starts at minute 6 (from 2015-01-01 00:06:00 to 2015-01-01 00:35:00, for example)
  2. t2_mean - the mean over the last 30 minutes of an hour that starts at minute 6 (from 2015-01-01 00:36:00 to 2015-01-01 01:05:00, for example). These values have to be in the last row of the hour that starts at minute 6 (2015-01-01 01:05:00, for example)

It should look like this:

                        t1   t2   t1_mean  t2_mean
2015-01-01 00:00:00     210  211  NaN      NaN
2015-01-01 00:01:00     212  215  NaN      NaN
2015-01-01 00:02:00     212  215  NaN      NaN
...
2015-01-01 01:05:00     240  232  220      228
2015-01-01 01:06:00     206  209  NaN      NaN
...
2015-01-01 02:05:00     245  234  221      235
...

How can I solve this task?

Thanks in advance for any replies.

Well, this code assumes that you have a DataFrame df with a datetime index named datetime_col and two columns, t1 and t2:

mean_1 = {}
mean_2 = {}

for hour in range(24):
    # If performance is an issue, this loop could be vectorized with NumPy arrays
    i = f'{hour:02d}'              # start hour, zero-padded ('00'..'23')
    j = f'{(hour + 1) % 24:02d}'   # following hour; wraps to '00' after 23
    
    row_first = df.between_time(f'{i}:06:00',f'{i}:35:00').reset_index().resample('D', on='datetime_col').mean().reset_index()
    row_last = df.between_time(f'{i}:36:00',f'{j}:05:00').reset_index().resample('D', on='datetime_col').mean().reset_index()
    
    # Make sure both windows actually contain rows
    if len(row_first) != 0 and len(row_last) != 0:
        # By default, the pandas mean is a float with many decimal places,
        # so you can apply round() or int() if you prefer.
        # Note: index [0] takes only the first day returned by resample('D').
        if j == '00':
            mean_1[str((row_first.datetime_col[0].date() + pd.DateOffset(1)).date()) +  f' {j}:05:00'] = [row_first.t1[0]] # [round(row_first.t1[0],1)]
            mean_2[str((row_last.datetime_col[0].date() + pd.DateOffset(1)).date()) +  f' {j}:05:00'] = [row_last.t2[0]] # [round(row_last.t2[0],1)]
        else:
            mean_1[str(row_first.datetime_col[0].date()) +  f' {j}:05:00'] = [row_first.t1[0]]  # [round(row_first.t1[0],1)]
            mean_2[str(row_last.datetime_col[0].date()) +  f' {j}:05:00'] = [row_last.t2[0]]   # [round(row_last.t2[0],1)]
            

df_mean1 = pd.DataFrame.from_dict(mean_1, orient='index', columns=['mean_1']).reset_index().rename(columns={'index':'datetime_col'})
df_mean2 = pd.DataFrame.from_dict(mean_2, orient='index', columns=['mean_2']).reset_index().rename(columns={'index':'datetime_col'})

df_mean1['datetime_col'] = pd.to_datetime(df_mean1['datetime_col'])
df_mean2['datetime_col'] = pd.to_datetime(df_mean2['datetime_col'])

df = df.merge(df_mean1, on = 'datetime_col', how='left')
df = df.merge(df_mean2, on = 'datetime_col', how='left')
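
As a possible alternative to the hour-by-hour loop, the same windows can be computed in one pass by shifting every timestamp back 6 minutes and grouping on the resulting hour. This is only a sketch, not the answerer's method: the names add_half_hour_means and min_rows are made up for illustration, and it assumes one row per minute with a unique DatetimeIndex.

```python
import numpy as np
import pandas as pd

def add_half_hour_means(df, min_rows=30):
    """Put t1_mean / t2_mean on the HH:05 row that closes each 06-to-05 block.

    Both the function name and `min_rows` are illustrative assumptions.
    """
    # 00:06 -> 00:00 and 01:05 -> 00:59, so each block fits one clock hour
    shifted = df.index - pd.Timedelta(minutes=6)
    # Timestamp of the last row of the block (e.g. 01:05 for 00:06..01:05)
    block_end = shifted.floor("h") + pd.Timedelta(minutes=65)
    # True for minutes 06..35 of the block, False for 36..05
    first_half = np.asarray(shifted.minute < 30)

    out = df.copy()
    for col, mask in (("t1", first_half), ("t2", ~first_half)):
        grouped = df.loc[mask, col].groupby(block_end[mask])
        # Keep a mean only when the half-hour window is complete
        means = grouped.mean().where(grouped.count() >= min_rows)
        out[f"{col}_mean"] = means.reindex(out.index)
    return out

# Small demo on made-up data: t1 = 0, 1, 2, ... and t2 = 2 * t1
idx = pd.date_range("2015-01-01 00:00:00", periods=127, freq="min")
demo = pd.DataFrame({"t1": np.arange(127.0), "t2": 2 * np.arange(127.0)},
                    index=idx)
out = add_half_hour_means(demo)
print(out.loc["2015-01-01 01:03:00":"2015-01-01 01:07:00"])
```

Unlike the loop, this handles any number of days, and windows with fewer than min_rows readings (for example the partial one at the very start of the data) are left as NaN.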

Processing flow:

  1. Add minute and hour data derived from the date.
  2. Shift the time column by 6 rows.
  3. Add an aggregation flag.
  4. Calculate the averages.
  5. Merge with the original DataFrame.

P.S. There can be four averages, so there will be four columns.

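df1 = df.copy()
df1['minute'] = df.index.minute
# Label every row with HH:05:00 of its own clock hour ...
df1['hour'] = df.index.strftime('%Y-%m-%d %H:05:00')
# ... then shift the labels down 6 rows so each 06-to-05 block shares one label
# (this relies on one row per minute with no gaps)
df1['hour'] = df1['hour'].shift(6)
# flg = 0 for minutes 06..35 (first half), 1 for minutes 36..05 (second half)
df1['flg'] = df1['minute'].apply(lambda x: 0 if 6 <= x <= 35 else 1 )
df1 = df1.groupby(['hour','flg'])[['t1','t2']].mean()
# Pivot the flag into columns: t1_0, t1_1, t2_0, t2_1
df1 = df1.unstack(level=1)
df1.columns = [f'{a}_{b}' for a,b in df1.columns]
df1.reset_index(col_level=1,inplace=True)
df1['hour'] = pd.to_datetime(df1['hour'])
# Merge the block means back onto the original rows
df.reset_index(inplace=True)
new_df = df.merge(df1, left_on=df['index'], right_on=df1['hour'], how='outer')
new_df.drop(['key_0','hour'], inplace=True ,axis=1)
new_df.head(10)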
   index                t1   t2   t1_0   t1_1   t2_0        t2_1
0  2015-01-01 00:00:00  220  212  NaN    NaN    NaN         NaN
1  2015-01-01 00:01:00  244  223  NaN    NaN    NaN         NaN
2  2015-01-01 00:02:00  246  241  NaN    NaN    NaN         NaN
3  2015-01-01 00:03:00  242  241  NaN    NaN    NaN         NaN
4  2015-01-01 00:04:00  233  247  NaN    NaN    NaN         NaN
5  2015-01-01 00:05:00  239  208  222.9  224.4  227.733333  223.266667
6  2015-01-01 00:06:00  212  249  NaN    NaN    NaN         NaN
7  2015-01-01 00:07:00  201  237  NaN    NaN    NaN         NaN
8  2015-01-01 00:08:00  238  217  NaN    NaN    NaN         NaN
9  2015-01-01 00:09:00  218  244  NaN    NaN    NaN         NaN
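
One way to sanity-check merged values like these is to compute a single window directly by slicing on the index. The sketch below uses made-up random data (the seed and the 130-row length are assumptions), so the numbers will not match the table above; the dictionary keys simply mirror the column names t1_0, t1_1, t2_0 and t2_1.

```python
import numpy as np
import pandas as pd

# Hypothetical per-minute data covering a bit more than two hours
idx = pd.date_range("2015-01-01 00:00:00", periods=130, freq="min")
rng = np.random.default_rng(1)
df = pd.DataFrame(
    {
        "t1": rng.integers(200, 250, size=len(idx)),
        "t2": rng.integers(200, 250, size=len(idx)),
    },
    index=idx,
)

# Label slices on a DatetimeIndex include both endpoints
first_half = df.loc["2015-01-01 00:06:00":"2015-01-01 00:35:00"]
second_half = df.loc["2015-01-01 00:36:00":"2015-01-01 01:05:00"]

# Suffix 0 = first half of the block, suffix 1 = second half
check = {
    "t1_0": first_half["t1"].mean(),
    "t1_1": second_half["t1"].mean(),
    "t2_0": first_half["t2"].mean(),
    "t2_1": second_half["t2"].mean(),
}
print(check)
```

Of the four columns, t1_0 and t2_1 are the ones that correspond to the t1_mean and t2_mean asked for in the question. Note also that in the output above the values appear on the 00:05:00 row, which looks like the start of the 00:06 to 01:05 block rather than its last row; if that reading is right, adding pd.Timedelta(hours=1) to the parsed hour column before the merge should move them to 01:05:00.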
