Code for converting time takes too long

Question

I have a dataframe as follows (reproducible data):

import time

import numpy as np
import pandas as pd

np.random.seed(365)
rows = 17000

data = np.random.uniform(20.25, 23.625, size=(rows, 1))

df = pd.DataFrame(data, columns=['Ta'])

# Set index
Epoch_Start = 1636757999
Epoch_End = 1636844395
# Named epochs rather than time so the time module is not shadowed
epochs = np.arange(Epoch_Start, Epoch_End, 5)
df['Epoch'] = pd.DataFrame(epochs)

df.reset_index(drop=True, inplace=True)
df = df.set_index('Epoch')

                   
Epoch          Ta      
1636757999  23.427413
1636758004  22.415409
1636758009  22.560560
1636758014  22.236397
1636758019  22.085619
              ...
1636842974  21.342487
1636842979  20.863043
1636842984  22.582027
1636842989  20.756926
1636842994  21.255536

[17000 rows x 1 columns]

My expected output is a column with the dates converted from epoch time to datetime (the column 'dates' in the function's return value), for example 2021-11-12 22:59:59.

Here's the code that I'm using:

def obt_dat(path):
    df2=df
    df2['date'] = df.index.values
    df2['date'] = pd.to_datetime(df2['date'],unit='s')
    df2['hour']=''
    df2['fecha']=''
    df2['dates']=''
    
    start = time.time()
    for i in range(0,len(df2)):
        df2['hour'].iloc[i]=df2['date'].iloc[i].hour 
        df2['fecha'].iloc[i]=str(df2['date'].iloc[i].year)+str(df2['date'].iloc[i].month)+str(df2['date'].iloc[i].day) 
        df2['dates'] = df2['fecha'].astype(str) + df2['hour'].astype(str)
        
    end = time.time() 
    T=round((end-start)/60,2)
    print('Tiempo de Ejecución Total: ' + str(T) + ' minutos')

     
    return(df2)
        

obt_dat(df)

After that I'm using .groupby to get the mean values for specific hours. The problem is that the code is taking too long to execute. Does anyone have an idea for shortening the elapsed time of the function obt_dat()?

Answer 1

Use plain Python: lists or dicts instead of dataframes.

If you really need a dataframe, construct it at the end of the CPU-intensive operations.

But that's just my assumption. You might want to do some benchmarking to see how much time each part of the code really takes. "Very long" is relative, but I'm pretty sure that your bottleneck is the dataframe operations you do in the for loop.
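As a rough sketch of that advice (not the answerer's own code), the per-row work can be done on plain Python lists, with the dataframe built once at the end. The names epochs, stamps and elapsed are made up for illustration, and the conversion assumes UTC, which is what the expected output 2021-11-12 22:59:59 in the question corresponds to.

```python
import time
from datetime import datetime, timezone

import numpy as np
import pandas as pd

# Same shape of data as in the question: 17000 readings, 5 seconds apart.
np.random.seed(365)
rows = 17000
epochs = np.arange(1636757999, 1636757999 + 5 * rows, 5)
ta = np.random.uniform(20.25, 23.625, size=rows)

start = time.perf_counter()

# Per-row work happens on plain Python objects, not on DataFrame cells.
stamps = [datetime.fromtimestamp(int(e), tz=timezone.utc) for e in epochs]
hours = [s.hour for s in stamps]
dates = [s.strftime('%Y%m%d%H') for s in stamps]

# Build the DataFrame once, after the loop-style work is done.
df2 = pd.DataFrame(
    {'Ta': ta, 'hour': hours, 'dates': dates},
    index=pd.Index(epochs, name='Epoch'),
)

elapsed = time.perf_counter() - start
print(f'Elapsed: {elapsed:.3f} s')
```

Wrapping each stage in its own time.perf_counter() pair is one simple way to do the benchmarking mentioned above and see which part dominates.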

Answer 2

You can use the dt (date accessors) to eliminate the loops:

df2 = df.copy()
df2['date'] = df.index.values
df2['date'] = pd.to_datetime(df2['date'], unit='s')

df2['hour'] = df2['date'].dt.hour
df2['fecha'] = df2['date'].dt.strftime('%Y%m%d')
df2['dates'] = df2['date'].dt.strftime('%Y%m%d%H')

Timing with your reproducible example gives:

156 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
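The question mentions a later .groupby step to get mean values per hour. A minimal sketch of how that step might look with a vectorized 'dates' key is below; the name hourly_mean is hypothetical, and the data is rebuilt here only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild data shaped like the question's: 17000 readings, 5 seconds apart.
np.random.seed(365)
rows = 17000
df = pd.DataFrame(
    {'Ta': np.random.uniform(20.25, 23.625, size=rows)},
    index=pd.Index(np.arange(1636757999, 1636757999 + 5 * rows, 5), name='Epoch'),
)

# Vectorized conversion of the epoch index, using the same
# year-month-day-hour key format as the 'dates' column above.
date = pd.to_datetime(df.index, unit='s')
df['dates'] = date.strftime('%Y%m%d%H')

# Mean of Ta for each date-hour bucket.
hourly_mean = df.groupby('dates')['Ta'].mean()
print(hourly_mean.head())
```

Grouping on the string key keeps the result comparable with the 'dates' column from the answer; grouping on the datetime values floored to the hour would be an alternative if string keys are not needed.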

Answer 1
0 2021-11-29 19:52:15

Answer 2
0 2021-11-29 19:58:06
