Python script taking too long to run?

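I am writing a Python script that basically does the following:

  1. Reads a CSV file as a DataFrame object.
  2. Selects some columns by name and stores them in a new DF object.
  3. Does some math and string manipulation on the cell values. I am using a for loop and the iterrows() method here.
  4. Writes the modified DF to a CSV.
  5. Writes the CSV to JSON using a for loop.

This code takes forever to run. I am trying to understand why it takes so long, and whether I should approach the task differently to speed up execution.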

import pandas
import json
import pendulum
import csv
import os
import time

start_time = time.time()
print("--- %s seconds ---" % (time.time() - start_time))

os.chdir('/home/csv_files_from_REC')
df11 = pandas.read_csv('RTP_Gap_2018-01-21.csv') ### Reads the CSV FILE

print(df11.shape) ### Prints the shape of the DF

### Filter the initial DF by selecting some columns based on NAME
df1 = df11[['ENODEB','DAY','HR','SITE','RTP_Gap_Length_Total_sec','RTP_Session_Duration_Total_sec','RTP_Gap_Duration_Ratio_Avg%']]

print(df1.shape) ## Prints Shape

#### Math and String manipulation stuff ###
for index, row in df1.iterrows():
    if row['DAY'] == 'Total':
        df1.drop(index, inplace=True)
    else:
        stamp = row['DAY'] + ' ' + str(row['HR']) + ':00:00'
        sitename = str(row['ENODEB'])+'_'+row['SITE']
        if row['RTP_Session_Duration_Total_sec'] == 0:
            rtp_gap = 0
        else:
            rtp_gap = row['RTP_Gap_Length_Total_sec']/row['RTP_Session_Duration_Total_sec']
        time1 = pendulum.parse(stamp,tz='America/Chicago').isoformat()
        df1.loc[index,'DAY'] = time1
        df1.loc[index,'SITE'] = sitename
        df1.loc[index,'HR'] = rtp_gap

### Write DF to CSV ###
df1.to_csv('RTP_json.csv',index=None)
json_file_ind = 'RTP_json.json'
file = open(json_file_ind, 'w')
file.write("")
file.close()

#### Write CSV to JSON ###
with open('RTP_json.csv', 'r') as csvfile:
    reader_ind = csv.DictReader(csvfile)
    row=[]
    for row in reader_ind:         
        row["RTP_Gap_Length_Total_sec"] = float(row["RTP_Gap_Length_Total_sec"])
        row["RTP_Session_Duration_Total_sec"] = float(row["RTP_Session_Duration_Total_sec"])
        row["RTP_Gap_Duration_Ratio_Avg%"] = float(row["RTP_Gap_Duration_Ratio_Avg%"])
        row["HR"] = float(row["HR"])
        with open('RTP_json.json', 'a') as json_file_ind:
            json.dump(row, json_file_ind)
            json_file_ind.write('\n')

end_time = time.time()
print("--- %s seconds ---" % (end_time - start_time))

Output

    --- 2018-01-23T12:25:07.411691-06:00 seconds ---### START TIME
    (2055, 36) ### SIZE of initial DF
    (2055, 7) ### Size of Filtered DF
    --- 2018-01-23T12:31:54.480568-06:00 seconds --- --- ### END TIME

This part should speed up your DataFrame computations considerably:

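import numpy as np

# Select the columns by name; .copy() avoids SettingWithCopyWarning on the assignments below
df1 = df11[['ENODEB','DAY','HR','SITE','RTP_Gap_Length_Total_sec','RTP_Session_Duration_Total_sec','RTP_Gap_Duration_Ratio_Avg%']].copy()

print(df1.shape) ## Prints Shape

# Drop the 'Total' rows in one step; drop=True keeps the old index from becoming a new column
df1 = df1[df1['DAY'] != 'Total'].reset_index(drop=True)
# pendulum.parse takes a single string, so build the strings column-wise and apply it per value
stamps = df1['DAY'] + ' ' + df1['HR'].astype(str) + ':00:00'
df1['DAY'] = stamps.apply(lambda s: pendulum.parse(s, tz='America/Chicago').isoformat())
df1['SITE'] = df1['ENODEB'].astype(str) + '_' + df1['SITE']
# The division is evaluated for every row; np.where then substitutes 0 where the duration is 0
df1['HR'] = np.where(df1['RTP_Session_Duration_Total_sec']==0,0,df1['RTP_Gap_Length_Total_sec']/df1['RTP_Session_Duration_Total_sec'])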
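
To see the column-wise version end to end, here is a minimal sketch on a toy frame. The values are made up, and it swaps pendulum.parse for pandas' own to_datetime and tz_localize so the timestamp column is parsed in one call; that swap is my assumption, not something from the original script.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV (made-up values, same column names as the question)
df = pd.DataFrame({
    'ENODEB': [101, 102, 103],
    'DAY': ['2018-01-21', '2018-01-21', 'Total'],
    'HR': [5, 6, 0],
    'SITE': ['A', 'B', 'C'],
    'RTP_Gap_Length_Total_sec': [10.0, 3.0, 13.0],
    'RTP_Session_Duration_Total_sec': [40.0, 0.0, 40.0],
})

# Drop the summary rows with a boolean mask instead of calling drop() inside a loop
df = df[df['DAY'] != 'Total'].reset_index(drop=True)

# Build zero-padded timestamp strings column-wise, parse them in one call,
# then attach the time zone
text = df['DAY'] + ' ' + df['HR'].astype(str).str.zfill(2) + ':00:00'
stamps = pd.to_datetime(text, format='%Y-%m-%d %H:%M:%S')
stamps = stamps.dt.tz_localize('America/Chicago')
df['DAY'] = stamps.map(lambda t: t.isoformat())

# String concatenation also works on whole columns
df['SITE'] = df['ENODEB'].astype(str) + '_' + df['SITE']

# Turn zero durations into NaN so nothing is divided by zero, then fill with 0
duration = df['RTP_Session_Duration_Total_sec'].replace(0, np.nan)
df['HR'] = (df['RTP_Gap_Length_Total_sec'] / duration).fillna(0)

print(df[['DAY', 'SITE', 'HR']])
```

If the output has to match pendulum exactly, compare a few rows from both versions before switching.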

Also, why bother writing a CSV and then reading it back in?

Convert the df to JSON format directly:

json_file_ind = 'RTP_json.json'
# to_json returns one JSON string, so looping over its result would write one character per line.
# lines=True writes one JSON object per line, the same layout the original loop produced.
df1.to_json(json_file_ind, orient='records', lines=True)
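
As a quick check of what to_json produces, here is a small sketch on a toy frame (made-up values). Called without a path, to_json returns a single string, so iterating over that result visits characters, not records. With lines=True you get one JSON object per line.

```python
import json
import pandas as pd

# Toy frame with made-up values
df = pd.DataFrame({
    'SITE': ['101_A', '102_B'],
    'HR': [0.25, 0.0],
})

# One string holding a JSON array of records
as_array = df.to_json(orient='records')

# One JSON object per line (JSON Lines)
as_lines = df.to_json(orient='records', lines=True)

# Each line parses on its own
records = [json.loads(line) for line in as_lines.splitlines()]

print(type(as_array).__name__)
for record in records:
    print(record['SITE'], record['HR'])
```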

This should speed up the code considerably.
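
If you want to measure the difference yourself, a rough sketch like the one below times a row-by-row loop against the column-wise version on synthetic data. The column names gap and duration and the 2000-row size are placeholders I picked. Timings will vary by machine, and this only covers the ratio step, not the timestamp parsing or the file writes.

```python
import time
import numpy as np
import pandas as pd

n = 2000  # roughly the row count reported in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'gap': rng.random(n),
    'duration': rng.integers(0, 5, n).astype(float),  # includes some zeros
})

def ratio_loop(frame):
    # Row-by-row: one .loc assignment per row, as in the question
    out = frame.copy()
    out['ratio'] = 0.0
    for index, row in out.iterrows():
        if row['duration'] != 0:
            out.loc[index, 'ratio'] = row['gap'] / row['duration']
    return out

def ratio_vectorized(frame):
    # Column-wise: a single division over the whole column
    out = frame.copy()
    safe = out['duration'].replace(0, np.nan)
    out['ratio'] = (out['gap'] / safe).fillna(0)
    return out

t0 = time.perf_counter()
looped = ratio_loop(df)
t1 = time.perf_counter()
vectorized = ratio_vectorized(df)
t2 = time.perf_counter()

print('loop:       %.4f s' % (t1 - t0))
print('vectorized: %.4f s' % (t2 - t1))
```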
