How can I make my Python code run faster?

Question

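I am working on code that loops over multiple netCDF files (large, ~28 GB). The netCDF files contain multiple 4D variables [time, east-west, south-north, height] across a domain. The goal is to loop over these files, visit every location of all these variables in the domain, and pull certain variables to store in a large array. When a file is missing or incomplete, I fill the values with 99.99. Right now I am only testing by looping over 2 daily netCDF files, but for some reason it takes forever (~14 hours). I am not sure whether there is a way to optimize this code. I don't think Python should take this long for this task, but maybe the problem is with Python or with my code. Below is my code; hopefully it is readable. Any suggestions on how to make this faster are greatly appreciated: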

#Domain to loop over
k_space = np.arange(0,37)
j_space = np.arange(80,170)
i_space = np.arange(200,307)

predictors_wrf=[]
names_wrf=[]

counter = 0
cdate = start_date
while cdate <= end_date:
    if cdate.month not in month_keep:
        cdate+=inc
        continue
    yy = cdate.strftime('%Y')        
    mm = cdate.strftime('%m')
    dd = cdate.strftime('%d')
    filename = wrf_path+'\\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    for i in i_space:
        for j in j_space:
            for k in k_space:
                    if os.path.isfile(filename):
                        f = nc.Dataset(filename,'r')
                        times = f.variables['Times'][1:]
                        num_lines = times.shape[0]
                        if num_lines == 144:
                            u = f.variables['U'][1:,k,j,i]
                            v = f.variables['V'][1:,k,j,i]
                            wspd = np.sqrt(u**2.+v**2.)
                            w = f.variables['W'][1:,k,j,i]
                            p = f.variables['P'][1:,k,j,i]
                            t = f.variables['T'][1:,k,j,i]
                        if num_lines < 144:
                            print("partial files for WRF: "+ filename)
                            u = np.ones((144,))*99.99
                            v = np.ones((144,))*99.99
                            wspd = np.ones((144,))*99.99
                            w = np.ones((144,))*99.99
                            p = np.ones((144,))*99.99
                            t = np.ones((144,))*99.99
                    else:
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                        counter=counter+1
                    predictors_wrf.append(u)
                    predictors_wrf.append(v)
                    predictors_wrf.append(wspd)
                    predictors_wrf.append(w)
                    predictors_wrf.append(p)
                    predictors_wrf.append(t)
                    u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                    v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                    wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                    w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                    p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                    t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                    names_wrf.append(u_names)
                    names_wrf.append(v_names)
                    names_wrf.append(wspd_names)
                    names_wrf.append(w_names)
                    names_wrf.append(p_names)
                    names_wrf.append(t_names)
    cdate+=inc

Answer 1

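This is a rough first pass at tightening up your for loops. Since you only use the file shape once per file, you can move that handling outside the loops, which should reduce the amount of data loading that interrupts processing. I still don't understand what counter and inc do, as they don't seem to be updated in the loop. As starting points, you will definitely want to look into the performance of repeated string concatenation and of appending to predictors_wrf and names_wrf.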

k_space = np.arange(0,37)
j_space = np.arange(80,170)
i_space = np.arange(200,307)

predictors_wrf=[]
names_wrf=[]

counter = 0
cdate = start_date
while cdate <= end_date:
    if cdate.month not in month_keep:
        cdate+=inc
        continue
    yy = cdate.strftime('%Y')        
    mm = cdate.strftime('%m')
    dd = cdate.strftime('%d')
    filename = wrf_path+'\\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    file_exists = os.path.isfile(filename)
    if file_exists:
        f = nc.Dataset(filename,'r')
        times = f.variables['Times'][1:]
        num_lines = times.shape[0]
    for i in i_space:
        for j in j_space:
            for k in k_space:
                    if file_exists:    
                        if num_lines == 144:
                            u = f.variables['U'][1:,k,j,i]
                            v = f.variables['V'][1:,k,j,i]
                            wspd = np.sqrt(u**2.+v**2.)
                            w = f.variables['W'][1:,k,j,i]
                            p = f.variables['P'][1:,k,j,i]
                            t = f.variables['T'][1:,k,j,i]
                        if num_lines < 144:
                            print("partial files for WRF: "+ filename)
                            u = np.ones((144,))*99.99
                            v = np.ones((144,))*99.99
                            wspd = np.ones((144,))*99.99
                            w = np.ones((144,))*99.99
                            p = np.ones((144,))*99.99
                            t = np.ones((144,))*99.99
                    else:
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                        counter=counter+1
                    predictors_wrf.append(u)
                    predictors_wrf.append(v)
                    predictors_wrf.append(wspd)
                    predictors_wrf.append(w)
                    predictors_wrf.append(p)
                    predictors_wrf.append(t)
                    u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                    v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                    wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                    w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                    p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                    t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                    names_wrf.append(u_names)
                    names_wrf.append(v_names)
                    names_wrf.append(wspd_names)
                    names_wrf.append(w_names)
                    names_wrf.append(p_names)
                    names_wrf.append(t_names)
    cdate+=inc
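
As a starting point for the string-concatenation and append concerns above, one option is to generate every label up front and write into a preallocated array instead of growing lists. This is a minimal sketch, not the answerer's code: `build_names` and `allocate_predictors` are hypothetical helper names, and the index ranges are shrunk so the example runs instantly.

```python
import itertools

import numpy as np

FILL_VALUE = 99.99
VAR_NAMES = ("u", "v", "wspd", "w", "p", "t")


def build_names(i_space, j_space, k_space):
    """Return labels in the same order the nested loops produce them."""
    return [
        "%s_%d_%d_%d" % (name, k, j, i)
        # product() iterates i outermost and k innermost, like the loops above.
        for i, j, k in itertools.product(i_space, j_space, k_space)
        for name in VAR_NAMES
    ]


def allocate_predictors(n_series, n_times=144):
    """Preallocate one row per series, already filled with the missing-data value."""
    return np.full((n_series, n_times), FILL_VALUE)


# Small stand-in ranges (assumption: the real ranges are the *_space arrays above).
names = build_names(range(200, 202), range(80, 82), range(0, 3))
predictors = allocate_predictors(len(names))
```

With this layout, a missing or partial file needs no work at all, because the rows already hold 99.99; only complete files overwrite their rows.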

Answer 2

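For your question, I think multiprocessing will help a lot. I went through your code and have some suggestions.

Instead of iterating over the start time, use the filenames as the iterator in your code.

Wrap a function that finds all the filenames based on the dates and returns them as a list.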

def fileNames(start_date, end_date):
    # Find all filenames.
    cdate = start_date
    fileNameList = []
    while cdate <= end_date:
        if cdate.month not in month_keep:
            cdate+=inc
            continue
        yy = cdate.strftime('%Y')
        mm = cdate.strftime('%m')
        dd = cdate.strftime('%d')
        filename = wrf_path+'\\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
        fileNameList.append(filename)
        cdate+=inc
    return fileNameList

Wrap the code that extracts the data and fills in 99.99 into a function whose input is a filename.

def dataExtraction(filename):
    # Keep the results local: each worker process has its own copy of any globals.
    predictors_wrf = []
    names_wrf = []
    file_exists = os.path.isfile(filename)
    if file_exists:
        f = nc.Dataset(filename,'r')
        times = f.variables['Times'][1:]
        num_lines = times.shape[0]
    for i in i_space:
        for j in j_space:
            for k in k_space:
                if file_exists:
                    if num_lines == 144:
                        u = f.variables['U'][1:,k,j,i]
                        v = f.variables['V'][1:,k,j,i]
                        wspd = np.sqrt(u**2.+v**2.)
                        w = f.variables['W'][1:,k,j,i]
                        p = f.variables['P'][1:,k,j,i]
                        t = f.variables['T'][1:,k,j,i]
                    if num_lines < 144:
                        print("partial files for WRF: "+ filename)
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                else:
                    # Missing files are not counted here; a global counter
                    # cannot be updated from a worker process.
                    u = np.ones((144,))*99.99
                    v = np.ones((144,))*99.99
                    wspd = np.ones((144,))*99.99
                    w = np.ones((144,))*99.99
                    p = np.ones((144,))*99.99
                    t = np.ones((144,))*99.99
                predictors_wrf.append(u)
                predictors_wrf.append(v)
                predictors_wrf.append(wspd)
                predictors_wrf.append(w)
                predictors_wrf.append(p)
                predictors_wrf.append(t)
                u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                names_wrf.append(u_names)
                names_wrf.append(v_names)
                names_wrf.append(wspd_names)
                names_wrf.append(w_names)
                names_wrf.append(p_names)
                names_wrf.append(t_names)
    # Return a list so the result can be sent back to the parent process.
    return list(zip(predictors_wrf, names_wrf))

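Use multiprocessing to do the work. Most computers have more than one CPU core, and multiprocessing helps increase speed when there is a lot of CPU-bound computation. In my previous experience, multiprocessing reduced the time taken on huge datasets by up to 2/3.
Update: after testing my code and files again on Feb. 25, 2017, I found that using 8 cores on a huge dataset saved me 90% of the elapsed time.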
```
from multiprocessing import Pool  # This should be in the beginning statements.
import datetime

if __name__ == '__main__':
    # datetime objects, since fileNames() calls .month and .strftime() on them.
    start_date = datetime.datetime(2017, 1, 1)
    end_date = datetime.datetime(2017, 1, 15)
    fileNameList = fileNames(start_date, end_date)
    p = Pool(4)  # the number of cores you want to use.
    results = p.map(dataExtraction, fileNameList)
    p.close()
    p.join()
```
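Finally, please be careful with the data structures here, as they are fairly complicated. Hope this helps. Please leave a comment if you have any further questions.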
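
One thing to watch with the pool approach: each worker process has its own copy of module-level lists and counters, so appends made inside a worker are not visible to the parent. The sketch below shows the pattern where each worker returns its result instead. It rests on assumptions: `extract_one`, `run_parallel`, and `count_missing` are hypothetical names, and `extract_one` only returns placeholder arrays where real code would read the NetCDF file.

```python
import os
from multiprocessing import Pool

import numpy as np

FILL_VALUE = 99.99


def extract_one(filename):
    """Worker: return (filename, data, missing_flag) instead of touching globals."""
    if not os.path.isfile(filename):
        return filename, np.full((144,), FILL_VALUE), True
    # Placeholder only: the real extraction (e.g. with netCDF4) would go here.
    return filename, np.zeros((144,)), False


def run_parallel(filenames, workers=4):
    """Map extract_one over the files; results come back in input order."""
    with Pool(workers) as pool:
        return pool.map(extract_one, filenames)


def count_missing(results):
    """Aggregate the per-file flags in the parent process."""
    return sum(1 for _, _, missing in results if missing)
```

Under these assumptions, `run_parallel(file_list)` would be called from inside an `if __name__ == '__main__':` guard, and the missing-file count is computed once in the parent rather than incremented inside the workers.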

Answer 3

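I don't have many suggestions, but there are a couple of things to point out.

Don't open that file so many times

First, you define this filename variable, and then inside the loop (three for loops deep) you check whether the file exists and presumably open it there (I don't know what nc.Dataset does, but I'm guessing it has to open the file and read it):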

filename = wrf_path+'\\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    for i in i_space:
        for j in j_space:
            for k in k_space:
                    if os.path.isfile(filename):
                        f = nc.Dataset(filename,'r')

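That is going to be very inefficient. If the file doesn't change, you can open it once, before all of your loops.

Try to use fewer for loops

All of these nested for loops multiply the number of operations you need to perform. General suggestion: try to use numpy operations instead.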
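
To illustrate the general suggestion, the three nested loops can be replaced by one slice per variable that pulls the whole sub-domain at once. This is a minimal sketch, assuming the variables are laid out as (time, k, j, i), as the indexing `[1:,k,j,i]` in the question suggests. A small random NumPy array stands in for the file variable, and the index ranges are shrunk so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for f.variables['U'] and f.variables['V']; synthetic data only.
u_full = rng.normal(size=(5, 4, 6, 7))
v_full = rng.normal(size=(5, 4, 6, 7))

# Assumption: contiguous index ranges, like the np.arange domains in the question.
k_sl, j_sl, i_sl = slice(0, 3), slice(2, 5), slice(1, 4)

# One read per variable instead of one read per grid point.
u = u_full[1:, k_sl, j_sl, i_sl]
v = v_full[1:, k_sl, j_sl, i_sl]
wspd = np.sqrt(u**2 + v**2)

# Reorder to (i, j, k, time) and flatten, so that rows follow the loop order
# of the original code (i outermost, k innermost).
u_rows = u.transpose(3, 2, 1, 0).reshape(-1, u.shape[0])
wspd_rows = wspd.transpose(3, 2, 1, 0).reshape(-1, wspd.shape[0])
```

netCDF4 variables accept the same slice syntax, so applying it to `f.variables['U']` should read the slab in a single call rather than in hundreds of thousands of small reads.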

Use cProfile

If you want to know why your programs are taking a long time, one of the best ways to find out is to profile them.
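
A short sketch of what profiling with the standard-library `cProfile` and `pstats` modules can look like. `slow_sum` is a made-up, loop-heavy stand-in for the code being investigated, not part of the question.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately loop-heavy stand-in for the code being profiled."""
    total = 0
    for value in range(n):
        total += value * value
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100000)
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

For a whole script, running `python -m cProfile -s cumulative script.py` gives a similar table without changing the code.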

Solution 1
2 Accepted 2017-02-22 04:38:36

Solution 2
2 2017-02-22 15:55:48

Solution 3
1 2017-02-22 04:29:54
