A faster way to run a for loop over a very large list of dataframes

Question

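I am using two nested for loops to calculate a value from combinations of the elements in a list of dataframes. The list contains a large number of dataframes, and running the two for loops takes a considerable amount of time.

Is there a way to make this operation faster?

The functions I refer to by dummy names are the ones where I calculate the results.

My code looks like this: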

    conf_list = []

    for tr in range(len(trajectories)):
        df_1 = trajectories[tr]

        # Skip empty dataframes.
        if len(df_1) == 0:
            continue

        for tt in range(len(trajectories)):
            df_2 = trajectories[tt]

            if len(df_2) == 0:
                continue

            # Skip identical dataframes and pairs whose time ranges do not overlap.
            if df_1.equals(df_2) or df_1['time'].iloc[0] > df_2['time'].iloc[-1] or df_2['time'].iloc[0] > df_1['time'].iloc[-1]:
                continue

            df_temp = cartesian_product_basic(df_1, df_2)

            flg, df_temp = another_function(df_temp)

            if flg == 0:
                continue

            flg_h = some_other_function(df_temp)

            if flg_h == 1:
                conf_list.append(1)

My input list contains around 5000 dataframes, each with a few hundred rows, that look like this:

ID	x	y	z	time
1	5	7	2	5

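What I do is take the Cartesian product of the two dataframes in each combination, and for each pair I calculate another value 'c'. If this value c meets a condition, I add an element to my c_list (conf_list in the code above) so that I can get the final number of pairs meeting the requirement.

Some further information:

a_function(df_1, df_2) (called cartesian_product_basic in the code above) is a function that returns the Cartesian product of two dataframes.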
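
For illustration, a Cartesian product of two dataframes can be written with a pandas cross merge. This is only a sketch: the name cartesian_product and the sample values are made up here, how="cross" requires pandas 1.2 or later, and the asker's own function may well be implemented differently. The default _x/_y suffixes give the column names (time_x, z_y, and so on) that the question's code relies on.

```python
import pandas as pd

def cartesian_product(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Pair every row of `left` with every row of `right` (hypothetical helper)."""
    # Overlapping column names receive the default "_x" / "_y" suffixes.
    return left.merge(right, how="cross")

# Made-up sample data with the same columns as the question's dataframes.
df_1 = pd.DataFrame({"ID": [1, 1], "x": [5, 6], "y": [7, 8], "z": [2, 3], "time": [5, 6]})
df_2 = pd.DataFrame({"ID": [2, 2, 2], "x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9], "time": [5, 6, 7]})

pairs = cartesian_product(df_1, df_2)
print(pairs.shape)  # 2 rows x 3 rows, 5 columns from each side
```

The result grows with the product of the two row counts, which is why this step tends to dominate the running time for dataframes with a few hundred rows each.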

another_function looks like this:

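    # Assumption: nwh is an alias for np.where (e.g. from numpy import where as nwh).
    def another_function(df_temp):
        # Keep only row pairs recorded at the same time; store their vertical separation.
        df_temp['z_dif'] = nwh(df_temp['time_x'] == df_temp['time_y'],
                               abs(df_temp['z_x'] - df_temp['z_y']), np.nan)
        df_temp = df_temp.dropna()

        # Drop pairs that are at least 1000 apart vertically.
        df_temp['vert_conf'] = nwh(df_temp['z_dif'] >= 1000, np.nan, 1)
        df_temp = df_temp.dropna()

        # flg is 1 if any candidate rows remain, otherwise 0.
        if len(df_temp) == 0:
            flg = 0
        else:
            flg = 1

        return flg, df_temp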
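
Since another_function keeps only the rows where time_x equals time_y, the same rows can probably be obtained without building the full Cartesian product, by joining the two dataframes on time. A minimal sketch, assuming each dataframe has time and z columns; the name vertical_conflicts and the min_sep parameter are hypothetical, and unlike dropna() in the original it does not discard rows that have NaN in unrelated columns.

```python
import pandas as pd

def vertical_conflicts(df_1, df_2, min_sep=1000):
    """Return same-time row pairs whose vertical separation is below min_sep."""
    # An inner join on time only pairs rows with equal timestamps, so the
    # result has far fewer rows than a full cross product.
    merged = df_1.merge(df_2, on="time", suffixes=("_x", "_y"))
    merged["z_dif"] = (merged["z_x"] - merged["z_y"]).abs()
    return merged[merged["z_dif"] < min_sep]

# Made-up sample data: only times 2 and 3 appear in both dataframes.
df_1 = pd.DataFrame({"time": [1, 2, 3], "z": [0, 0, 0]})
df_2 = pd.DataFrame({"time": [2, 3, 4], "z": [500, 2000, 0]})

close = vertical_conflicts(df_1, df_2)
print(close[["time", "z_dif"]])
```

After such a join there is a single time column rather than time_x and time_y, so any later code that reads those two columns would need adjusting.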

some_other_function looks like this:

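    # Assumption: hypot is numpy's hypot (e.g. from numpy import hypot).
    def some_other_function(df_temp):
        # Note: the columns are named *_dif but are computed as products, as in the original.
        df_temp['x_dif'] = df_temp['x_x'] * df_temp['x_y']
        df_temp['y_dif'] = df_temp['y_x'] * df_temp['y_y']
        df_temp['hor_dif'] = hypot(df_temp['x_dif'], df_temp['y_dif'])

        df_temp['conf'] = np.where(df_temp['hor_dif'] <= 5, 1, np.nan)

        # Default to 0 so flg_h is always defined before returning.
        flg_h = 0
        if df_temp['conf'].sum() > 0:
            flg_h = 1

        return flg_h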

Answer 1

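Here are some ways to make your code run faster:

Use list comprehensions instead of for loops.
Use built-in functions such as map, filter, sum, etc.; this will make your code faster.
Avoid repeated use of the '.' (dot) operator, for example: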

import datetime
a = datetime.datetime.now()  # don't use this: two attribute lookups on every call
timenow = datetime.datetime.now  # look the method up once and bind it to a local name
a = timenow()  # use this

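Use libraries whose operations are implemented in C/C++, such as numpy.
Don't convert between data types unnecessarily.
In infinite loops, use 1 instead of True (this mattered in Python 2; Python 3 compiles both forms identically).
Use the built-in libraries.
If the data will not change, convert it to a tuple.
Use string concatenation.
Use multiple assignment.
Use generators.
When checking a Boolean value with if-else, avoid the equality operator (==).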

# Instead of the approach below
if a==1:
    print('a is 1')
else:
    print('a is 0')

# Try this approach
if a:
    print('a is 1')
else:
    print('a is 0')

# This saves the time otherwise spent comparing the two values.
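
Applied to the loop in the question, a few of these tips (a list comprehension, tuples, a generator) could be combined to cut down the number of pairs before any dataframe work is done. The sketch below is an illustration under assumptions, not the asker's code: overlapping_pairs and spans are hypothetical names, each trajectory is reduced to a (start, end) tuple of its first and last time, and every unordered pair is visited once, which assumes the conflict check gives the same result in both orders.

```python
from itertools import combinations

def overlapping_pairs(spans):
    """Yield index pairs (i, j), i < j, whose [start, end] time spans overlap.

    `spans` holds one (start, end) tuple per trajectory, or None for an
    empty trajectory.
    """
    # Drop empty trajectories once, instead of re-checking them in the inner loop.
    indexed = [(i, span) for i, span in enumerate(spans) if span is not None]
    for (i, (a_start, a_end)), (j, (b_start, b_end)) in combinations(indexed, 2):
        # Same overlap test as in the question, expressed on plain tuples.
        if a_start <= b_end and b_start <= a_end:
            yield i, j

# Made-up spans; index 2 stands for an empty trajectory.
spans = [(0, 10), (5, 15), None, (20, 30), (12, 25)]
candidates = list(overlapping_pairs(spans))
print(candidates)  # [(0, 1), (1, 4), (3, 4)]
```

With the question's data, spans could be built as (df['time'].iloc[0], df['time'].iloc[-1]) for each non-empty dataframe and None otherwise. The original loop visits each pair in both orders, so its count would presumably be twice the count obtained this way.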

Solution 1
1 2022-02-26 13:32:20
