Faster way to run a for loop over a very large list of dataframes

I am using two nested for loops to calculate a value from combinations of elements in a list of dataframes. The list contains a large number of dataframes, and the two nested loops take a considerable amount of time.

Is there a way I can do the operation faster?

The functions I refer to with dummy names are the ones where I calculate the results.

My code looks like this:

    conf_list = []

    for tr in range(len(trajectories)):
        df_1 = trajectories[tr]

        if len(df_1) == 0:
            continue

        for tt in range(len(trajectories)):
            df_2 = trajectories[tt]

            if len(df_2) == 0:
                continue

            # skip identical frames and pairs whose time ranges do not overlap
            if (df_1.equals(df_2)
                    or df_1['time'].iloc[0] > df_2['time'].iloc[-1]
                    or df_2['time'].iloc[0] > df_1['time'].iloc[-1]):
                continue

            df_temp = cartesian_product_basic(df_1, df_2)

            flg, df_temp = another_function(df_temp)

            if flg == 0:
                continue

            flg_h = some_other_function(df_temp)

            if flg_h == 1:
                conf_list.append(1)

My input list consists of around 5000 dataframes, each with several hundred rows, that look like this:

id x y z time
1 5 7 2 5

What I do is take the Cartesian product of each pair of dataframes, and for each pair I calculate another value 'c'. If c meets a condition, I append an element to conf_list, so that at the end I get the number of pairs that meet the requirement.
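One way the pairing described above could be sketched is with itertools.combinations, which visits each unordered pair once instead of twice, together with first/last time stamps cached up front. The helper name overlapping_pairs and the sample data are made up for illustration. Note two differences from the loop in the question: each pair is produced once rather than in both orders, and frames with identical content at different list positions are not skipped.

```python
from itertools import combinations

import pandas as pd


def overlapping_pairs(trajectories):
    """Yield (df_1, df_2) for each unordered pair whose time ranges overlap."""
    # Drop empty frames once and cache the first/last time of each remaining frame.
    frames = [df for df in trajectories if len(df) > 0]
    spans = [(df["time"].iloc[0], df["time"].iloc[-1]) for df in frames]

    for i, j in combinations(range(len(frames)), 2):
        (start_i, end_i), (start_j, end_j) = spans[i], spans[j]
        # Same overlap test as in the question, but on cached scalars.
        if start_i > end_j or start_j > end_i:
            continue
        yield frames[i], frames[j]


# Tiny illustrative data (hypothetical values).
a = pd.DataFrame({"id": [1, 1], "x": [0, 1], "y": [0, 1], "z": [100, 100], "time": [0, 5]})
b = pd.DataFrame({"id": [2, 2], "x": [0, 2], "y": [0, 2], "z": [200, 200], "time": [3, 8]})
c = pd.DataFrame({"id": [3, 3], "x": [9, 9], "y": [9, 9], "z": [300, 300], "time": [20, 25]})
empty = pd.DataFrame(columns=["id", "x", "y", "z", "time"])

pairs = list(overlapping_pairs([a, empty, b, c]))
print(len(pairs))
```

Only the pairs that survive this cheap filter would then need the expensive per-pair work.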

For further info:

cartesian_product_basic(df_1, df_2) is a function that returns the Cartesian product of two dataframes.
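The question does not show this function, so the following is only a guess at what it might look like, assuming pandas 1.2 or newer where merge accepts how="cross". The default suffixes _x and _y match the column names used in the later functions.

```python
import pandas as pd


def cartesian_product_basic(df_1, df_2):
    # how="cross" pairs every row of df_1 with every row of df_2;
    # column names present in both frames get the default suffixes _x and _y.
    return df_1.merge(df_2, how="cross")


# Hypothetical sample data.
left = pd.DataFrame({"id": [1, 1], "z": [100, 200], "time": [5, 6]})
right = pd.DataFrame({"id": [2, 2, 2], "z": [150, 900, 2000], "time": [5, 6, 7]})

product = cartesian_product_basic(left, right)
print(product.shape)
print(list(product.columns))
```

With several hundred rows per frame, each product has tens of thousands of rows, which is likely where much of the time goes.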

another_function looks like this:

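    import numpy as np

    def another_function(df_temp):
        # nwh in the original post is assumed to be an alias for np.where
        # z difference only where both rows share the same time stamp
        df_temp['z_dif'] = np.where(df_temp['time_x'] == df_temp['time_y'],
                                    abs(df_temp['z_x'] - df_temp['z_y']), np.nan)
        df_temp = df_temp.dropna()

        # drop pairs that are 1000 or more apart in z
        df_temp['vert_conf'] = np.where(df_temp['z_dif'] >= 1000, np.nan, 1)
        df_temp = df_temp.dropna()

        flg = 0 if len(df_temp) == 0 else 1

        return flg, df_temp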
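Since another_function keeps only the rows where time_x equals time_y, the Cartesian product followed by that filter could probably be replaced by an inner merge on time. The sketch below assumes the columns from the question and no missing values; vertical_candidates and max_z_dif are names invented here. The merged frame has a single time column instead of time_x and time_y, and rows with NaN in columns other than z are not dropped the way dropna would drop them.

```python
import pandas as pd


def vertical_candidates(df_1, df_2, max_z_dif=1000):
    """Row pairs that share a time stamp and are less than max_z_dif apart in z."""
    # The inner join on time keeps only row pairs with equal time stamps,
    # so the full Cartesian product is never materialised.
    merged = df_1.merge(df_2, on="time", suffixes=("_x", "_y"))
    merged["z_dif"] = (merged["z_x"] - merged["z_y"]).abs()
    # Boolean mask instead of the NaN marker + dropna round trip.
    merged = merged[merged["z_dif"] < max_z_dif].copy()
    return len(merged) > 0, merged


# Hypothetical sample data.
df_1 = pd.DataFrame({"id": 1, "x": [0, 1, 2], "y": [0, 1, 2], "z": [100, 200, 300], "time": [5, 6, 7]})
df_2 = pd.DataFrame({"id": 2, "x": [3, 4, 5], "y": [3, 4, 5], "z": [150, 5000, 0], "time": [6, 7, 8]})

flg, result = vertical_candidates(df_1, df_2)
print(flg, len(result))
```

The result still carries x_x, x_y, y_x and y_y, so a later horizontal check can use it unchanged.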

and some_other_function looks like this:

    def some_other_function(df_temp):
        # note: the *_dif columns are products here; use '-' if differences are intended
        df_temp['x_dif'] = df_temp['x_x'] * df_temp['x_y']
        df_temp['y_dif'] = df_temp['y_x'] * df_temp['y_y']
        df_temp['hor_dif'] = np.hypot(df_temp['x_dif'], df_temp['y_dif'])

        df_temp['conf'] = np.where(df_temp['hor_dif'] <= 5, 1, np.nan)

        # default to 0 so flg_h is always defined
        flg_h = 0
        if df_temp['conf'].sum() > 0:
            flg_h = 1

        return flg_h
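As only a yes/no answer is needed, the same check might be done on NumPy arrays without adding columns to df_temp. This is a sketch that keeps the arithmetic exactly as posted (products, then hypot); has_horizontal_conflict and max_hor_dif are names made up for illustration.

```python
import numpy as np
import pandas as pd


def has_horizontal_conflict(df_temp, max_hor_dif=5):
    # Same arithmetic as some_other_function in the question,
    # computed on plain arrays so no temporary columns are created.
    x_term = df_temp["x_x"].to_numpy() * df_temp["x_y"].to_numpy()
    y_term = df_temp["y_x"].to_numpy() * df_temp["y_y"].to_numpy()
    hor_dif = np.hypot(x_term, y_term)
    # True as soon as at least one row is within the threshold.
    return bool((hor_dif <= max_hor_dif).any())


# Hypothetical sample data: the first row gives hypot(3, 4) = 5, the second is far away.
close = pd.DataFrame({"x_x": [1, 10], "x_y": [3, 10], "y_x": [1, 10], "y_y": [4, 10]})
far = close.iloc[[1]]

print(has_horizontal_conflict(close), has_horizontal_conflict(far))
```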

The following are ways to make your code run faster:

  • Use a list comprehension instead of a for loop where possible.
  • Use built-in functions like map, filter, sum, etc.; this will make your code faster.
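  • Avoid repeated attribute ('.') lookups inside hot loops by binding the attribute to a local name first, for example
import datetime
a = datetime.datetime.now()  # repeated dot lookups: avoid this inside a hot loop
from datetime import datetime
timenow = datetime.now  # datetime.datetime is a class, so now cannot be imported from it directly
a = timenow()  # use this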
  • Use C/C++-based libraries such as numpy.
  • Don't convert data types unnecessarily.
  • In infinite loops, "while 1" was slightly faster than "while True" in Python 2; in Python 3 the two are equivalent.
  • Use built-in libraries.
  • If the data will not change, store it in a tuple.
  • For string concatenation, prefer str.join over repeated + when joining many strings.
  • Use multiple assignment.
  • Use generators.
  • When using if-else to check a Boolean value, avoid the comparison operator (==).
# Instead of Below approach
if a==1:
    print('a is 1')
else:
    print('a is 0')

# Try this approach 
if a:
    print('a is 1')
else:
    print('a is 0')

# This helps because the time spent comparing the two values is saved.
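Whether any of these tips pays off depends on the workload, so it is worth measuring. Below is a minimal sketch with the standard timeit module that compares a Python-level loop with the equivalent numpy call; the array size and function names are arbitrary choices for illustration.

```python
import timeit

import numpy as np

values = np.arange(100_000, dtype=np.int64)


def python_loop_sum(arr):
    # Element-by-element loop executed by the Python interpreter.
    total = 0
    for v in arr:
        total += v
    return total


def numpy_sum(arr):
    # The same reduction executed inside numpy's compiled code.
    return arr.sum()


loop_time = timeit.timeit(lambda: python_loop_sum(values), number=3)
numpy_time = timeit.timeit(lambda: numpy_sum(values), number=3)
print(f"python loop: {loop_time:.4f} s, numpy: {numpy_time:.4f} s")
```

The same pattern can be used to time one iteration of the inner loop in the question before and after a change.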
