How to parallelise a nested loop in dask

I have the following code:

while (i< 10):
  for i in range(0, len(df_1)):
      new_df_1 = df_1.iloc[i]
      for j in (len(df_2)):
         new_df_2 = df_2.iloc[j]
         client.compute(self.func(i, new_df_1, new_df_2), scheduler="processes"), 
          break

I don't know how to use dask with nested loops like these to speed up the code. I tried moving the inner loop into a function, as shown below, but it raises an error.

This is what I have tried:

while (i< 10):
  for i in range(0, len(df_1)):
      new_df_1 = df_1.iloc[i]
      def process_l(i, client, new_df_1, new_df_2):
         for j in (len(df_2)):
            new_df_2 = df_2.iloc[j]
            client.compute(self.func(i, new_df_1, new_df_2), scheduler="processes"), 
            break

      client.submit(process_l(i, new_df_1, new_df_2)
    

Calling .compute() blocks further execution until its results are ready, so the loop handles one task at a time. Instead, you might want to use delayed or client.submit. Here's a rough suggestion:

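futs = []
# Bounding the range at 10 replaces the outer while loop.
for i in range(min(10, len(df_1))):
    new_df_1 = df_1.iloc[i]
    for j in range(len(df_2)):
        new_df_2 = df_2.iloc[j]

        # submit() schedules the call and returns a future immediately,
        # without waiting for the result. Pass the function and its
        # arguments separately; do not call self.func(...) here.
        # Extra keyword arguments are forwarded to the function, so
        # scheduler="processes" is dropped: whether workers are processes
        # or threads is decided when the Client is created.
        fut = client.submit(self.func, i, new_df_1, new_df_2)
        futs.append(fut)

results = client.gather(futs)  # blocks until all results are ready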
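
For the delayed route mentioned above, a minimal sketch might look like the following. It is only an illustration under stated assumptions: func is a hypothetical stand-in for self.func, and the two small frames with columns "a" and "b" are made up so that the example runs on its own.

```python
import dask
import pandas as pd
from dask import delayed


def func(i, row_1, row_2):
    # Hypothetical stand-in for self.func: combine one row from each frame.
    return i, float(row_1["a"] * row_2["b"])


def build_tasks(df_1, df_2, limit=10):
    """Build one lazy task per (row of df_1, row of df_2) pair."""
    tasks = []
    for i in range(min(limit, len(df_1))):
        row_1 = df_1.iloc[i]
        for j in range(len(df_2)):
            row_2 = df_2.iloc[j]
            # delayed(func)(...) only records the call; nothing runs yet.
            tasks.append(delayed(func)(i, row_1, row_2))
    return tasks


# Made-up example data (assumption, not from the question).
df_1 = pd.DataFrame({"a": [1, 2, 3]})
df_2 = pd.DataFrame({"b": [10, 20]})

tasks = build_tasks(df_1, df_2)
# dask.compute runs all recorded tasks together and returns a tuple
# with one result per task, in the order the tasks were created.
results = dask.compute(*tasks, scheduler="threads")
```

Here the scheduler keyword belongs to dask.compute, which is where a value such as "processes" could be passed instead of "threads"; with the process-based scheduler the function has to be picklable, and on some platforms the script needs an if __name__ == "__main__": guard.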


 