How to implement Multiprocessing in Azure Databricks - Python
I need to get details of each file from a directory. It is taking a long time, so I want to implement multiprocessing so that execution completes sooner.
My code is like this:
    from pathlib import Path
    from os.path import getmtime, getsize
    from datetime import datetime
    from multiprocessing import Pool, Process

    def iterate_directories(root_dir):
        for child in Path(root_dir).iterdir():
            if child.is_file():
                modified_time = datetime.fromtimestamp(getmtime(child)).date()
                file_size = getsize(child)
                # further steps...
            else:
                iterate_directories(child)  ## I need this to run on a separate Process (in parallel)
I tried to make the recursive call run in a separate process using the code below, but it is not working. It comes out of the loop immediately.
    else:
        p = Process(target=iterate_directories, args=(child))
        Pros.append(p)  # Pros is declared earlier as an empty list
        p.start()

    for p in Pros:
        if not p.is_alive():
            p.join()
What am I missing here? How can I process the sub-directories in parallel?
You have to get the list of directories first and then use a multiprocessing pool to call the function, something like below.
    from pathlib import Path
    from os.path import getmtime, getsize
    from datetime import datetime
    from multiprocessing import Pool

    def iterate_directories(root_dir):
        file_details = ''
        for child in Path(root_dir).iterdir():
            if child.is_file():
                modified_time = datetime.fromtimestamp(getmtime(child)).date()
                file_size = getsize(child)
                file_details += '\n{} {} {}'.format(child.name, modified_time, file_size)
            else:
                # recurse and keep the details returned for the sub-directory
                file_details += iterate_directories(child)
        return file_details  # details of the files under this particular directory

    pool = Pool(processes=4)  # define how many processes you would like to run in parallel
    results = pool.map(iterate_directories, [...])  # pass an explicit list of directories here
    print(results)  # the entire collection is printed here; it is a list you can iterate per directory
Please let me know how it goes.
The problem is this line:

    if not p.is_alive():

What this translates to is: only wait for the process if it has already completed, which obviously does not make much sense (you need to remove the not from the statement). It is also completely unnecessary: calling .join() internally does the same thing that p.is_alive() does (except that it blocks). So you can safely just do this:

    for p in Pros:
        p.join()

The code will then wait for all child processes to finish.