How to implement Multiprocessing in Azure Databricks - Python
I need to get details of each file from a directory. It is taking a long time, so I want to implement multiprocessing so that execution completes sooner.
My code is like this:
    from pathlib import Path
    from os.path import getmtime, getsize
    from datetime import datetime
    from multiprocessing import Pool, Process

    def iterate_directories(root_dir):
        for child in Path(root_dir).iterdir():
            if child.is_file():
                modified_time = datetime.fromtimestamp(getmtime(child)).date()
                file_size = getsize(child)
                # further steps...
            else:
                iterate_directories(child)  ## I need this to run on a separate Process (in parallel)
I tried to make the recursive call run in a separate process using the code below, but it is not working. It comes out of the loop immediately.
    else:
        p = Process(target=iterate_directories, args=(child))
        Pros.append(p)  # Pros is declared earlier as an empty list
        p.start()

    for p in Pros:
        if not p.is_alive():
            p.join()
What am I missing here? How can I process the sub-directories in parallel?
You have to get the list of directories first and then use a multiprocessing pool to call the function, something like below.
    from pathlib import Path
    from os.path import getmtime, getsize
    from datetime import datetime
    from multiprocessing import Pool

    def iterate_directories(root_dir):
        file_details = ''
        for child in Path(root_dir).iterdir():
            if child.is_file():
                modified_time = datetime.fromtimestamp(getmtime(child)).date()
                file_size = getsize(child)
                file_details += '\n{} {} {}'.format(child.name, modified_time, file_size)
            else:
                # recurse and keep the details returned for the sub-directory
                file_details += iterate_directories(child)
        return file_details  # details of the files under this particular directory

    pool = Pool(processes=4)  # define how many processes you would like to run in parallel
    results = pool.map(iterate_directories, [...])  # pass an explicit list of directories here
    print(results)  # the entire collection is printed here; it is a list you can iterate per directory
Please let me know how it goes.
The problem is this line:

    if not p.is_alive():

What this translates to is: only wait for the process if it has already completed, which obviously does not make much sense (you need to remove the not from the statement). It is also completely unnecessary: calling .join() internally does the same thing that p.is_alive() does (except that it blocks). So you can safely just do this:

    for p in Pros:
        p.join()

The code will then wait for all child processes to finish.