
How to implement Multiprocessing in Azure Databricks - Python

I need to get the details of each file in a directory, but it is taking a long time. I want to implement multiprocessing so that the traversal finishes sooner.

My code is like this:

from datetime import datetime
from os.path import getmtime, getsize
from pathlib import Path
from multiprocessing import Pool, Process

def iterate_directories(root_dir):
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            # further steps...
        else:
            iterate_directories(child)  # I need this to run on a separate Process (in parallel)


I tried to do the recursive call using the code below, but it is not working; it comes out of the loop immediately.

else:
    p = Process(target=iterate_directories, args=(child,))
    Pros.append(p)  # Pros is declared earlier as an empty list
    p.start()

for p in Pros:
    if not p.is_alive():
        p.join()

What am I missing here? How can I process the sub-directories in parallel?

You have to get the list of directories first, and then use a multiprocessing pool to call the function.

Something like below:

from datetime import datetime
from os.path import getmtime, getsize
from pathlib import Path
from multiprocessing import Pool

def iterate_directories(root_dir):
    file_details = ''
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            # add the file name, modified time and size details
            file_details = file_details + '\n' + str(child) + ' ' + str(modified_time) + ' ' + str(file_size)
        else:
            file_details = file_details + iterate_directories(child)
    return file_details  # details of the files under that particular directory

pool = Pool(processes=4)  # define how many processes you like to run in parallel
results = pool.map(iterate_directories, ['dir1', 'dir2'])  # pass an explicit list of top-level directories
print(results)  # the entire collection is printed here; it is a list you can iterate per directory


Please let me know how it goes.

The problem is this line:

if not p.is_alive():

What this translates to is: wait for the process to complete only if it has already completed, which obviously does not make much sense (you need to remove the `not` from the statement). It is also completely unnecessary: calling `.join` internally does the same check that `p.is_alive` does (except that it blocks). So you can safely just do this:

for p in Pros:
    p.join()

The code will then wait for all child processes to finish.

