Multiprocessing thousands of files with external command

I want to launch an external command from Python for about 8000 files. Every file is processed independently of the others. The only constraint is that execution should only continue once all files have been processed. I have 4 physical cores, each one with 2 logical cores (multiprocessing.cpu_count() returns 8). My idea was to use a pool of four parallel independent processes running on 4 of the 8 cores. This way my machine should remain usable in the meantime.

Here's what I've been doing:

import multiprocessing
import subprocess
import os
from multiprocessing.pool import ThreadPool


def process_files(input_dir, output_dir, option):
    pool = ThreadPool(multiprocessing.cpu_count()/2)
    for filename in os.listdir(input_dir):  # about 8000 files
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        cmd = ['molconvert', option, f_in, '-o', f_out]
        pool.apply_async(subprocess.Popen, (cmd,))
    pool.close()
    pool.join()


def main():
    process_files('dir1', 'dir2', 'mol:H')
    do_some_stuff('dir2')
    process_files('dir2', 'dir3', 'mol:a')
    do_more_stuff('dir3')

Sequential processing takes 120 s for a batch of 100 files. The multiprocessing version outlined above (the process_files function) takes only 20 s for the same batch. However, when I run process_files on the whole set of 8000 files, my PC hangs and remains frozen even after an hour.

My questions are:

1) I thought ThreadPool was supposed to initialize a pool of processes (of multiprocessing.cpu_count()/2 processes here, to be exact). However, my computer hanging on 8000 files but not on 100 suggests that maybe the size of the pool is not taken into account. Either that, or I'm doing something wrong. Could you explain?

2) Is this the right way to launch independent processes from Python when each of them must run an external command, and in such a way that the processing does not take up all of the machine's resources?

If you are using Python 3, I would consider using the map method of concurrent.futures.ThreadPoolExecutor.
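
For example, here is a minimal sketch of how the question's process_files could look with that approach (the max_workers value of 4 is only illustrative; molconvert and the directory arguments come from the question):

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def process_files(input_dir, output_dir, option, max_workers=4):
    def convert(filename):
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        # subprocess.call waits for the command to finish, so each worker
        # thread is occupied for the whole duration of one conversion.
        return subprocess.call(['molconvert', option, f_in, '-o', f_out])

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map blocks until every conversion has completed, and leaving the
        # with block shuts the executor down afterwards.
        return list(executor.map(convert, os.listdir(input_dir)))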

Alternatively, you can manage a list of subprocesses yourself.

The following example defines a function that starts ffmpeg to convert a video file to Theora/Vorbis format. It returns a Popen object for the started subprocess.

import os
import subprocess


def startencoder(iname, oname, offs=None):
    """Start an ffmpeg conversion of iname to Theora/Vorbis and return the Popen object."""
    args = ['ffmpeg']
    if offs is not None and offs > 0:
        args += ['-ss', str(offs)]  # optional start offset
    args += ['-i', iname, '-c:v', 'libtheora', '-q:v', '6', '-c:a',
             'libvorbis', '-q:a', '3', '-sn', oname]
    # Discard ffmpeg's console output.
    with open(os.devnull, 'w') as bb:
        p = subprocess.Popen(args, stdout=bb, stderr=bb)
    return p

In the main program, a list of Popen objects representing running subprocesses is maintained like this.

# tempname(), argv, offset and cpu_count() are defined elsewhere
# in the full program this snippet is taken from.
outbase = tempname()
ogvlist = []
procs = []
maxprocs = cpu_count()
for n, ifile in enumerate(argv):
    # Wait while the list of processes is full.
    while len(procs) == maxprocs:
        manageprocs(procs)
    # Add a new process
    ogvname = outbase + '-{:03d}.ogv'.format(n + 1)
    procs.append(startencoder(ifile, ogvname, offset))
    ogvlist.append(ogvname)
# All jobs have been submitted, wait for them to finish.
while len(procs) > 0:
    manageprocs(procs)

So a new process is only started when there are fewer running subprocesses than cores. Code that is used multiple times is separated into the manageprocs function.

def manageprocs(proclist):
    """Remove finished subprocesses from proclist (sleep comes from the time module)."""
    # Iterate over a copy so that removing items does not skip entries.
    for pr in proclist[:]:
        if pr.poll() is not None:
            proclist.remove(pr)
    sleep(0.5)

The call to sleep is used to prevent the program from spinning in the loop.

I think your basic problem is the use of subprocess.Popen. That call does not wait for the command to complete before returning. Since the function returns immediately (even though the command is still running), the task is finished as far as your ThreadPool is concerned and the pool can spawn another, which means that you end up spawning 8000 or so processes almost at once.

You would probably have better luck using subprocess.check_call:

Run command with arguments.  Wait for command to complete.  If
the exit code was zero then return, otherwise raise
CalledProcessError.  The CalledProcessError object will have the
return code in the returncode attribute.

So:

def process_files(input_dir, output_dir, option):
    # Integer division so the pool size is an int (required in Python 3).
    pool = ThreadPool(multiprocessing.cpu_count() // 2)
    for filename in os.listdir(input_dir):  # about 8000 files
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        cmd = ['molconvert', option, f_in, '-o', f_out]
        # check_call blocks until molconvert exits, so at most
        # pool-size conversions run at any one time.
        pool.apply_async(subprocess.check_call, (cmd,))
    pool.close()
    pool.join()

If you really don't care about the exit code, then you may want subprocess.call, which will not raise an exception in the event of a non-zero exit code from the process.
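
For instance, here is a sketch of the same loop using subprocess.call and collecting the exit codes afterwards (the result handling is only an illustration, not part of the original code):

def process_files(input_dir, output_dir, option):
    pool = ThreadPool(multiprocessing.cpu_count() // 2)
    results = []
    for filename in os.listdir(input_dir):
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        cmd = ['molconvert', option, f_in, '-o', f_out]
        # subprocess.call returns the exit code instead of raising,
        # so failures can be inspected after the pool has finished.
        results.append((filename, pool.apply_async(subprocess.call, (cmd,))))
    pool.close()
    pool.join()
    # Names of files whose conversion returned a non-zero exit code.
    return [name for name, res in results if res.get() != 0]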
