Python / multiprocessing: processes do not seem to start

Question

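I have a function that reads a binary file and converts each byte into a corresponding sequence of characters. For example, 0x05 becomes 'AACC', 0x2A becomes 'AGGG', and so on. The function that reads the file and converts the bytes is currently linear, and since the files to convert are anywhere between 25 KB and 2 MB, this can take quite a while.

So I'm trying to use multiprocessing to divide the task and hopefully improve speed. However, I just can't get it to work. Below is the linear function, which works, albeit slowly.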
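To make the mapping concrete, here is a minimal sketch that reproduces the two examples above. It is only a guess at what the conversion does: the names `ALPHABET`, `byte_to_rna` and `block_to_rna` are hypothetical, and the assumption is that each byte is written as four base-4 digits (most significant first) with A=0, C=1, G=2. The fourth letter is not determined by the examples, so "U" is a placeholder.

```python
# Hypothetical stand-in for the conversion described above; the real helper is not shown.
# Assumption: one byte -> four base-4 digits, most significant first, A=0, C=1, G=2.
# The fourth letter ("U") is a guess, since neither example uses the digit 3.
ALPHABET = "ACGU"

def byte_to_rna(value):
    """Return the 4-letter string for one byte value (0-255)."""
    letters = []
    for shift in (6, 4, 2, 0):  # two bits per letter
        letters.append(ALPHABET[(value >> shift) & 0b11])
    return "".join(letters)

def block_to_rna(block):
    """Convert a bytes object into one string, four letters per byte."""
    # bytearray() yields integers on both Python 2 and Python 3.
    return "".join(byte_to_rna(b) for b in bytearray(block))

print(byte_to_rna(0x05))  # AACC
print(byte_to_rna(0x2A))  # AGGG
```

Under these assumptions a 2048-byte block becomes a string of 8192 letters.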

def fileToRNAString(_file):

    if (_file and os.path.isfile(_file)):
        rnaSequences = []
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                decSequenceToRNA(blockCount, buf, rnaSequences)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

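Note: the function decSequenceToRNA takes the buffer that was read and converts each byte to the required string. After it runs, the function produces a tuple containing the block number and the string, e.g. (1, 'ACCGTAGATTA...'), so at the end I have an array of these tuples available.

I tried to convert the function to use Python's multiprocessing: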

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        workers = []
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences))
                p.start()
                workers.append(p)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        for p in workers:
            p.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

However, no process seems to start, because running this function returns an empty array. None of the messages that decSequenceToRNA prints to the console are displayed:

>>>fileToRNAString(testfile)
[!] Converting /root/src/amino56/M1H2.bin into RNA string (2048 bytes/block).

Unlike the question linked here, I am running and testing the functions in PyCrust on Python version 2.7.3. I am using the following packages:

import os
import re
import sys
import urllib2
import requests
import logging
import hashlib
import argparse
import tempfile
import shutil
import feedparser
from multiprocessing import Process

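I would like help figuring out why my code is not working, or whether I am missing something elsewhere that is needed to make the processes work. Suggestions for improving the code are also welcome. Below is decSequenceToRNA for reference: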

def decSequenceToRNA(_idxSeq, _byteSequence, _rnaSequences):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    _rnaSequences.append((_idxSeq, rnaSequence))

Answer 1

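decSequenceToRNA runs in its own process, which means it gets its own separate copy of every data structure in the main process. So when you append to _rnaSequences inside decSequenceToRNA, it has no effect on rnaSequences in the parent process. That would explain why an empty list is returned.

You have two options for fixing this. First, you can create a list that can be shared between processes using multiprocessing.Manager. For example: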

import multiprocessing

def f(shared_list):
    shared_list.append(1)

if __name__ == "__main__":
    normal_list = []
    p = multiprocessing.Process(target=f, args=(normal_list,))
    p.start()
    p.join()
    print(normal_list)

    m = multiprocessing.Manager()
    shared_list = m.list()
    p = multiprocessing.Process(target=f, args=(shared_list,))
    p.start()
    p.join()
    print(shared_list)

Output:

[]   # Normal list didn't work, the appended '1' didn't make it to the main process
[1]  # multiprocessing.Manager() list works fine

Applying this to your code only requires replacing

rnaSequences = []

with

m = multiprocessing.Manager()
rnaSequences = m.list()
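One thing to keep in mind with the shared list: the worker processes finish in no guaranteed order, so the tuples may not be appended in block order. A minimal sketch of restoring file order, assuming each entry is a (block index, string) tuple as in the question; the sample data is made up and a plain list stands in for the shared one:

```python
# Made-up results, in the order three worker processes happened to finish.
rna_sequences = [(2, "AGGG"), (0, "AACC"), (1, "AAAA")]

# Tuples compare element by element, so sorting orders them by block index.
ordered = sorted(rna_sequences)

# Join the per-block strings to get one sequence for the whole file.
full_sequence = "".join(seq for _, seq in ordered)
print(full_sequence)  # AACCAAAAAGGG
```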

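Alternatively, you can (and probably should) use a multiprocessing.Pool instead of creating a separate Process for every block. I'm not sure how large hFile is or how big the blocks you read are, but if there are more blocks than multiprocessing.cpu_count(), you will hurt performance by spawning a process for each block. With a Pool, you keep the process count constant and can easily build the rnaSequences list: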

def decSequenceToRNA(_idxSeq, _byteSequence):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    return _idxSeq, rnaSequence

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        results = []
        pool = multiprocessing.Pool()  # Creates a pool of cpu_count() processes
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                # apply_async takes the positional arguments as a single tuple
                result = pool.apply_async(decSequenceToRNA, (blockCount, buf))
                results.append(result)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        rnaSequences = [r.get() for r in results]
        pool.close()
        pool.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

Note that we no longer pass the rnaSequences list to the child. Instead, we simply return the result to the parent (which we could not do with Process) and build the list there.
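A possible variation on the same idea, shown only as a sketch: feed the blocks to Pool.map from a generator, which returns the results in input order. The names `read_blocks`, `convert_block` and `file_to_strings` are hypothetical, the hex conversion is just a placeholder for the real per-byte conversion, and the code is written for Python 3 rather than the 2.7.3 used in the question.

```python
import multiprocessing
import sys

BLOCK_SIZE = 2048  # same block size as in the question

def read_blocks(path, block_size=BLOCK_SIZE):
    """Yield (block_index, block_bytes) pairs until the file is exhausted."""
    with open(path, "rb") as handle:
        index = 0
        while True:
            buf = handle.read(block_size)
            if not buf:
                break
            yield index, buf
            index += 1

def convert_block(task):
    """Worker: take one (index, bytes) pair and return (index, string).

    Placeholder conversion (two hex digits per byte) standing in for the
    question's conversion helper, whose source is not shown.
    """
    index, buf = task
    return index, "".join("%02x" % b for b in bytearray(buf))

def file_to_strings(path):
    """Convert every block of the file using a pool of worker processes."""
    pool = multiprocessing.Pool()
    try:
        # map() returns the results in the same order as the input tasks.
        return pool.map(convert_block, read_blocks(path))
    finally:
        pool.close()
        pool.join()

# Only runs when a file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for index, text in file_to_strings(sys.argv[1]):
        print(index, len(text))
```

The worker takes a single tuple because Pool.map passes exactly one argument per task. Keeping the pool creation behind the `__main__` guard matters on platforms that start workers by re-importing the main module.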

Answer 2

Try writing it like this (with a comma at the end of the argument list):

p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences,))

Answer 1
Score 1, accepted, 2014-08-10 05:39:18

Answer 2
Score -1, 2014-08-10 05:35:36
