
Python/Multiprocessing: Processes do not seem to start

I have a function which reads a binary file and converts each byte into a corresponding sequence of characters. For example, 0x05 becomes 'AACC', 0x2A becomes 'AGGG', and so on. The function which reads the file and converts the bytes is currently a linear (sequential) one, and since the files to convert are anywhere between 25 KB and 2 MB, this can take quite a while.
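The two examples above are consistent with writing each byte as four base-4 digits and mapping the digits 0, 1 and 2 to A, C and G. A minimal sketch of such a mapping follows. The name `byte_to_rna` and the letter `U` for digit 3 are assumptions made for illustration only, since the actual `base10ToRNA` helper is not shown.

```python
# Hypothetical stand-in for the base10ToRNA helper, which is not shown here.
# Digits 0, 1, 2 -> A, C, G match the two examples; 'U' for digit 3 is an assumption.
ALPHABET = "ACGU"


def byte_to_rna(value):
    """Return the four-letter base-4 representation of a byte value (0-255)."""
    digits = []
    for shift in (6, 4, 2, 0):  # most significant base-4 digit first
        digits.append(ALPHABET[(value >> shift) & 0b11])
    return "".join(digits)


print(byte_to_rna(0x05))  # AACC
print(byte_to_rna(0x2A))  # AGGG
```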

Therefore, I'm trying to use multiprocessing to divide the task and hopefully improve speed. However, I just can't get it to work. Below is the linear function, which works, albeit slowly:

def fileToRNAString(_file):
    rnaSequences = []  # defined up front so the final return also works when the file is missing
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                decSequenceToRNA(blockCount, buf, rnaSequences)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

Note: The function 'decSequenceToRNA' takes the buffer read and converts each byte to the required string. Upon execution, it appends a tuple containing the block number and the string, e.g. (1, 'ACCGTAGATTA...'), to the list passed in, so at the end I have a list of these tuples available.

I've tried to convert the function to use Python's multiprocessing:

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        workers = []
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences))
                p.start()
                workers.append(p)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        for p in workers:
            p.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

However, no process seems to even start: when this function is run, an empty list is returned, and any message printed to the console in 'decSequenceToRNA' is not displayed:

>>>fileToRNAString(testfile)
[!] Converting /root/src/amino56/M1H2.bin into RNA string (2048 bytes/block).

Unlike this question here, I'm running Linux shiva 3.14-kali1-amd64 #1 SMP Debian 3.14.5-1kali1 (2014-06-07) x86_64 GNU/Linux and using PyCrust to test the functions on Python 2.7.3. I'm using the following packages:

import os
import re
import sys
import urllib2
import requests
import logging
import hashlib
import argparse
import tempfile
import shutil
import feedparser
from multiprocessing import Process

I'd like help figuring out why my code does not work, or whether I'm missing something elsewhere to make Process work. I'm also open to suggestions for improving the code. Below is 'decSequenceToRNA' for reference:

def decSequenceToRNA(_idxSeq, _byteSequence, _rnaSequences):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    _rnaSequences.append((_idxSeq, rnaSequence))

decSequenceToRNA is running in its own process, which means it gets its own, separate copy of every data structure in the main process. That means that when you append to _rnaSequences in decSequenceToRNA, it has no effect on rnaSequences in the parent process. That would explain why an empty list is being returned.

You have two options to address this. The first is to create a list that can be shared between processes using multiprocessing.Manager. For example:

import multiprocessing

def f(shared_list):
    shared_list.append(1)

if __name__ == "__main__":
    normal_list = []
    p = multiprocessing.Process(target=f, args=(normal_list,))
    p.start()
    p.join()
    print(normal_list)

    m = multiprocessing.Manager()
    shared_list = m.list()
    p = multiprocessing.Process(target=f, args=(shared_list,))
    p.start()
    p.join()
    print(shared_list)

Output:

[]   # Normal list didn't work, the appended '1' didn't make it to the main process
[1]  # multiprocessing.Manager() list works fine

Applying this to your code would just require replacing

rnaSequences = []

With

m = multiprocessing.Manager()
rnaSequences = m.list()
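Put together, the Manager approach might look like the sketch below. It works on in-memory blocks rather than a file to stay short. The names `byte_to_rna`, `dec_sequence_to_rna` and `blocks_to_rna` are illustrative, and `byte_to_rna` is only an assumed stand-in for `base10ToRNA`. Because the children finish in arbitrary order, the sketch sorts the collected tuples by block index before returning them.

```python
import multiprocessing


def byte_to_rna(value):
    # Hypothetical stand-in for base10ToRNA: base-4 digits mapped to A, C, G, U.
    return "".join("ACGU"[(value >> shift) & 3] for shift in (6, 4, 2, 0))


def dec_sequence_to_rna(idx, block, shared):
    # bytearray() yields integers on both Python 2 and Python 3.
    rna = "".join(byte_to_rna(b) for b in bytearray(block))
    shared.append((idx, rna))  # with a Manager list, this reaches the parent


def blocks_to_rna(blocks):
    manager = multiprocessing.Manager()
    shared = manager.list()
    workers = []
    for idx, block in enumerate(blocks):
        p = multiprocessing.Process(target=dec_sequence_to_rna,
                                    args=(idx, block, shared))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()
    # Copy the proxy into a plain list, ordered by block index.
    result = sorted(list(shared))
    manager.shutdown()
    return result


if __name__ == "__main__":
    print(blocks_to_rna([b"\x05", b"\x2a"]))
```

Note that this still starts one process per block, which is the cost the Pool approach avoids.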

Alternatively, you could (and probably should) use a multiprocessing.Pool instead of creating an individual Process for each chunk. I'm not sure how large hFile is or how big the chunks you're reading are, but if there are more than multiprocessing.cpu_count() chunks, you're going to hurt performance by spawning a process for every chunk. Using a Pool, you can keep your process count constant and easily create your rnaSequences list:

def decSequenceToRNA(_idxSeq, _byteSequence):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    return _idxSeq, rnaSequence

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        results = []
        pool = multiprocessing.Pool()  # Creates a pool of cpu_count() processes
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                # apply_async takes the positional arguments as a single tuple
                result = pool.apply_async(decSequenceToRNA, (blockCount, buf))
                results.append(result)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        rnaSequences = [r.get() for r in results]
        pool.close()
        pool.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

Note that we no longer pass the rnaSequences list to the child. Instead, we just return the result we would have appended back to the parent (which we can't do with Process), and build the list there.
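A more compact variant of the same idea is to feed the blocks to the pool from a generator with `Pool.imap`, which yields results in submission order. This is only a sketch: `byte_to_rna` is an assumed stand-in for `base10ToRNA`, and `convert_block`, `read_blocks` and `file_to_rna` are illustrative names, not part of the original code.

```python
import multiprocessing
import os
import tempfile


def byte_to_rna(value):
    # Hypothetical stand-in for base10ToRNA: base-4 digits mapped to A, C, G, U.
    return "".join("ACGU"[(value >> shift) & 3] for shift in (6, 4, 2, 0))


def convert_block(task):
    """Worker: takes one (index, bytes) pair and returns (index, rna_string)."""
    idx, block = task
    # bytearray() yields integers on both Python 2 and Python 3.
    return idx, "".join(byte_to_rna(b) for b in bytearray(block))


def read_blocks(path, block_size=2048):
    """Yield (index, bytes) pairs for consecutive blocks of the file."""
    with open(path, "rb") as handle:
        idx = 0
        while True:
            block = handle.read(block_size)
            if not block:
                break
            yield idx, block
            idx += 1


def file_to_rna(path, block_size=2048):
    pool = multiprocessing.Pool()  # cpu_count() worker processes
    try:
        # imap preserves the order in which the blocks were submitted.
        return list(pool.imap(convert_block, read_blocks(path, block_size)))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"\x05\x2a")
        print(file_to_rna(path, block_size=1))
    finally:
        os.remove(path)
```

With blocks as small as 2048 bytes, the cost of shipping each block to a worker may rival the conversion itself, so a larger block size or the `chunksize` argument of `imap` is worth measuring.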

Try writing this code (with a comma at the end of the argument list):

p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences,))
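As a side note on the tuple syntax involved: in Python a trailing comma is only significant for a one-element tuple, so with three arguments both spellings build equal tuples. A quick check, using placeholder values rather than the real arguments:

```python
# Placeholder values standing in for (blockCount, buf, rnaSequences).
args_without = (1, b"abc", [])
args_with = (1, b"abc", [],)
print(args_without == args_with)  # True

# The comma matters only when there is a single element.
print(type((1)).__name__)   # int
print(type((1,)).__name__)  # tuple
```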
