Python/Multiprocessing: Processes do not seem to start
I have a function which reads a binary file and converts each byte into a corresponding sequence of characters. For example, 0x05 becomes 'AACC', 0x2A becomes 'AGGG', and so on. The function that reads the file and converts the bytes currently runs sequentially, and since the files to convert are anywhere between 25 KB and 2 MB, this can take quite a while.
Therefore, I'm trying to use multiprocessing to divide the task and hopefully improve speed. However, I just can't get it to work. Below is the sequential function, which works, albeit slowly:
def fileToRNAString(_file):
    # Initialised before the check so the final return is valid on both paths.
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                decSequenceToRNA(blockCount, buf, rnaSequences)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences
Note: The function 'decSequenceToRNA' takes the buffer that was read and converts each byte to the required string. On each call, the function builds a tuple containing the block number and the string, e.g. (1, 'ACCGTAGATTA...'), and appends it to the list, so at the end I have a list of these tuples available.
I've tried to convert the function to use Python's multiprocessing:
def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        workers = []
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences))
                p.start()
                workers.append(p)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        for p in workers:
            p.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences
However, no process even seems to start: when this function is run, an empty list is returned. Any message printed to the console in 'decSequenceToRNA' is not displayed:
>>>fileToRNAString(testfile)
[!] Converting /root/src/amino56/M1H2.bin into RNA string (2048 bytes/block).
Unlike this question here, I'm running Linux shiva 3.14-kali1-amd64 #1 SMP Debian 3.14.5-1kali1 (2014-06-07) x86_64 GNU/Linux and using PyCrust to test the functions on Python version 2.7.3. I'm using the following packages:
import os
import re
import sys
import urllib2
import requests
import logging
import hashlib
import argparse
import tempfile
import shutil
import feedparser
from multiprocessing import Process
I'd like help figuring out why my code does not work, or whether I'm missing something elsewhere that is needed to make Process work. I'm also open to suggestions for improving the code. Below is 'decSequenceToRNA' for reference:
def decSequenceToRNA(_idxSeq, _byteSequence, _rnaSequences):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    _rnaSequences.append((_idxSeq, rnaSequence))
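The helper base10ToRNA is not shown in the question. The two examples given (0x05 becomes 'AACC', 0x2A becomes 'AGGG') are consistent with writing each byte as four base-4 digits, most significant first, with A=0, C=1 and G=2. The sketch below assumes that scheme and assumes T for the digit 3 (the sample output contains a T); the names NUCLEOTIDES, RNA_TABLE, byte_to_rna and bytes_to_rna are hypothetical and may not match the real code. It also builds the string with join and a lookup table, which is usually cheaper than repeated string concatenation in a loop.

```python
# Hypothetical stand-in for base10ToRNA, which the question does not show.
# Assumption: one byte -> four base-4 digits, most significant first,
# using the alphabet A=0, C=1, G=2, T=3.
NUCLEOTIDES = "ACGT"


def byte_to_rna(value):
    """Map one byte value (0-255) to a 4-letter string."""
    return "".join(NUCLEOTIDES[(value >> shift) & 0b11] for shift in (6, 4, 2, 0))


# Precompute all 256 mappings once so each byte is a single table lookup.
RNA_TABLE = [byte_to_rna(v) for v in range(256)]


def bytes_to_rna(data):
    """Convert a byte string to its nucleotide string."""
    # bytearray yields integers on both Python 2 and Python 3.
    return "".join(RNA_TABLE[b] for b in bytearray(data))


print(byte_to_rna(0x05))  # AACC
print(byte_to_rna(0x2A))  # AGGG
```

If the real mapping differs, only NUCLEOTIDES and byte_to_rna would need to change; the table and join pattern stay the same.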
decSequenceToRNA is running in its own process, which means it gets its own, separate copy of every data structure in the main process. That means that when you append to _rnaSequences in decSequenceToRNA, it has no effect on rnaSequences in the parent process. That would explain why an empty list is being returned.
You have two options to address this. The first is to create a list that can be shared between processes using multiprocessing.Manager. For example:
import multiprocessing

def f(shared_list):
    shared_list.append(1)

if __name__ == "__main__":
    normal_list = []
    p = multiprocessing.Process(target=f, args=(normal_list,))
    p.start()
    p.join()
    print(normal_list)

    m = multiprocessing.Manager()
    shared_list = m.list()
    p = multiprocessing.Process(target=f, args=(shared_list,))
    p.start()
    p.join()
    print(shared_list)
Output:
[] # Normal list didn't work, the appended '1' didn't make it to the main process
[1] # multiprocessing.Manager() list works fine
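Applying this to your code would just require replacing
rnaSequences = []
with the following (which also needs import multiprocessing at the top of the file, since the question only imports Process):
m = multiprocessing.Manager()
rnaSequences = m.list()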
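One thing to keep in mind with the shared list: workers finish in whatever order the scheduler allows, so the tuples are not guaranteed to be appended in block order. Below is a minimal sketch of the one-process-per-block pattern with a Manager list and a final sort on the block index. The names encode_block and convert_blocks are hypothetical, and the hex conversion is only a placeholder for the real decSequenceToRNA logic.

```python
import multiprocessing


def encode_block(index, block, shared_results):
    """Worker: convert one block and append (index, text) to the shared list."""
    # Placeholder conversion (hex text); it stands in for the real byte-to-RNA step.
    text = "".join("%02x" % b for b in bytearray(block))
    shared_results.append((index, text))


def convert_blocks(blocks):
    """Run one process per block and return the results ordered by block index."""
    manager = multiprocessing.Manager()
    shared_results = manager.list()
    workers = []
    for index, block in enumerate(blocks):
        p = multiprocessing.Process(target=encode_block,
                                    args=(index, block, shared_results))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()
    # Copy the proxy into a plain list, then sort: append order is arbitrary.
    ordered = sorted(list(shared_results))
    manager.shutdown()
    return ordered


if __name__ == "__main__":
    print(convert_blocks([b"\x05", b"\x2a", b"\xff"]))
    # [(0, '05'), (1, '2a'), (2, 'ff')]
```

Every append goes through the manager's server process, so this approach adds inter-process traffic on top of the cost of starting one process per block.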
Alternatively, you could (and probably should) use a multiprocessing.Pool instead of creating an individual Process for each chunk. I'm not sure how large hFile is or how big the chunks you're reading are, but if there are more than multiprocessing.cpu_count() chunks, you're going to hurt performance by spawning a process for every chunk. Using a Pool, you can keep your process count constant and easily create your rnaSequences list:
import multiprocessing

def decSequenceToRNA(_idxSeq, _byteSequence):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    # Return the result instead of appending to a list owned by the parent.
    return _idxSeq, rnaSequence

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        results = []
        pool = multiprocessing.Pool()  # Creates a pool of cpu_count() processes
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                # The worker's positional arguments are passed as a single tuple.
                result = pool.apply_async(decSequenceToRNA, (blockCount, buf))
                results.append(result)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        # get() blocks until each result is ready; the list keeps submission order.
        rnaSequences = [r.get() for r in results]
        pool.close()
        pool.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences
Note that we no longer pass the rnaSequences list to the child. Instead, we just return the result we would have appended back to the parent (which we can't do with Process), and build the list there.
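With the numbers from the question (files up to 2 MB read in 2048-byte blocks), there would be roughly 1,024 blocks, far more than the number of cores on a typical machine, which is the case where a fixed-size pool pays off. The sketch below is a self-contained variant of the Pool approach that can be run without the question's helpers. It uses Pool.imap, which returns results in submission order, with a chunksize so several blocks travel to a worker per round trip. The names encode_block, read_blocks and file_to_rna are hypothetical, and the base-4 ACGT mapping is an assumption inferred from the question's two examples, not the real base10ToRNA.

```python
import multiprocessing
import os
import tempfile

NUCLEOTIDES = "ACGT"  # assumed alphabet; the real base10ToRNA is not shown


def encode_block(task):
    """Convert one (index, block) pair into (index, nucleotide string)."""
    index, block = task
    letters = []
    for b in bytearray(block):  # integers on both Python 2 and Python 3
        letters.append("".join(NUCLEOTIDES[(b >> s) & 3] for s in (6, 4, 2, 0)))
    return index, "".join(letters)


def read_blocks(path, block_size=2048):
    """Yield (index, block) pairs read sequentially from a binary file."""
    with open(path, "rb") as handle:
        index = 0
        while True:
            block = handle.read(block_size)
            if not block:
                break
            yield index, block
            index += 1


def file_to_rna(path, block_size=2048):
    """Encode a file block by block on a pool of worker processes."""
    pool = multiprocessing.Pool()  # defaults to cpu_count() workers
    try:
        # imap preserves input order; chunksize batches blocks per worker call.
        return list(pool.imap(encode_block, read_blocks(path, block_size), chunksize=16))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.write(fd, b"\x05\x2a" * 3000)  # 6000 bytes of sample data
    os.close(fd)
    try:
        blocks = file_to_rna(path)
        print("%d blocks, first one starts with %s" % (len(blocks), blocks[0][1][:8]))
    finally:
        os.remove(path)
```

Because the results come back in order, the final string can be assembled with "".join(text for _, text in blocks) without sorting. For inputs as small as these, it is worth timing the pooled version against the sequential one, since process startup and data transfer can outweigh the conversion work.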
Try writing it like this (note the comma at the end of the argument list):
p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences,))
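For context on this suggestion, a trailing comma changes the meaning of a parenthesised expression only when there is a single element; with three arguments the tuple is the same either way, so on its own this change would not be expected to alter what the child process receives. A quick check with illustrative values only:

```python
# A trailing comma makes no difference for a multi-element tuple.
with_comma = (1, b"data", "x",)
without_comma = (1, b"data", "x")
print(with_comma == without_comma)  # True
print(len(with_comma))              # 3

# It matters only in the single-element case.
single = (1,)      # a one-element tuple
not_a_tuple = (1)  # just the integer 1 in parentheses
print(type(single).__name__)        # tuple
print(type(not_a_tuple).__name__)   # int
```

The single-element case is the one to watch for with Process, as in args=(normal_list,) in the Manager example above.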