I have a function which reads a binary file and converts each byte into a corresponding sequence of characters. For example, 0x05 becomes 'AACC', 0x2A becomes 'AGGG' etc...The function which reads the file and converts the bytes is currently a linear one and since the files to convert are anywhere between 25kb and 2Mb, this can take quite a while.
Therefore, I'm trying to use multiprocessing to divide the task and hopefully improve speed. However, I just can't get it to work. Below is the linear function, which works, albeit slowly;
def fileToRNAString(_file):
if (_file and os.path.isfile(_file)):
rnaSequences = []
blockCount = 0
blockSize = 2048
printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
with open(_file, "rb") as hFile:
buf = hFile.read(blockSize)
while buf:
decSequenceToRNA(blockCount, buf, rnaSequences)
blockCount = blockCount + 1
buf = hFile.read(blockSize)
else:
printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
return rnaSequences
Note: The function ' decSequenceToRNA ' takes the buffer read and converts each byte to the required string. Upon execution, the function returns a tuple which contain the block number and the string, eg (1, 'ACCGTAGATTA...') and at the end, I have an array of these tuples available.
I've tried to convert the function to use the multiprocessing of Python;
def fileToRNAString(_file):
rnaSequences = []
if (_file and os.path.isfile(_file)):
blockCount = 0
blockSize = 2048
printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
workers = []
with open(_file, "rb") as hFile:
buf = hFile.read(blockSize)
while buf:
p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences))
p.start()
workers.append(p)
blockCount = blockCount + 1
buf = hFile.read(blockSize)
for p in workers:
p.join()
else:
printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
return rnaSequences
However, no processes seems to even start, as when this function is ran, an empty array is returned. Any message printed to the console in ' decSequenceToRNA ' is not displayed;
>>>fileToRNAString(testfile)
[!] Converting /root/src/amino56/M1H2.bin into RNA string (2048 bytes/block).
Unlike this question here, I'm running Linux shiva 3.14-kali1-amd64 #1 SMP Debian 3.14.5-1kali1 (2014-06-07) x86_64 GNU/Linux and using PyCrust to test the functions on Python Version: 2.7.3. I'm using the following packages:
import os
import re
import sys
import urllib2
import requests
import logging
import hashlib
import argparse
import tempfile
import shutil
import feedparser
from multiprocessing import Process
I'd like help to figure out why my code does not work, of if I'm missing something elsewhere to make the Process works. Also open to suggestions for improving the code. Below is ' decSequenceToRNA ' for reference:
def decSequenceToRNA(_idxSeq, _byteSequence, _rnaSequences):
rnaSequence = ''
printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
for b in _byteSequence:
rnaSequence = rnaSequence + base10ToRNA(ord(b))
printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
_rnaSequences.append((_idxSeq, rnaSequence))
decSequenceToRNA
is running in its own process, which means it gets its own, separate copy of every data structure in the main process. That means that when you append to _rnaSequences
in decSequenceToRNA
, it's has no effect on rnaSequences
in the parent process. That would explain why an empty list is being returned.
You have two options to address this. First, is to create a list
that can be shared between processes using multiprocessing.Manager
. For example:
import multiprocessing
def f(shared_list):
shared_list.append(1)
if __name__ == "__main__":
normal_list = []
p = multiprocessing.Process(target=f, args=(normal_list,))
p.start()
p.join()
print(normal_list)
m = multiprocessing.Manager()
shared_list = m.list()
p = multiprocessing.Process(target=f, args=(shared_list,))
p.start()
p.join()
print(shared_list)
Output:
[] # Normal list didn't work, the appended '1' didn't make it to the main process
[1] # multiprocessing.Manager() list works fine
Applying this to your code would just require replacing
rnaSequences = []
With
m = multiprocessing.Manager()
rnaSequences = m.list()
Alternatively, you could (and probably should) use a multiprocessing.Pool
instead of creating individual Process
for each chunk. I'm not sure how large hFile
is or how big the chunks you're reading are, but if there are more than multiprocessing.cpu_count()
chunks, you're going to hurt performance by spawning processes for every chunk. Using a Pool
, you can keep your process count constant, and easily create your rnaSequence
list:
def decSequenceToRNA(_idxSeq, _byteSequence):
rnaSequence = ''
printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
for b in _byteSequence:
rnaSequence = rnaSequence + base10ToRNA(ord(b))
printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
return _idxSeq, rnaSequence
def fileToRNAString(_file):
rnaSequences = []
if (_file and os.path.isfile(_file)):
blockCount = 0
blockSize = 2048
printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
results = []
p = multiprocessing.Pool() # Creates a pool of cpu_count() processes
with open(_file, "rb") as hFile:
buf = hFile.read(blockSize)
while buf:
result = pool.apply_async(decSequenceToRNA, blockCount, buf)
results.append(result)
blockCount = blockCount + 1
buf = hFile.read(blockSize)
rnaSequences = [r.get() for r in results]
pool.close()
pool.join()
else:
printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
return rnaSequences
Note that we no longer pass the rnaSequences
list to the child. Instead, we just return the result we would have appened back to the parent (which we can't do with Process
), and build the list there.
尝试编写此代码(参数列表末尾的逗号)
p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences,))
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.