
Python/Multiprocessing: Processes do not seem to start

I have a function which reads a binary file and converts each byte into a corresponding sequence of characters. For example, 0x05 becomes 'AACC', 0x2A becomes 'AGGG', etc. The function which reads the file and converts the bytes is currently linear, and since the files to convert are anywhere between 25 KB and 2 MB, this can take quite a while.
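
To make the mapping concrete, a conversion helper such as base10ToRNA (used further below) could be written along these lines. This is only an illustrative sketch: the two-bits-per-character mapping 00→'A', 01→'C', 10→'G', 11→'T' is inferred from the examples above, and the real helper may differ.

# Sketch only: mapping 00->'A', 01->'C', 10->'G', 11->'T' assumed from the examples
RNA_MAP = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}

def base10ToRNA(_value):
    # Convert one byte (0-255) into four characters, most significant 2-bit pair first,
    # e.g. 0x05 -> 'AACC', 0x2A -> 'AGGG'.
    return ''.join(RNA_MAP[(_value >> shift) & 0x03] for shift in (6, 4, 2, 0))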

Therefore, I'm trying to use multiprocessing to divide the task and hopefully improve speed. However, I just can't get it to work. Below is the linear function, which works, albeit slowly:

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                decSequenceToRNA(blockCount, buf, rnaSequences)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

Note: the function decSequenceToRNA takes the buffer read and converts each byte into the required string. It appends a tuple containing the block number and the generated string, e.g. (1, 'ACCGTAGATTA...'), to the list, so at the end I have an array of these tuples available.

I've tried to convert the function to use Python's multiprocessing module:

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        workers = []
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences))
                p.start()
                workers.append(p)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        for p in workers:
            p.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

However, no processes seem to even start: when this function is run, an empty array is returned, and any messages printed to the console in decSequenceToRNA are not displayed:

>>>fileToRNAString(testfile)
[!] Converting /root/src/amino56/M1H2.bin into RNA string (2048 bytes/block).

Unlike this question here, I'm running Linux shiva 3.14-kali1-amd64 #1 SMP Debian 3.14.5-1kali1 (2014-06-07) x86_64 GNU/Linux, and I'm using PyCrust to test the functions on Python 2.7.3. I'm using the following packages:

import os
import re
import sys
import urllib2
import requests
import logging
import hashlib
import argparse
import tempfile
import shutil
import feedparser
from multiprocessing import Process

I'd like help figuring out why my code does not work, or whether I'm missing something else to make the Process work. I'm also open to suggestions for improving the code. Below is decSequenceToRNA for reference:

def decSequenceToRNA(_idxSeq, _byteSequence, _rnaSequences):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    _rnaSequences.append((_idxSeq, rnaSequence))

decSequenceToRNA is running in its own process, which means it gets its own, separate copy of every data structure in the main process. That means that when you append to _rnaSequences in decSequenceToRNA, it has no effect on rnaSequences in the parent process, which explains why an empty list is being returned.

You have two options to address this. The first is to create a list that can be shared between processes using a multiprocessing.Manager. For example:

import multiprocessing

def f(shared_list):
    shared_list.append(1)

if __name__ == "__main__":
    normal_list = []
    p = multiprocessing.Process(target=f, args=(normal_list,))
    p.start()
    p.join()
    print(normal_list)

    m = multiprocessing.Manager()
    shared_list = m.list()
    p = multiprocessing.Process(target=f, args=(shared_list,))
    p.start()
    p.join()
    print(shared_list)

Output:

[]   # Normal list didn't work, the appended '1' didn't make it to the main process
[1]  # multiprocessing.Manager() list works fine

Applying this to your code would just require replacing

rnaSequences = []

with

m = multiprocessing.Manager()
rnaSequences = m.list()
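
Putting that together, a sketch of fileToRNAString with only those two lines changed (plus copying the proxy list into a plain list before returning, since the Manager is shut down once it is garbage-collected) would look like this:

def fileToRNAString(_file):
    m = multiprocessing.Manager()
    rnaSequences = m.list()          # proxy list the child processes can append to
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        workers = []
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences))
                p.start()
                workers.append(p)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        for p in workers:
            p.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return list(rnaSequences)        # plain list copy; outlives the Manager

Keep in mind that the workers can finish in any order, so the tuples may not be appended in block order; each tuple carries its block index, so you can sort the result afterwards if ordering matters.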

Alternatively, you could (and probably should) use a multiprocessing.Pool instead of creating an individual Process for each chunk. I'm not sure how large hFile is or how big the chunks you're reading are, but if there are more chunks than multiprocessing.cpu_count(), you're going to hurt performance by spawning a process for every chunk. Using a Pool, you can keep your process count constant and easily build your rnaSequences list:

def decSequenceToRNA(_idxSeq, _byteSequence):
    rnaSequence = ''
    printAndLog("!", "Processing block %d (%d bytes)" % (_idxSeq, len(_byteSequence)))
    for b in _byteSequence:
        rnaSequence = rnaSequence + base10ToRNA(ord(b))
    printAndLog("+", "Block %d completed. RNA of %d nucleotides generated." % (_idxSeq, len(rnaSequence)))
    return _idxSeq, rnaSequence

def fileToRNAString(_file):
    rnaSequences = []
    if (_file and os.path.isfile(_file)):
        blockCount = 0
        blockSize = 2048
        printAndLog("!", "Converting %s into RNA string (%d bytes/block)" % (_file, blockSize))
        results = []
        pool = multiprocessing.Pool()  # Creates a pool of cpu_count() processes
        with open(_file, "rb") as hFile:
            buf = hFile.read(blockSize)
            while buf:
                result = pool.apply_async(decSequenceToRNA, (blockCount, buf))
                results.append(result)
                blockCount = blockCount + 1
                buf = hFile.read(blockSize)
        rnaSequences = [r.get() for r in results]
        pool.close()
        pool.join()
    else:
        printAndLog("-", "Could not find the specified file. Please verify that the file exists:" + _file)
    return rnaSequences

Note that we no longer pass the rnaSequences list to the child. Instead, we just return the result we would have appended back to the parent (which we can't do with a plain Process), and build the list there.
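
If the end goal is a single RNA string for the whole file (an assumption on my part), the returned list of (blockIndex, sequence) tuples is easy to reassemble, since the AsyncResult objects were collected in submission order:

rnaSequences = fileToRNAString(testfile)
# Already ordered by block index because results were appended in submission order;
# sorting is cheap insurance in case that ever changes.
fullRNA = ''.join(seq for _idx, seq in sorted(rnaSequences))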

Try writing the code like this (with a comma at the end of the argument list):

p = Process(target=decSequenceToRNA, args=(blockCount, buf, rnaSequences,))
