Multiprocessing with external programs - speed of execution

I need to run a relatively slow external program a couple million times on my data. This external program is called RNAup, a program for determining the binding energy between two RNAs. In many cases, it takes ten to fifteen minutes per RNA-RNA pair. This is far too slow to run sequentially on a couple million rows of data, so I've decided to speed up the process by running the program in parallel as much as possible. However, it's still far too slow. Below is how I've parallelized its use:

import subprocess
import multiprocessing as mult
import uuid

def energy(seq, name):
    for item in seq:
        item.append([]) # adding a new list to house the energy information

        # write the pair out and close the file so it is flushed to disk
        # before RNAup reads it; this is what the original "reopen the
        # file before calling RNAup" hack was working around
        with open("stdin" + name + ".in", "w") as stdin:
            stdin.write(item[0]) # the sequence text (assumed to be the item's first element), not the whole list

        subprocess.call("RNAup < stdin" + name + ".in > stdout" + name + ".out", shell=True) # RNAup call slightly modified for brevity and clarity of understanding

        with open("stdout" + name + ".out", "r") as stdout:
            for line in stdout:
                item[-1].append(line)
    return seq

def intermediate(seq):
    name = str(uuid.uuid4()) # give each worker's on-disk files a unique ID so as to not have to bother with mutexes or any kind of name collisions
    return energy(seq, name) # return the result so Pool.map can collect it

PROCESS_COUNT = mult.cpu_count() * 20 # 4 CPUs, so 80 processes running at any given time
results = mult.Pool(processes=PROCESS_COUNT).map(intermediate, list_nucleotide_seqs)

How can I dramatically improve the speed of my program? (I would, by the way, accept answers involving transferring part, most, or all of the program to C.) Right now, it would take half a year to get through all my data, which is simply unacceptable, and I need some way of making my program faster.

There's not much you can do here if RNAup really takes 10-15 minutes for every line in your input file. That is the bottleneck, not the Python code. Spreading the work of RNAup across all available cores is the best you can do to speed things up with only one machine, and at best that will make you 4x faster (assuming 4 CPU cores). But if you've got 1 million pairs, you're still looking at 10 minutes x 250,000 rounds of runs, about 2.5 million minutes (nearly five years) of wall-clock time. Assuming you can't make RNAup itself faster, you'll need to distribute this work across many machines, using Celery or some other distributed framework.
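To get the most out of one machine first, size the pool to the real core count and feed RNAup through pipes instead of temporary files. Here is a minimal sketch, not a drop-in replacement: list_nucleotide_seqs is a stand-in for the question's data, and the bare RNAup invocation assumes the same defaults as the question's shell command.

import multiprocessing as mult
import subprocess

def rna_energy(pair_text):
    # Run one RNAup job, feeding the pair over stdin and capturing
    # stdout directly -- no temporary files and no shell redirection.
    proc = subprocess.run(["RNAup"], input=pair_text,
                          capture_output=True, text=True)
    return proc.stdout.splitlines()

if __name__ == "__main__":
    list_nucleotide_seqs = ["..."]  # placeholder; the real pairs come from your data
    # RNAup is CPU-bound, so running more workers than cores only adds
    # scheduling overhead; one process per core is the practical ceiling.
    with mult.Pool(processes=mult.cpu_count()) as pool:
        results = pool.map(rna_energy, list_nucleotide_seqs)

The same rna_energy function is also a natural unit of work for Celery: register it with @app.task, point every machine's workers at a shared broker, and the pairs get pulled from a queue instead of a local list.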
