
Running program on multiple cores

I am running a program in Python, using threading to parallelise the task. The task is simple string matching: I am matching a large number of short strings against a database of long strings. When I tried to parallelise it, I decided to split the list of short strings into a number of sublists equal to the number of cores and run each sublist separately, on a different core. However, when I run the task on 5 or 10 cores, it is about twice as slow as on a single core. What could the reason for that be, and how can I fix it?

Edit: my code can be seen below

import sys
import csv
import re
import threading
from Queue import Queue  # Python 2 module name; "queue" in Python 3
from threading import Lock


q_in = Queue()
q_out = Queue()
lock = Lock()

def ceil(nu):
    # integer ceiling of a float
    if int(nu) == nu:
        return int(nu)
    else:
        return int(nu) + 1

def opencsv(csvv):
    # read the "Peptide" column, stripping modification annotations
    # such as "(+15.99)" from each sequence
    with open(csvv) as csvfile:
        peptides = []
        reader = csv.DictReader(csvfile)
        for row in reader:
            pept = str(row["Peptide"])
            pept = re.sub(r"\((\+\d+\.\d+)\)", "", pept)
            peptides.append(pept)
        return peptides

def openfasta(fast):
    # build a {header: concatenated sequence} dict from a FASTA file
    with open(fast, "r") as fastafile:
        dic = {}
        for line in fastafile:
            l = line.strip()
            if not l:
                continue  # skip blank lines
            if l.startswith(">"):
                cur = l
                dic[l] = ""
            else:
                dic[cur] = dic[cur] + l
        return dic

def match(text, pattern):
    # find every occurrence of pattern in text, allowing at most one
    # mismatched character per occurrence
    text = list(text.upper())
    pattern = list(pattern.upper())
    ans = []
    cur = 0  # current position in pattern
    mis = 0  # mismatches in the current attempt
    i = 0    # current position in text
    while i < len(text):
        if text[i] != pattern[cur]:
            mis += 1
            if mis > 1:
                # too many mismatches: restart the pattern at this position
                mis = 0
                cur = 0
                continue
        cur = cur + 1
        i = i + 1
        if cur == len(pattern):
            ans.append(i - len(pattern))  # start index of the match
            cur = 0
            mis = 0
    return ans

def job(pepts, outfile, genes):
    # match every peptide in this sublist against every gene sequence and
    # keep a peptide only if it matches exactly once in the whole database
    towrite = []
    for i in pepts:
        c = 0  # matches seen so far for this peptide
        found = 0
        for j in genes:
            m = match(genes[j], i)
            if len(m) > 0:
                found = 1
                remb = m[0]  # position of the first match
                wh = j       # header of the matching gene
                c = c + len(m)
                if c > 1:
                    # more than one match: discard this peptide
                    found = 0
                    break
        if found == 1:
            towrite.append("\t".join([i, str(remb), str(wh)]) + "\n")
    return towrite


def worker(outfile, genes):
    s = q_in.qsize()  # total number of sublists, for the progress display
    while True:
        item = q_in.get()
        if item is None:
            break  # poison pill: kill this thread
        print "\r{0:.2f}%".format(100.0 * (1.0 - float(q_in.qsize()) / float(s)))
        pepts = item
        q_out.put(job(pepts, outfile, genes))
        q_in.task_done()

def main(args):
    num_worker_threads = int(args[4])

    pept = opencsv(args[1])
    l = len(pept)
    howman = num_worker_threads
    # chunk size chosen so the peptides are split into roughly
    # howman * 100 sublists, letting the workers balance the load
    ll = ceil(float(l) / float(howman * 100))
    remain = pept
    pepties = []
    while len(remain) > 0:
        pepties.append(remain[0:ll])
        remain = remain[ll:]
    for i in pepties:
        print len(i)
    print l

    print "Csv file loaded..."
    genes = openfasta(args[2])
    out = args[3]
    print "Fasta file loaded..."

    threads = []

    with open(out, "w") as outfile:
        for pepts in pepties:
            q_in.put(pepts)

        for i in range(num_worker_threads):
            t = threading.Thread(target=worker, args=(outfile, genes, ))
            t.start()
            threads.append(t)

        q_in.join()  # wait until every queued sublist has been processed

        # stop the workers: one poison pill per thread
        for _ in range(num_worker_threads):
            q_in.put(None)
        for t in threads:
            t.join()

        # drain the results queue and write everything to the output file
        while not q_out.empty():
            outfile.writelines(q_out.get())

    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))

The important part of the code is within the job function, where short sequences in pepts get matched to long sequences in genes.

This is most likely because of the GIL (Global Interpreter Lock) in CPython.

In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once.

David Beazley's presentation at PyCon 2010 gives a detailed explanation of the GIL. On slides 32 to 34 he explains why the same multi-threaded code (doing CPU-bound computation) can perform worse when running on multiple cores than on a single core:

(with single core) Threads alternate execution, but switch far less frequently than you might imagine

With multiple cores, runnable threads get scheduled simultaneously (on different cores) and battle over the GIL

David's experiment results visualize "how thread switching gets more rapid as the number of CPUs increases".
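
You can reproduce the effect with a small benchmark in the spirit of the countdown example from that talk; the exact timings depend on your machine and CPython version, but on a multi-core box the threaded run is typically the slower one:

import time
import threading

def countdown(n):
    # pure CPU-bound loop: it holds the GIL for its entire run
    while n > 0:
        n -= 1

N = 10000000

# sequential baseline: the whole workload in one thread
start = time.time()
countdown(N)
countdown(N)
print("sequential: {0:.2f}s".format(time.time() - start))

# the same total workload split across two threads fighting over the GIL
t1 = threading.Thread(target=countdown, args=(N,))
t2 = threading.Thread(target=countdown, args=(N,))
start = time.time()
t1.start(); t2.start()
t1.join(); t2.join()
print("threaded:   {0:.2f}s".format(time.time() - start))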

Even though your job function contains some I/O, its three levels of nested loops (two in job and one in match) make it essentially CPU-bound computation.

Changing your code to multiprocessing will help you utilize multiple cores and may improve performance. However, how much you gain depends on the amount of computation: the benefit from parallelizing it has to outweigh the overhead that multiprocessing introduces, such as inter-process communication. A minimal sketch follows.
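
As a rough sketch rather than a drop-in replacement for your script, the per-sublist work can be handed to a multiprocessing.Pool. Here run_job is a placeholder standing in for your job function, and the toy pepties and genes values stand in for the real data loaded by opencsv and openfasta:

import multiprocessing

def run_job(args):
    # placeholder for the question's job(): takes one sublist of peptides
    # plus the genes dict and returns the lines to write
    pepts, genes = args
    return ["\t".join([p, "0", "some_gene"]) + "\n" for p in pepts]

if __name__ == "__main__":
    pepties = [["PEPTIDE"], ["SEQUENCE"]]    # toy sublists
    genes = {">g1": "ACDEFGHIKLMNPQRSTVWY"}  # toy database
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    # each (sublist, genes) pair is pickled and sent to a worker process;
    # this is the inter-process communication overhead mentioned above
    results = pool.map(run_job, [(p, genes) for p in pepties])
    pool.close()
    pool.join()
    with open("out.tsv", "w") as outfile:
        for lines in results:
            outfile.writelines(lines)

Note that genes is serialized once per task here; if the database is large, that cost adds up. On Unix, one way to avoid it is to make genes a module-level global that is populated before the Pool is created, so the forked workers inherit it instead of receiving it through a pipe.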
