
Multiprocessing slower with more processes

I have a program written in Python that reads 4 input text files and writes all of their contents into a list called ListOutput, which is shared memory between the 4 processes used in my program (I used 4 processes so my program runs faster!).

I also have a shared memory variable called processedFiles, which stores the names of the input files that any of the processes has already read, so the current process does not read them again (I used a lock so processes do not check for the existence of a file in processedFiles at the same time).

When I use only one process, my program runs faster (7 milliseconds), even though my computer has 8 cores. Why is this?

import glob
from multiprocessing import Process, Manager, Lock
import timeit
import os

os.chdir("files")

# Define a function for the processes
def print_content(ProcessName, processedFiles, ListOutput, lock):
   for file in glob.glob("*.txt"):
      newfile = 0

      lock.acquire()

      print("\n Current Process:", ProcessName)

      if file not in processedFiles:
         print("\n", file, " not in ", processedFiles, " for ", ProcessName)
         processedFiles.append(file)
         newfile = 1  # it is a new file

      lock.release()

      # if it is a new file
      if newfile == 1:
         with open(file, "r") as f:
            lines = f.readlines()
         ListOutput.append(lines)

         # print("%s: %s" % (ProcessName, time.ctime(time.time())))

# Create processes as follows
try:
   manager = Manager()
   processedFiles = manager.list()
   ListOutput = manager.list()
   start = timeit.default_timer()

   lock = Lock()
   p1 = Process(target=print_content, args=("Process-1", processedFiles, ListOutput, lock))
   p2 = Process(target=print_content, args=("Process-2", processedFiles, ListOutput, lock))
   p3 = Process(target=print_content, args=("Process-3", processedFiles, ListOutput, lock))
   p4 = Process(target=print_content, args=("Process-4", processedFiles, ListOutput, lock))

   p1.start()
   p2.start()
   p3.start()
   p4.start()

   p1.join()
   p2.join()
   p3.join()
   p4.join()

   print("ListOutput", ListOutput)
   stop = timeit.default_timer()
   print(stop - start)
except:
   print("Error: unable to start process")

The problem is that what looks like multiprocessing often isn't: just using more cores doesn't mean more work gets done in parallel.

The most glaring problem is that you synchronize everything. Selecting files is sequential because you take a lock around it, so there is zero gain there. And while you do read in parallel, every file's contents are appended to a shared data structure, which internally synchronizes itself. So the only gain you can potentially get is from reading in parallel, and depending on your storage medium, e.g. an HDD instead of an SSD, several concurrent readers can actually be slower in total than a single one.
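For contrast, here is a minimal sketch (mine, not the original code) of how the same work can be split without any shared lock: the file list is partitioned across the workers up front, so no process ever needs to check what the others have done. The files directory and the 4-worker count are taken from the question; everything else is illustrative.

import glob
import os
from multiprocessing import Pool

def read_file(path):
    # Each file is handled by exactly one worker: no shared state, no lock.
    with open(path, "r") as f:
        return f.readlines()

if __name__ == "__main__":
    # Same "files" directory as in the question
    files = glob.glob(os.path.join("files", "*.txt"))
    with Pool(processes=4) as pool:
        # map() hands each file to one worker up front; results come back
        # to the parent, so no manager-backed shared list is needed.
        ListOutput = pool.map(read_file, files)
    print("ListOutput", ListOutput)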

On top of that comes the overhead of managing all those processes. Each one needs to be started, each one needs to be passed its input, and each one must communicate with the others, which here happens on practically every action. And don't be fooled: a Manager is nifty, but heavyweight.
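To get a feel for that weight, here is a small illustrative micro-benchmark (my sketch, not part of the answer): every operation on a manager.list() is an inter-process round trip to the manager's server process, while a plain list append stays in-process.

import timeit
from multiprocessing import Manager

if __name__ == "__main__":
    plain = []
    managed = Manager().list()

    # A plain append runs in-process; the managed append is a round trip
    # to the manager's server process, typically orders of magnitude slower.
    print("plain list:  ", timeit.timeit(lambda: plain.append(1), number=10000))
    print("managed list:", timeit.timeit(lambda: managed.append(1), number=10000))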

So aside from gaining little, you add extra cost. Since you start out with a very small runtime of just 7 ms, that extra cost can be quite significant.

In general, multiprocessing is only worth it if you are CPU-bound, that is, if your CPU utilization is close to 100% and there is more work than can be done. Generally, this happens when you do lots of computation. Conversely, doing mostly I/O is a good indicator that you are not CPU-bound.
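As a hedged illustration of what a CPU-bound workload looks like (busy below is a made-up stand-in for real computation), pure number crunching with no I/O and no shared state is where extra processes actually pay off:

import timeit
from multiprocessing import Pool

def busy(n):
    # Pure computation: the kind of work that keeps a core at 100%.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    tasks = [2_000_000] * 8

    start = timeit.default_timer()
    sequential = [busy(n) for n in tasks]
    print("sequential:", timeit.default_timer() - start)

    start = timeit.default_timer()
    with Pool(processes=4) as pool:
        parallel = pool.map(busy, tasks)
    print("parallel:  ", timeit.default_timer() - start)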

Just to add to the existing answer, there are certain cases where using multiprocessing really adds value and saves time:

  1. Your program does N tasks which are independent of each other.
  2. Your program does extensive, heavy mathematical calculations.
  3. As a caveat to the second point, the computation time must be significantly large. Otherwise, the cost of creating new processes will overshadow the advantage of multiprocessing, and your parallel program will run slower than the sequential version.
  4. As a rule of thumb, if your program does mostly file I/O or network I/O operations, don't parallelize it unless you have a very strong reason to do so.
  5. To add to the fourth point, if your requirements demand it, consider creating a multiprocessing.Pool, which performs tasks for you as needed and eliminates the overhead of creating and destroying a process every time (see the sketch after this list).
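
A minimal sketch of that last point (work is a hypothetical placeholder task): the pool's worker processes are created once and then reused for every batch of tasks.

from multiprocessing import Pool

def work(item):
    # Hypothetical placeholder; substitute your real computation here.
    return item * item

if __name__ == "__main__":
    # The four workers are spawned once and reused for every batch,
    # instead of paying process startup cost per task.
    with Pool(processes=4) as pool:
        for batch in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
            print(pool.map(work, batch))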

Hope it helps.
