
Python multiprocessing with single worker faster than sequential operation

A brief overview: I wrote some files containing lots of random numbers to disk to test the performance of Python multiprocessing against sequential operation.

Function descriptions

putfiles: writes test files to the drive

readFile: reads the file at the passed location and writes the result (the sum of the numbers in the file) to an answer file

getSequential: reads the files with a for loop

getParallel: reads the files with a pool of spawned processes

Performance results (reading and processing 100 files, sequentially and with a process pool):

timeit getSequential(numFiles=100) - around 2.85s best

timeit getParallel(numFiles=100, numProcesses=4) - around 960ms best

timeit getParallel(numFiles=100, numProcesses=1) - around 980ms best

Surprisingly, the single-process pool performs better than the sequential version and on par with the 4-process pool. Is this behavior expected, or am I doing something wrong here?

import os
import random
from multiprocessing import Pool

os.chdir('/Users/test/Desktop/filewritetest')

def putfiles(numFiles=5, numCount=100):
    #numFiles = int(input("how many files?: "))
    #numCount = int(input('How many random numbers?: '))
    for num in range(numFiles):
        with open('r' + str(num) + '.txt', 'w') as f:
            f.write("\n".join([str(random.randint(1, 100)) for i in range(numCount)]))

def readFile(fileurl):
    with open(fileurl, 'r') as f, open("ans_" + fileurl, 'w') as fw:
        fw.write(str((sum([int(i) for i in f.read().split()]))))

def getSequential(numFiles=5):
    #in1 = int(input("how many files?: "))
    for num in range(numFiles):
        readFile('r' + str(num) + '.txt')


def getParallel(numFiles=5, numProcesses=2):
    #numFiles = int(input("how many files?: ")) 
    #numProcesses = int(input('How many processes?: '))
    with Pool(numProcesses) as p:
        p.map(readFile, ['r' + str(num) + '.txt' for num in range(numFiles)])


#putfiles()

putfiles(numFiles=1000, numCount=100000)

timeit getSequential(numFiles=100)
##around 2.85s best

timeit getParallel(numFiles=100, numProcesses=1)
##around 980ms best
timeit getParallel(numFiles=100, numProcesses=4)
##around 960ms best
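
(Note: the bare timeit calls above are IPython's %timeit magic and only run in an IPython console such as Spyder's. A minimal plain-Python sketch of the same measurement, using the standard timeit module with the functions defined above, and guarded with if __name__ == '__main__' since getParallel spawns processes:)

import timeit

if __name__ == '__main__':
    # Time one full pass (number=1), since each run reads and rewrites 100 files.
    print(timeit.timeit(lambda: getSequential(numFiles=100), number=1))
    print(timeit.timeit(lambda: getParallel(numFiles=100, numProcesses=4), number=1))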

Update: in a new Spyder session, I don't see this issue. Updated runtimes below:

##100 files
#around 2.97s best
timeit getSequential(numFiles=100)

#around 2.99s best
timeit getParallel(numFiles=100, numProcesses=1)

#around 1.57s best
timeit getParallel(numFiles=100, numProcesses=2)

#around 942ms best
timeit getParallel(numFiles=100, numProcesses=4)

##1000 files
#around 29.3s best
timeit getSequential(numFiles=1000)

#around 11.8s best
timeit getParallel(numFiles=1000, numProcesses=4)

#around 9.6s best
timeit getParallel(numFiles=1000, numProcesses=16)

#around 9.65s best  # let Pool choose its default worker count
timeit getParallel(numFiles=1000)
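
(For the last run: when Pool() is given no processes argument, it defaults to os.cpu_count() worker processes, which explains why letting the pool choose performs about the same as a large explicit pool here. A quick way to see that count on your machine:)

import os

# multiprocessing.Pool() with no argument creates this many workers:
print(os.cpu_count())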

Please do not consider this as an answer; it is to show you my code when running this in Python 3.x (your timeit usage did not work at all for me, so I assumed it is 2.x). Sorry, but I don't have the time to look into it deeply right now.

[EDIT] On a spinning drive, consider the disk cache: do not access the same files in different tests, or switch the order of your tests, to see whether the disk cache is involved.
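
(A minimal sketch of the cache effect described above: re-reading the same file is typically served from the OS page cache and is much faster than the first read. This assumes r0.txt exists from putfiles; if it was just written, even the first read may already be warm.)

import time

for label in ('first read', 'second read'):
    start = time.perf_counter()
    with open('r0.txt') as f:
        f.read()
    print(label, time.perf_counter() - start)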

Using the following code, changing the numProcesses=X argument manually, I got these results:

On an SSD: 0.31 seconds for 1000 files sequentially, 0.37 seconds for 1000 files in parallel with 1 worker, and 0.23 seconds for 1000 files in parallel with 4 workers.

import os
import random
import timeit
from multiprocessing import Pool
from contextlib import closing

os.chdir('c:\\temp\\')

def putfiles(numFiles=5, numCount=1):
    #numFiles = int(input("how many files?: "))
    #numCount = int(input('How many random numbers?: '))
    for num in range(numFiles):
        #print("num: " + str(num))
        with open('r' + str(num) + '.txt', 'w') as f:
        f.write("\n".join([str(random.randint(1, 100)) for i in range(numCount)]))
    #print ("pufiles done")

def readFile(fileurl):
    with open(fileurl, 'r') as f, open("ans_" + fileurl, 'w') as fw:
        fw.write(str((sum([int(i) for i in f.read().split()]))))


def getSequential(numFiles=10000):
   # print ("getSequential, nufile: " + str (numFiles))
    #in1 = int(input("how many files?: "))
    for num in range(numFiles): 
        #print ("getseq for")
        readFile('r' + str(num) + '.txt')
    #print ("getSequential done")


def getParallel(numFiles=10000, numProcesses=1):
    #numFiles = int(input("how many files?: ")) 
    #numProcesses = int(input('How many processes?: '))
    # use the numProcesses argument instead of a hard-coded pool size
    with closing(Pool(processes=numProcesses)) as p:
        p.map(readFile, ['r' + str(num) + '.txt' for num in range(numFiles)])

if __name__ == '__main__':
    #putfiles(numFiles=10000, numCount=1)

    print(timeit.timeit("getSequential()", "from __main__ import getSequential", number=1))

    print(timeit.timeit("getParallel()", "from __main__ import getParallel", number=1))

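(With Pool(processes=numProcesses) actually honoring the argument, the pool sizes can be compared from one script instead of editing the code by hand. A hypothetical driver, placed inside the if __name__ == '__main__' guard, which Windows requires for multiprocessing:)

    # compare several pool sizes; assumes the 10000 test files already exist
    for n in (1, 2, 4):
        t = timeit.timeit(lambda: getParallel(numFiles=10000, numProcesses=n), number=1)
        print(n, "workers:", round(t, 2), "seconds")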
