
Simple workable example of multiprocessing

I am looking for a simple, workable example of Python multiprocessing.

I found an example that breaks large numbers into primes. It worked because there was little input (one large number per core) and a lot of computation (breaking each number into primes).

However, my interest is different: I have a lot of input data on which I perform simple calculations. I wonder whether there is a simple way to modify the code below so that multiple cores really beat a single core. I am running Python 3.6 on a Win10 machine with 4 physical cores and 16 GB RAM.

Here is my sample code.

import numpy as np
import multiprocessing as mp
import timeit

# comment the following line to get version without queue
queue = mp.Queue()
cores_no = 4


def npv_zcb(bnd_info, cores_no):

    bnds_no = len(bnd_info)
    npvs = []

    for bnd_idx in range(bnds_no):

        nom = bnd_info[bnd_idx][0]
        mat = bnd_info[bnd_idx][1]
        yld = bnd_info[bnd_idx][2]

        npvs.append(nom / ((1 + yld) ** mat))

    if cores_no == 1:
        return npvs
    # comment the following two lines to get version without queue
    else:
        queue.put(npvs)

# generate random attributes of zero coupon bonds

print('Generating random zero coupon bonds...')


bnds_no = 100

bnd_info = np.zeros([bnds_no, 3])
bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100
bnd_info = bnd_info.tolist()

# single core
print('Running single core...')
start = timeit.default_timer()
npvs = npv_zcb(bnd_info, 1)
print('   elapsed time: ', timeit.default_timer() - start, ' seconds')

# multiprocessing
print('Running multiprocessing...')
print('   ', cores_no, ' core(s)...')
start = timeit.default_timer()

processes = []

idx = list(range(0, bnds_no, int(bnds_no / cores_no)))
idx.append(bnds_no + 1)

for core_idx in range(cores_no):
    input_data = bnd_info[idx[core_idx]: idx[core_idx + 1]]

    process = mp.Process(target=npv_zcb,
                         args=(input_data, cores_no))
    processes.append(process)
    process.start()

for process_aux in processes:
    process_aux.join()

# comment the following three lines to get version without queue
mylist = []
while not queue.empty():
    mylist.append(queue.get())

print('   elapsed time: ', timeit.default_timer() - start, ' seconds')

I would be very grateful if anyone could advise me how to modify the code so that the multi-core run beats the single-core run. I have also noticed that increasing the variable bnds_no to 1,000 leads to a BrokenPipeError. One would expect that more input would lead to a longer computation time rather than an error... What is wrong here?

The BrokenPipeError is not due to the larger input; it is due to a race condition that occurs because queue.empty() and queue.get() are used in separate steps.

You don't see it with smaller inputs most of the time because the queue items get processed quickly and the race condition does not occur, but with larger datasets the chance of hitting it increases.

Even with smaller inputs, try running your script multiple times, maybe 10-15 times, and you will see the BrokenPipeError occur.

One solution is to pass a sentinel value into the queue, which you can use to test whether all the data in the queue has been processed.

Try modifying your code to something like this:

q = mp.Queue()
# <put the data in the queue>
q.put(None)          # sentinel: no more data will follow


while True:
    data = q.get()
    if data is not None:
        # <process the data here>
        pass
    else:
        q.put(None)  # put the sentinel back so other consumers stop too
        return
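To make the pattern concrete, here is a minimal self-contained sketch of the sentinel approach applied to the NPV computation from the question (the slicing-based chunking and the one-sentinel-per-worker convention are illustrative choices, not the only way to do this):

import multiprocessing as mp

SENTINEL = None  # marks "this worker is done"


def worker(bnd_chunk, queue):
    # value one chunk of bonds, then signal completion with the sentinel
    for nom, mat, yld in bnd_chunk:
        queue.put(nom / ((1 + yld) ** mat))
    queue.put(SENTINEL)


if __name__ == '__main__':
    bnd_info = [(100.0, 10, 0.05), (200.0, 5, 0.03),
                (150.0, 7, 0.04), (120.0, 3, 0.02)]
    workers_no = 2
    queue = mp.Queue()

    processes = [mp.Process(target=worker,
                            args=(bnd_info[i::workers_no], queue))
                 for i in range(workers_no)]
    for process in processes:
        process.start()

    # drain the queue until every worker has reported its sentinel;
    # this avoids the queue.empty()/queue.get() race entirely
    npvs, done = [], 0
    while done < workers_no:
        item = queue.get()  # blocks until something arrives
        if item is SENTINEL:
            done += 1
        else:
            npvs.append(item)

    for process in processes:
        process.join()

    print(npvs)

Draining the queue before joining also matters here: queue.get() blocks until an item arrives, so no polling is needed, and the workers can flush their pipes before join() is called.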

This doesn't directly answer your question, but if you were using RxPY for reactive Python programming, you could check out their small example on multiprocessing: https://github.com/ReactiveX/RxPY/tree/release/v1.6.x#concurrency

It seems a bit easier to manage concurrency with ReactiveX/RxPY than to do it manually.

OK, so I removed the queue-related parts from the code to see whether that would get rid of the BrokenPipeError (above, I updated the original code to indicate what should be commented out). Unfortunately, it did not help.

I also tested the code on my personal PC with Linux (Ubuntu 18.10, Python 3.6.7). Quite surprisingly, the code behaves differently on the two systems. On Linux the version without the queue runs without problems, while the version with the queue runs forever. On Windows there is no difference - I always end up with a BrokenPipeError.

PS: In another post (No multiprocessing print outputs (Spyder)) I found that there might be a problem with multiprocessing when using the Spyder editor. I experienced exactly the same problem on a Windows machine. So not all examples in the official documentation work as expected...

This doesn't answer your question - I'm only posting it to illustrate what I said in the comments about when multiprocessing might be able to speed processing up.

In the code below, which is based on yours, I've added a REPEAT constant that makes npv_zcb() do its computations over again that many times, to simulate using the CPU more. Changing this constant's value generally slows down or speeds up the single-core processing much more than it does the multiprocessing part; in fact, it hardly affects the latter at all.

import numpy as np
import multiprocessing as mp
import timeit


np.random.seed(42)  # Generate same set of random numbers for testing.

REPEAT = 10  # Number of times to repeat computations performed in npv_zcb.


def npv_zcb(bnd_info, queue):

    npvs = []

    for bnd_idx in range(len(bnd_info)):

        nom = bnd_info[bnd_idx][0]
        mat = bnd_info[bnd_idx][1]
        yld = bnd_info[bnd_idx][2]

        for _ in range(REPEAT):  # To simulate more computations.
            v = nom / ((1 + yld) ** mat)

        npvs.append(v)  # One NPV per bond, regardless of REPEAT.

    if queue:
        queue.put(npvs)
    else:
        return npvs


if __name__ == '__main__':

    print('Generating random zero coupon bonds...')
    print()

    bnds_no = 100
    cores_no = 4

    # generate random attributes of zero coupon bonds

    bnd_info = np.zeros([bnds_no, 3])
    bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
    bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
    bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100
    bnd_info = bnd_info.tolist()

    # single core
    print('Running single core...')
    start = timeit.default_timer()
    npvs = npv_zcb(bnd_info, None)
    print('   elapsed time: {:.6f} seconds'.format(timeit.default_timer() - start))

    # multiprocessing
    print()
    print('Running multiprocessing...')
    print('  ', cores_no, ' core(s)...')
    start = timeit.default_timer()

    queue = mp.Queue()
    processes = []

    idx = list(range(0, bnds_no, int(bnds_no / cores_no)))
    idx.append(bnds_no + 1)

    for core_idx in range(cores_no):
        input_data = bnd_info[idx[core_idx]: idx[core_idx + 1]]

        process = mp.Process(target=npv_zcb, args=(input_data, queue))
        processes.append(process)
        process.start()

    for process in processes:
        process.join()

    # Note: with results this small it is safe to drain after join();
    # for large results, drain the queue before joining to avoid deadlock.
    mylist = []
    while not queue.empty():
        mylist.extend(queue.get())

    print('   elapsed time: {:.6f} seconds'.format(timeit.default_timer() - start))
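As a side note not in the original answer: the manual chunking, process bookkeeping, and queue draining above are exactly what multiprocessing.Pool automates. A sketch of the same benchmark with Pool.map, keeping the REPEAT trick so the task stays CPU-bound, could look like this:

import numpy as np
import multiprocessing as mp
import timeit

REPEAT = 10  # repeat the computation to simulate a heavier CPU load


def npv_zcb(bnd_chunk):
    # value one chunk of bonds; returns a plain list so it pickles cheaply
    npvs = []
    for nom, mat, yld in bnd_chunk:
        for _ in range(REPEAT):
            v = nom / ((1 + yld) ** mat)
        npvs.append(v)
    return npvs


if __name__ == '__main__':
    np.random.seed(42)
    bnds_no = 100
    cores_no = 4

    bnd_info = np.zeros([bnds_no, 3])
    bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
    bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
    bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100

    # np.array_split handles chunk sizes that do not divide evenly
    chunks = [chunk.tolist() for chunk in np.array_split(bnd_info, cores_no)]

    start = timeit.default_timer()
    with mp.Pool(cores_no) as pool:
        results = pool.map(npv_zcb, chunks)
    npvs = [v for chunk in results for v in chunk]
    print('elapsed time: {:.6f} seconds ({} NPVs)'.format(
        timeit.default_timer() - start, len(npvs)))

Pool.map collects the workers' return values in order, so the manual Queue disappears entirely.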

OK - so finally I found an answer. 好-终于找到了答案。 Multiprocessing does not work on Windows. 多重处理在Windows上不起作用。 The following code runs fine on Ubuntu (Ubuntu 19.04 & python 3.7) but not on Windows (Win10 & python 3.6). 以下代码可以在Ubuntu(Ubuntu 19.04&python 3.7)上正常运行,但不能在Windows(Win10&python 3.6)上正常运行。 Hope it helps others... 希望它能帮助别人...

import pandas as pd
import numpy as np
import csv
import multiprocessing as mp
import timeit


def npv_zcb(bnd_file, delimiter=','):
    """
    Michal Mackanic
    06/05/2019 v1.0

    Load bond positions from a .csv file, value the bonds and save results
    back to a .csv file.

    inputs:
        bnd_file: str
            full path to a .csv file with bond positions
        delimiter: str
            delimiter to be used in .csv file
    outputs:
        a .csv file with additional field npv.

    dependencies:

    example:
        npv_zcb('C:\\temp\\bnd_aux.csv', ',')
    """

    # load the input file as a dataframe
    bnd_info = pd.read_csv(bnd_file,
                           sep=delimiter,
                           quoting=2,  # csv.QUOTE_NONNUMERIC
                           doublequote=True,
                           low_memory=False)

    # convert dataframe into list of dictionaries
    bnd_info = bnd_info.to_dict(orient='records')

    # get number of bonds in the file
    bnds_no = len(bnd_info)

    # go bond by bond
    for bnd_idx in range(bnds_no):
        mat = bnd_info[bnd_idx]['maturity']
        nom = bnd_info[bnd_idx]['nominal']
        yld = bnd_info[bnd_idx]['yld']
        bnd_info[bnd_idx]['npv'] = nom / ((1 + yld) ** mat)

    # convert list of dictionaries back to a dataframe and save it as a .csv file
    bnd_info = pd.DataFrame(bnd_info)
    bnd_info.to_csv(bnd_file,
                    sep=delimiter,
                    quoting=csv.QUOTE_NONNUMERIC,
                    quotechar='"',
                    index=False)

    return


def main(cores_no, bnds_no, path, delimiter):

    # generate random attributes of zero coupon bonds
    print('Generating random zero coupon bonds...')
    bnd_info = np.zeros([bnds_no, 3])
    bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
    bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
    bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100
    bnd_info = zip(bnd_info[:, 0], bnd_info[:, 1], bnd_info[:, 2])
    bnd_info = [{'maturity': mat,
                 'nominal': nom,
                 'yld': yld} for mat, nom, yld in bnd_info]
    bnd_info = pd.DataFrame(bnd_info)

    # save bond positions into a .csv file
    bnd_info.to_csv(path + 'bnd_aux.csv',
                    sep=delimiter,
                    quoting=csv.QUOTE_NONNUMERIC,
                    quotechar='"',
                    index=False)

    # prepare one .csv file per core
    print('Preparing input files...')

    idx = list(range(0, bnds_no, int(bnds_no / cores_no)))
    idx.append(bnds_no + 1)

    for core_idx in range(cores_no):
        # save bond positions into a .csv file
        file_name = path + 'bnd_aux_' + str(core_idx) + '.csv'
        bnd_info_aux = bnd_info[idx[core_idx]: idx[core_idx + 1]]
        bnd_info_aux.to_csv(file_name,
                            sep=delimiter,
                            quoting=csv.QUOTE_NONNUMERIC,
                            quotechar='"',
                            index=False)

    # SINGLE CORE
    print('Running single core...')

    start = timeit.default_timer()

    # evaluate bond positions
    npv_zcb(path + 'bnd_aux.csv', delimiter)

    print('   elapsed time: ', timeit.default_timer() - start, ' seconds')

    # MULTIPLE CORES
    if __name__ == '__main__':

        # spread calculation among several cores
        print('Running multiprocessing...')
        print('   ', cores_no, ' core(s)...')

        start = timeit.default_timer()

        processes = []

        # go core by core
        print('        spreading calculation among processes...')
        for core_idx in range(cores_no):
            # run calculations
            file_name = path + 'bnd_aux_' + str(core_idx) + '.csv'
            process = mp.Process(target=npv_zcb,
                                 args=(file_name, delimiter))
            processes.append(process)
            process.start()

        # wait till every process is finished
        print('        waiting for all processes to finish...')
        for process in processes:
            process.join()

    print('   elapsed time: ', timeit.default_timer() - start, ' seconds')

main(cores_no=2,
     bnds_no=1000000,
     path='/home/macky/',
     delimiter=',')
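A plausible explanation for the platform difference (my reading, not stated in the post): Linux starts worker processes by forking, so children inherit the parent's state and never re-run the script. Windows spawns a fresh interpreter that re-imports the script, where __name__ is no longer '__main__'. In the version above, the guard at the bottom of main() does stop recursive process creation, but everything before it - the bond generation, the .csv writes, the single-core run - is re-executed by every spawned child, with all of them writing to the same files. A minimal sketch of the structure that avoids this:

import multiprocessing as mp


def worker(core_idx):
    # the per-process work goes here
    print('worker', core_idx, 'running')


def main(cores_no):
    # expensive setup (e.g. generating and writing .csv files) is only
    # safe here if main() itself can never run in a child process
    processes = [mp.Process(target=worker, args=(idx,))
                 for idx in range(cores_no)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()


# the guard must protect the *call* to main(), not just part of its body:
# spawned children re-import this module with __name__ == '__mp_main__',
# so everything below this line is skipped in them
if __name__ == '__main__':
    main(cores_no=2)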

After some help from a colleague, I was able to produce a simple piece of code that actually runs as expected. I was almost there - my code needed a few subtle (yet crucial) modifications. To run the code, open an Anaconda prompt, type python -m idlelib, open the file, and run it.

import pandas as pd
import numpy as np
import csv
import multiprocessing as mp
import timeit


def npv_zcb(core_idx, bnd_file, delimiter=','):
    """
    Michal Mackanic
    06/05/2019 v1.0

    Load bond positions from a .csv file, value the bonds and save results
    back to a .csv file.

    inputs:
        core_idx: int
            index of the core running this call (used in progress messages)
        bnd_file: str
            full path to a .csv file with bond positions
        delimiter: str
            delimiter to be used in .csv file
    outputs:
        a .csv file with additional field npv.

    dependencies:

    example:
        npv_zcb(1, 'C:\\temp\\bnd_aux.csv', ',')
    """

    # core idx
    print('   npv_zcb() starting on core ' + str(core_idx))

    # load the input file as a dataframe
    bnd_info = pd.read_csv(bnd_file,
                           sep=delimiter,
                           quoting=2,  # csv.QUOTE_NONNUMERIC
                           header=0,
                           doublequote=True,
                           low_memory=False)

    # convert dataframe into list of dictionaries
    bnd_info = bnd_info.to_dict(orient='records')

    # get number of bonds in the file
    bnds_no = len(bnd_info)

    # go bond by bond
    for bnd_idx in range(bnds_no):
        mat = bnd_info[bnd_idx]['maturity']
        nom = bnd_info[bnd_idx]['nominal']
        yld = bnd_info[bnd_idx]['yld']
        bnd_info[bnd_idx]['npv'] = nom / ((1 + yld) ** mat)

    # convert list of dictionaries back to a dataframe and save it as a .csv file
    bnd_info = pd.DataFrame(bnd_info)
    bnd_info.to_csv(bnd_file,
                    sep=delimiter,
                    quoting=csv.QUOTE_NONNUMERIC,
                    quotechar='"',
                    index=False)

    # core idx
    print('   npv_zcb() finished on core ' + str(core_idx))

    # everything OK
    return True


def main(cores_no, bnds_no, path, delimiter):

    if __name__ == '__main__':
        mp.freeze_support()

        # generate random attributes of zero coupon bonds
        print('Generating random zero coupon bonds...')
        bnd_info = np.zeros([bnds_no, 3])
        bnd_info[:, 0] = np.random.randint(1, 31, size=bnds_no)
        bnd_info[:, 1] = np.random.randint(70, 151, size=bnds_no)
        bnd_info[:, 2] = np.random.randint(0, 100, size=bnds_no) / 100
        bnd_info = zip(bnd_info[:, 0], bnd_info[:, 1], bnd_info[:, 2])
        bnd_info = [{'maturity': mat,
                     'nominal': nom,
                     'yld': yld} for mat, nom, yld in bnd_info]
        bnd_info = pd.DataFrame(bnd_info)

        # save bond positions into a .csv file
        bnd_info.to_csv(path + 'bnd_aux.csv',
                        sep=delimiter,
                        quoting=csv.QUOTE_NONNUMERIC,
                        quotechar='"',
                        index=False)

        # prepare one .csv file per core
        print('Preparing input files...')

        idx = list(range(0, bnds_no, int(bnds_no / cores_no)))
        idx.append(bnds_no + 1)

        for core_idx in range(cores_no):
            # save bond positions into a .csv file
            file_name = path + 'bnd_aux_' + str(core_idx) + '.csv'
            bnd_info_aux = bnd_info[idx[core_idx]: idx[core_idx + 1]]
            bnd_info_aux.to_csv(file_name,
                                sep=delimiter,
                                quoting=csv.QUOTE_NONNUMERIC,
                                quotechar='"',
                                index=False)

        # SINGLE CORE
        print('Running single core...')

        start = timeit.default_timer()

        # evaluate bond positions
        npv_zcb(1, path + 'bnd_aux.csv', delimiter)

        print('   elapsed time: ', timeit.default_timer() - start, ' seconds')

        # MULTIPLE CORES
        # spread calculation among several cores
        print('Running multiprocessing...')
        print('   ', cores_no, ' core(s)...')

        start = timeit.default_timer()

        processes = []

        # go core by core
        print('        spreading calculation among processes...')
        for core_idx in range(cores_no):
            # run calculations
            file_name = path + 'bnd_aux_' + str(core_idx) + '.csv'
            process = mp.Process(target=npv_zcb,
                                 args=(core_idx, file_name, delimiter))
            processes.append(process)
            process.start()

        # wait till every process is finished
        print('        waiting for all processes to finish...')
        for process in processes:
            process.join()

        print('   elapsed time: ', timeit.default_timer() - start, ' seconds')


main(cores_no=2,
     bnds_no=1000000,
     path='C:\\temp\\',
     delimiter=',')
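For the record, the crucial modifications appear to be these: the body of main() is wrapped in an if __name__ == '__main__': check, so when Windows' spawn start method re-imports the script in each child (where __name__ is '__mp_main__' rather than '__main__'), the top-level call to main() does nothing and no recursive process creation or file contention occurs; and mp.freeze_support() is called, which has no effect in a normal interpreter but is required for frozen Windows executables. Running the script from a plain console rather than from inside Spyder also avoids the separate stdout problem mentioned earlier.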
