
Beginner question about Python multiprocessing?

I have a number of records in the database that I want to process. Basically, I want to run several regex substitutions over the tokens of the text-string rows and, at the end, write them back to the database.

I wish to know whether multiprocessing speeds up the time required to do such tasks. I did a

multiprocessing.cpu_count()

and it returns 8. I have tried something like

from multiprocessing import Process

# resultsSize, division, offset and sub_table are assumed to be defined
# earlier (e.g. division = resultsSize // 4 and offset = 0)
process = []
for i in range(4):
    if i == 3:
        # the last worker picks up the remainder of the rows
        limit = resultsSize - (3 * division)
    else:
        limit = division

    # limit and offset select the subset of records the function fetches from the db
    p = Process(target=sub_table.processR, args=(limit, offset, i))
    p.start()
    process.append(p)
    offset += division + 1

for po in process:
    po.join()

but apparently, the time taken is higher than the time required to run a single thread. Why is this so? Can someone please enlighten me: is this a suitable case, or what am I doing wrong here?

Why is this so?

Can someone please enlighten me: in what cases does multiprocessing give better performance?

Here's one trick.

Multiprocessing only helps when your bottleneck is a resource that's not shared.

A shared resource (like a database) will be pulled in 8 different directions, which yields little real benefit.

To find a non-shared resource, you must have independent objects, such as a list that's already in memory.

If you want to work from a database, you need to start 8 things which then do no more database work. So, a central query that distributes work to separate processors can sometimes be beneficial.

Or 8 different files. Note that the file system, as a whole, is a shared resource, and some kinds of file access involve sharing something like a disk drive or a directory.

Or a pipeline of 8 smaller steps. The standard Unix pipeline trick, query | process1 | process2 | process3 >file, works better than almost anything else because each stage in the pipeline is completely independent.

Here's the other trick.

Your computer system (OS, devices, database, network, etc.) is so complex that simplistic theories won't explain performance at all. You need to (a) take several measurements and (b) try several different algorithms until you understand all the degrees of freedom.

A question like "In what cases does multiprocessing give better performance?" doesn't have a simple answer.

In order to have a simple answer, you'd need a much, much simpler operating system. Fewer devices. No database and no network, for example. Since your OS is complex, there's no simple answer to your question.

In general, multi-CPU or multicore processing helps most when your problem is CPU-bound (i.e., it spends most of its time with the CPU running as fast as it can).

From your description, you have an IO-bound problem: it takes forever to get the data from disk to the CPU (which sits idle), and then the CPU operation is very fast (because it is so simple).

Thus, accelerating the CPU operation does not make a very big difference overall.

Here are a couple of questions:

  1. In your processR function, does it slurp a large number of records from the database at one time, or is it fetching one row at a time? (Each single-row fetch is very costly, performance-wise.)

  2. It may not work for your specific application, but since you are processing "everything", using a database will likely be slower than a flat file. Databases are optimised for logical queries, not sequential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results?

Hope this helps.
