Benefits of manually running multiple instances of a program

Original post 2017-12-28 21:32:08 1 2 python

So I've googled multithreading for Python 3 and not quite found what I'm looking for.

I have a Python module that goes to a given path, scrapes the data from a bunch of Excel files (.xlsx, using openpyxl) and outputs a CSV to go into my SQL db. Right now it takes ~20-25 min to go through all 160+ files (large files; I'm not concerned with time per file per se). I split them into 2 different directories of ~80 each and ran two instances of IDLE at the same time, one in each directory ('path\\test1\\' and 'path\\test2\\').

This took 16 minutes with these 2 instances of Python running at the same time. What are the limitations/concerns with running this way, or even expanding to 4 instances of Python running at once?

Notes:

The data scraped from Excel is totally independent for each file, so no interaction is needed until I combine the CSV outputs for upload later.
I'm on a work laptop, an HP EliteBook with a quad-core CPU.

Thanks in advance.

Btw - this got me interested in learning C# for its multithreading capabilities.

2 answers

A single instance of your Python module is likely only able to take advantage of a single core at a time. If your process is CPU limited, you'll see the benefits of this kind of parallelism decline as all of your cores become utilized. If your process is disk IO heavy, you may see your performance tail off sooner, as IO needs scale with the number of processes.

In either case, on a quad-core CPU with a single disk, you'll see the benefits of parallelism fall off after no more than a few threads/processes. It might not be worth your effort to explicitly multi-thread this kind of task beyond running a few instances of the script the existing way.
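If you'd rather not open several IDLE windows by hand, the same "few instances" idea can be sketched with the standard library's concurrent.futures.ProcessPoolExecutor. This is only a minimal sketch: convert_one and convert_all are made-up names, and the pipe-delimited text files are a stand-in for the real .xlsx scraping, which isn't shown in the question.

```python
import csv
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def convert_one(src: Path) -> Path:
    """Hypothetical per-file job: read one input file, write one CSV next to it.

    In the real script this is where the openpyxl scraping would go.
    """
    dst = src.with_suffix(".csv")
    rows = [line.split("|") for line in src.read_text().splitlines()]
    with dst.open("w", newline="") as fh:
        csv.writer(fh).writerows(rows)
    return dst


def convert_all(folder: Path, pattern: str, workers: int = 2) -> list:
    """Run convert_one over every matching file using `workers` processes."""
    files = sorted(folder.glob(pattern))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each file is independent, so map() can hand them out in any order.
        return list(pool.map(convert_one, files))


if __name__ == "__main__":
    # Demo on throwaway files; the guard is needed so worker processes
    # can import this module without re-running the demo.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        for i in range(4):
            (folder / f"book{i}.txt").write_text(f"a|{i}\nb|{i}\n")
        outputs = convert_all(folder, "*.txt", workers=2)
        print([p.name for p in outputs])
```

Timing a run with workers=2, then 3 and 4, is a reasonable way to see where the gains stop on a given machine.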

Your program has to:

Read the data from the hard drive into memory.
Do some processing in memory (parse the data).
Write the new data from memory back to the hard drive.

Each of these has its own limitations. E.g. the hard drive has specific limits:

How fast it can read from the disk.
How fast it can write to the disk.
How fast the drive can "seek", i.e. move the head from one part of the disk to another and locate the correct sector. This matters more when you are accessing many different files at once.

In a mechanical hard disk, seeking involves literally moving the read/write head across the disk, then waiting for the correct sector to pass under the head. In a solid state drive (SSD), this mechanical problem does not exist, which is one of the advantages of SSDs.

But if you are using a disk drive which does have the issue of seek time (all mechanical disks), and you run two copies of your program, you are using four files at the same time (two being read and two being written) and the disk drive head has to constantly move from the location of one file to another. This takes time.

Then there are limits to the speed of:

Moving data in and out of memory.
How fast the processor processes the data.

Running more than one copy of your program allows more cores of the processor to be used, so you can increase the overall processing speed. But if everything is stored on the same disk, you can only go so far before you run into a limitation on your reading, writing and seeking speeds. So, after a point, running more processes won't help, because the CPU is not what's holding you back.

Every operating system has ways of viewing the resources being used at any given moment. In Windows this is "Task Manager" (Performance tab). On Unix-like systems there's a program called "top". Observe these programs while your task is running and they will tell you where your bottleneck is (reading, writing, CPU, network etc). If, for example, the disk is at 100% and the CPU at 50%, then your program is stuck waiting for the disk and running more processes won't help you.
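You can also get a rough version of the same answer from inside the script by comparing wall-clock time with CPU time for a job. A minimal sketch, using only the standard time module; measure, busy_work and waiting_work are made-up names, and the two stand-in jobs just imitate a CPU-bound task and a task that mostly waits on a device:

```python
import time


def measure(job, *args):
    """Run job(*args) and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    job(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def busy_work(n):
    """CPU-heavy stand-in: the process computes the whole time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def waiting_work(seconds):
    """Stand-in for waiting on a slow device: the process mostly sleeps."""
    time.sleep(seconds)


if __name__ == "__main__":
    jobs = [("busy", busy_work, 2_000_000), ("waiting", waiting_work, 0.5)]
    for name, job, arg in jobs:
        wall, cpu = measure(job, arg)
        print(f"{name}: wall={wall:.2f}s cpu={cpu:.2f}s cpu/wall={cpu / wall:.0%}")
```

If CPU time is close to wall time, the job is probably CPU-bound and more processes (up to the core count) may help. If CPU time is a small fraction of wall time, the process is mostly waiting, most likely on the disk, and extra processes are unlikely to buy much.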

My educated guess is that you cannot go much further optimizing this without spreading the data onto additional hard disks. You say you're on a laptop, so you most likely only have one hard disk installed, but if you have a fast external disk connection (USB3/eSATA/Thunderbolt) then you can probably speed your process up by splitting the job between disks.

There are two ways to split it. One is to divide your files in half and do one set on one disk and the other set on another disk. The other is to read all your files from one disk and write to the other. Either way, each drive does not have to seek (move from track to track) as much, which speeds things up.

If you only have a USB flash drive, you can try to use that. If it's USB3 it may help you. But in that case, only read your XLSX files off the flash drive, and write your CSV files to the regular hard disk in your laptop. Flash drives have very slow write speeds compared to most hard disks.

You already know that running two processes speeds things up to the point where the disk becomes the limitation, so run two processes per disk. Keep in mind that the more files you access on the same hard disk at the same time, the more the drive will have to seek.

Some people make whole careers of solving these kinds of problems, so you'll have to experiment a little bit to figure out the optimal use of whatever hardware you have.

Another option that comes to mind is to change your program so that, instead of writing a CSV file that is then loaded into your database, it writes directly to the database. The script itself will take longer to run, but this eliminates a step, so the whole job may take less time.
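As a rough illustration of writing straight to the database, here is a minimal sketch using sqlite3 from the standard library. The question only says "sql db", so SQLite is an assumption made to keep the example self-contained, and the readings table, its columns and save_rows are all made up. With another database you would use that database's own driver instead.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert all rows in one transaction instead of going through a CSV file."""
    with conn:  # commits once on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO readings (source_file, label, value) VALUES (?, ?, ?)",
            rows,
        )


if __name__ == "__main__":
    # In-memory database and hypothetical table, just for the demo.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (source_file TEXT, label TEXT, value REAL)"
    )
    save_rows(conn, [("book1.xlsx", "a", 1.5), ("book1.xlsx", "b", 2.0)])
    print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])
    conn.close()
```

Inserting many rows per transaction, rather than committing after each row, usually matters a lot for speed here.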

Then, there are other ways to optimize. If you are stuck working with just one hard disk, you can reduce seeking by reading and writing in larger chunks. For example, let's say that right now you read a single record from the disk, process it, then write it out, and you do this for 100 million records. The operating system will already try to optimize reading and writing behavior, but you'll still have quite a lot of seeking as the reads and writes intersperse. But if, let's say, you can read 10 million records at a time into memory, process them all, then write them out at once, you'll likely get better performance. Try to avoid doing many small reads and writes.
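The read-a-batch, process, write-a-batch pattern could look something like the sketch below. The names batched, process and convert are hypothetical, and the doubling in process is just a placeholder for the real work. Python and the OS already buffer file writes to some extent, so treat this as the shape of the idea rather than a guaranteed speedup.

```python
import csv
import io
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process(record):
    """Hypothetical per-record transformation (placeholder for real parsing)."""
    return [record[0], record[1] * 2]


def convert(records, out_file, batch_size=10_000):
    """Process records in batches, writing each batch with one writerows call."""
    writer = csv.writer(out_file)
    written = 0
    for chunk in batched(records, batch_size):
        writer.writerows([process(r) for r in chunk])
        written += len(chunk)
    return written


if __name__ == "__main__":
    out = io.StringIO()  # stands in for a real output file
    n = convert(((i, i) for i in range(25)), out, batch_size=10)
    print(n, "records written")
```

The batch size is a trade-off: bigger batches mean fewer, larger writes but more memory in use at once.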