
benefits of manually running multiple instances of a program

So I've googled multithreading for Python 3 and haven't quite found what I'm looking for.

I have a Python module that goes to a given path and scrapes the data from a bunch of Excel files (.xlsx, using openpyxl) and outputs a CSV to go into my SQL DB. Right now it takes ~20-25 minutes to go through all 160+ files (they're large files; I'm not concerned with time per file per se). I split them into two different directories of ~80 files each and ran two instances of IDLE at the same time, one in each directory ('path\\test1\\' and 'path\\test2\\').

This took 16 minutes with the two instances of Python running at the same time. What are the limitations/concerns with running this way, or even expanding to four instances of Python running at once?

notes:

  • the data scraped from Excel is totally independent for each file, so no interaction is needed until I combine the CSV outputs for upload later.

  • I'm on a work laptop, an HP EliteBook with a quad-core CPU

Thanks in advance.

Btw - this got me interested in learning C# for its multithreading capabilities.

A single instance of your Python module is likely only able to take advantage of a single core at a time. If your process is CPU limited, you'll see the benefits of this kind of parallelism decline as all of your cores become utilized. If your process is disk-IO heavy, you'll see the performance tail off sooner, as IO demands scale with the number of processes.

In either case, on a quad-core CPU with a single disk, you'll see the benefits of parallelism fall off with no more than a few threads/processes. It may not be worth the effort to explicitly multi-thread this kind of task beyond running a few instances of the script the way you already are.
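If you'd rather not launch separate IDLE sessions by hand, the standard library's multiprocessing module can do the same splitting for you inside one script. Below is a minimal sketch, not your actual module: scrape_one_file() is a hypothetical placeholder for your existing per-file openpyxl-to-CSV logic, and the directory path is just an example.

```python
# Minimal sketch: run the per-file scrape in a small worker pool.
# scrape_one_file() is a hypothetical stand-in for your existing logic.
from multiprocessing import Pool
from pathlib import Path

def scrape_one_file(xlsx_path):
    # ... open xlsx_path with openpyxl, parse it, write one CSV ...
    return xlsx_path.name

if __name__ == "__main__":
    files = sorted(Path(r"path\test1").glob("*.xlsx"))
    # Start with 2 workers (what you already tested by hand); raise cautiously,
    # since the disk may become the bottleneck before the CPU does.
    with Pool(processes=2) as pool:
        for done in pool.imap_unordered(scrape_one_file, files):
            print("finished", done)
```

This is just the "several instances at once" idea automated; the same disk and CPU limits discussed below still apply.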

Your program has to:

  • Read the data from the hard drive into memory.
  • Do some processing in memory (parse the data).
  • Write the new data from memory back to the hard drive.

Each of these has its own limitations. For example, the hard drive has specific limits on:

  • How fast it can read from the disk.
  • How fast it can write to the disk.
  • How fast the drive can "seek", i.e. move the head from one part of the disk to another and locate the correct sector. This matters more when you are accessing many different files at once.

In a mechanical hard disk, seeking involves literally moving the read/write head across the disk then waiting for the correct sector to pass under the head. In a solid state drive (SSD), this mechanical problem does not exist, which is one of the advantages of SSD.

But if you are using a disk drive that does suffer from seek time (all mechanical disks), and you run two copies of your program, you have at least four files in use at the same time (two being read, two being written), and the drive head has to constantly move from the location of one file to another. This takes time.

Then there are limits to the speed of:

  • Moving data in and out of memory.
  • How fast the processor processes the data.

Running more than one copy of your program allows more cores of the processor to be used, so you can increase the overall processing speed. But if everything is stored on the same disk, you can only go so far before you run into the limits on reading, writing and seeking speed. So, after a point, running more processes won't help, because the disk, not the CPU, is what's holding you back.

Every operating system has ways of viewing the resources in use at any given moment. In Windows this is Task Manager (Performance tab); on Unix-like systems there's a program called "top". Watch one of these while your task is running and it will tell you where your bottleneck is (reading, writing, CPU, network, etc.). If, for example, the disk is at 100% and the CPU at 50%, then your program is stuck waiting for the disk and running more processes won't help you.
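If you'd rather sample this from Python itself while the job runs, the third-party psutil package (an assumption here: it isn't part of the standard library, so it would need a pip install) exposes roughly the same counters Task Manager shows. A rough sketch:

```python
# Rough sketch: print CPU and disk throughput once per second while the
# scrape runs, to see which resource is pegged. Requires psutil; stop with Ctrl+C.
import psutil

prev = psutil.disk_io_counters()
while True:
    cpu = psutil.cpu_percent(interval=1)          # % CPU over the last second
    cur = psutil.disk_io_counters()
    read_mb = (cur.read_bytes - prev.read_bytes) / 1e6
    write_mb = (cur.write_bytes - prev.write_bytes) / 1e6
    prev = cur
    print(f"CPU {cpu:5.1f}%  disk read {read_mb:6.1f} MB/s  write {write_mb:6.1f} MB/s")
```

If CPU sits well below 100% while disk throughput has flatlined, adding more processes won't buy you anything.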

My educated guess is that you cannot go much further optimizing this without spreading the data onto additional hard disks. You say you're on a laptop, so you most likely only have one hard disk installed, but if you have a fast external disk connection (USB 3 / eSATA / Thunderbolt) then you can probably speed your process up by splitting the job between disks.

There are two ways to split it. One is to divide your files in half and process one set from each disk. The other is to read all your files from one disk and write the output to the other; that way each drive doesn't have to seek (move from track to track) as much, which speeds things up.
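In code, the second arrangement is just a matter of keeping the input and output roots on different physical drives. A small sketch, where the drive letters and folder names are only examples to substitute with your own:

```python
# Sketch: read .xlsx files from one physical drive, write CSVs to another,
# so each drive mostly streams instead of seeking back and forth.
from pathlib import Path

SRC = Path(r"D:\excel_input")    # example: external/second drive holding the .xlsx files
DST = Path(r"C:\csv_output")     # example: internal drive receiving the CSV output
DST.mkdir(parents=True, exist_ok=True)

for xlsx in SRC.glob("*.xlsx"):
    out_csv = DST / (xlsx.stem + ".csv")
    # ... parse xlsx with openpyxl and write its rows to out_csv ...
```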

If you only have a USB flash drive, you can try to use that; if it's USB 3 it may help. But in that case, only read your .xlsx files off the flash drive, and write your CSV files to the regular hard disk in your laptop, since flash drives have very slow write speeds compared to most hard disks.

You already know that running two processes speeds things up to the point where the disk becomes the limitation, so run two processes per disk. Keep in mind that the more files you access on the same hard disk at the same time, the more the drive will have to seek.
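Putting those two ideas together, you could group the files by the drive they live on and give each drive its own small pool of workers. The sketch below assumes the same hypothetical scrape_one_file() as earlier, and the directory paths are placeholders:

```python
# Sketch: two worker processes per physical drive. Paths and
# scrape_one_file() are placeholders for your actual layout and logic.
from multiprocessing import Pool
from pathlib import Path

DRIVES = [Path(r"C:\excel_batch"), Path(r"D:\excel_batch")]  # one entry per disk

def scrape_one_file(xlsx_path):
    ...  # your existing per-file openpyxl -> CSV logic

if __name__ == "__main__":
    pools = [Pool(processes=2) for _ in DRIVES]   # 2 workers per drive
    jobs = []
    for pool, drive in zip(pools, DRIVES):
        jobs.append(pool.map_async(scrape_one_file, sorted(drive.glob("*.xlsx"))))
    for pool, job in zip(pools, jobs):
        job.wait()
        pool.close()
        pool.join()
```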

Some people make whole careers out of solving these kinds of problems, so you'll have to experiment a little to figure out the optimal use of whatever hardware you have.

Another option that comes to mind is to rewrite your program so that, instead of producing a CSV file that is then loaded into your database, it writes directly to the database. The scraping step itself will take longer, but it eliminates a step, so the whole job may take less time.
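What that might look like, using the standard-library sqlite3 module purely for illustration (your actual DB driver, table name, and column layout will differ and are assumptions here):

```python
# Sketch: insert parsed rows straight into a database instead of going
# through an intermediate CSV. sqlite3, the table name, and the assumption
# of three columns per row are illustrative only; swap in your own driver/schema.
import sqlite3
from openpyxl import load_workbook

def load_xlsx_into_db(xlsx_path, db_path="scraped.db"):
    wb = load_workbook(xlsx_path, read_only=True, data_only=True)
    ws = wb.active
    rows = list(ws.iter_rows(min_row=2, values_only=True))  # skip a header row
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS scraped (col_a, col_b, col_c)")
        conn.executemany("INSERT INTO scraped VALUES (?, ?, ?)", rows)
    wb.close()
```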

Then, there are other ways to optimize. For example, if you are stuck working with just one hard disk, you can reduce seeking by reading and writing in larger chunks. Let's say that right now you read a single record from the disk, process it, then write it out, and you do this for 100 million records. The operating system will already try to optimize reading and writing behavior, but you'll still have quite a lot of seeking as the reads and writes intersperse. If instead you can read, say, 10 million records at a time into memory, process them all, then write them out at once, you'll likely get better performance. Try to avoid doing many small reads and writes.
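For the CSV-writing side of your job, that mostly means buffering parsed rows and flushing them in one call rather than writing record by record. A small sketch; the batch size is an arbitrary example to tune against your memory budget:

```python
# Sketch: accumulate parsed rows in memory and flush them in large batches
# instead of issuing one small write per record.
import csv

def write_rows_batched(rows_iter, csv_path, batch_size=10_000):
    batch = []
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows_iter:
            batch.append(row)
            if len(batch) >= batch_size:
                writer.writerows(batch)   # one big write instead of thousands of tiny ones
                batch.clear()
        if batch:                         # flush the remainder
            writer.writerows(batch)
```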
