
Python, use multithreading in a for loop

I would like to understand if there is any way to use multithreading in a for loop. I have a big txt file (35GB); the script needs to split and strip each line and print the result to another txt file. The problem is that it's pretty slow and I would like to make it faster. I thought about using a lock, but I'm still not sure if it would work. Anyone have any ideas? Thanks :D

TL;DR the comments:

You are almost guaranteed to be limited by the read speed of your hard drive if the computation you are doing on each line is relatively light. Do some real profiling of your code to find where the slowdown actually is. If the data you are writing to file is much smaller than your 35GB input (i.e., it would all fit in RAM), you might get a speedup by writing it only after the read is complete, allowing the drive to work entirely sequentially (then again, maybe not).
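A minimal sketch of that read-first, write-after idea, assuming the processed output really does fit in RAM; the file names and the split-to-CSV work are placeholders matching the profiling example below:

def convert_then_write(src='bigfile.txt', dst='outfile.csv'):
    results = []
    with open(src, 'r') as fin:
        for line in fin:
            # all per-line work happens during the one sequential read pass
            results.append(','.join(line.split()) + '\n')
    # a single sequential write pass after the read is finished
    with open(dst, 'w') as fout:
        fout.writelines(results)

Whether this actually beats interleaved read/write depends on the drive and the OS cache, so profile both versions.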

Example of profiling the text-file-to-CSV conversion:

from cProfile import Profile

def main(debug=False):
    maxdata = 1000000  # read at most (roughly) `maxdata` bytes from the file when debug == True
    with open('bigfile.txt', 'r') as fin:
        with open('outfile.csv', 'w') as fout:
            for line in fin:
                # split on whitespace and rejoin with commas to convert to csv
                fout.write(','.join(line.split()) + '\n')
                if debug and fin.tell() >= maxdata:
                    break

profiler = Profile()  # cProfile.Profile must be instantiated before use
profiler.enable()
main(debug=True)
profiler.disable()
profiler.print_stats()

On SSDs and HDDs:

As others have pointed out, your main constraint here is going to be your hard drive. If you're using an HDD rather than an SSD, you're actually going to see a decrease in performance by having multiple threads read from the disk at the same time, because they will be requesting randomly distributed blocks of data instead of reading sequentially.

If you look at how a hard drive works, it has a head that must seek (scan) to find the location of the data you're attempting to read. If you have multiple threads, they will still be limited by the fact that the hard drive can only read one block at a time. Hard drives perform well when reading/writing sequentially, but not when reading/writing from random locations on the disk.

On the other hand, if you look at how a solid state drive works, it is the opposite. A solid state drive does better at reading from random places in storage: SSDs have no seek latency, which makes them good at reading from multiple places on the disk.

The optimal structure of your program will look different depending on whether you're using an HDD or an SSD.


Optimal HDD Solution:

Assuming you're using an HDD for storage, your optimal solution looks something like the following (a code sketch follows the list):

  1. Read a large chunk of data into memory from the main thread. Be sure you read in increments of your block size, which will increase performance.

    • If your HDD stores data in blocks of 4kB (4096 bytes), you should read in multiples of 4096. Most modern disk sectors (another term for blocks) are 4kB; older legacy disks have 512-byte sectors. You can find out how big your blocks/sectors are by using lsblk or fdisk on Linux.
    • You will need to experiment with different multiples of your block size, steadily increasing the amount of data you read, to see what size gives the best performance. If you read too much data at once, your program will be inefficient (because of read speeds). If you don't read enough data at once, your program will also be inefficient (because of too many reads).
    • I'd start with 10 times your block size, then 20 times, then 30 times, until you find the optimal amount of data to read in at once.
  2. Once your main thread has read from disk, you can spawn multiple threads to process the data.

    • Since Python has a GIL (global interpreter lock) for thread safety, you may want to use multiprocessing instead. The multiprocessing library is very similar to the threading library.
  3. While the child threads/processes are processing the data, have the main thread read in another chunk of data from the disk. Wait until the children have finished before spawning more work, and keep repeating this process.
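Here is a minimal sketch of that pattern using multiprocessing, with the per-line work taken from the question's split-to-CSV example. The chunk size (10x a 4kB block), worker count, and file names are placeholder assumptions to tune, not measured optima:

from multiprocessing import Pool

CHUNK_SIZE = 4096 * 10  # start around 10x a 4kB block and tune from there

def convert(line):
    # the per-line work from the question: split on whitespace, rejoin with commas
    return ','.join(line.split()) + '\n'

def read_chunks(fin):
    # yield lists of complete lines, reading roughly CHUNK_SIZE characters at a time
    leftover = ''
    while True:
        chunk = fin.read(CHUNK_SIZE)
        if not chunk:
            if leftover:
                yield [leftover]
            return
        lines = (leftover + chunk).split('\n')
        leftover = lines.pop()  # the last piece may be a partial line; carry it over
        yield lines

def main():
    with Pool() as pool, \
            open('bigfile.txt', 'r') as fin, \
            open('outfile.csv', 'w') as fout:
        pending = None  # results of the batch the workers are currently processing
        for lines in read_chunks(fin):
            result = pool.map_async(convert, lines)  # step 2: hand the batch to children
            if pending is not None:
                fout.writelines(pending.get())  # write out the batch before this one
            pending = result  # step 3: loop back and read the next chunk
        if pending is not None:
            fout.writelines(pending.get())

if __name__ == '__main__':
    main()

map_async lets the main process go back to reading the next chunk while the workers process the previous batch, which matches steps 1-3 above.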
