Python, use multithreading in a for loop
I would like to understand if there is any way to use multithreading in a for loop. I have a big txt file (35 GB), and the script needs to split and strip each line and print the result to another txt file. The problem is that it's pretty slow, and I would like to make it faster. I thought about using a lock, but I'm still not sure it would work. Does anyone have any ideas?
Thanks :D
TL;DR the comments:
You are almost guaranteed to be limited by the read speed of your hard drive if the computation you are doing on each line is relatively light. Do some real profiling of your code to find where the slowdown actually is.
If the data you are writing to file is much smaller than your 35 GB input (i.e. it would all fit in RAM), you might find a speedup by writing it only after the read is complete, which lets the drive work entirely sequentially (then again, maybe not).
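As a rough sketch of that buffered-write idea, assuming the output fits in RAM (the helper name `convert_buffered` and the paths are made up for illustration):

```python
def convert_buffered(src_path, dst_path):
    # Read and process the whole input first, buffering results in RAM,
    # then write everything in one pass so the drive never alternates
    # between reading and writing.
    results = []
    with open(src_path, 'r') as fin:
        for line in fin:
            results.append(','.join(line.split()))
    with open(dst_path, 'w') as fout:
        fout.write('\n'.join(results) + '\n')
```

Whether this actually helps depends on the drive and OS caching, which is exactly what profiling should tell you.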
Example of profiling a script that converts a text file to CSV:
from cProfile import Profile

def main(debug=False):
    maxdata = 1000000  # read at most (roughly) `maxdata` bytes from the file if debug == True
    with open('bigfile.txt', 'r') as fin:
        with open('outfile.csv', 'w') as fout:
            for line in fin:
                fout.write(','.join(line.split()) + '\n')  # split on whitespace to convert to csv
                if debug and fin.tell() >= maxdata:
                    break

profiler = Profile()  # Profile must be instantiated; enable/disable are instance methods
profiler.enable()
main(debug=True)
profiler.disable()
profiler.print_stats()
As others have pointed out, your main constraint here is going to be your hard drive. If you're using an HDD rather than an SSD, you will actually see a decrease in performance by having multiple threads read from the disk at the same time, because they will be requesting randomly distributed blocks of data rather than reading sequentially.
If you look at how a hard drive works, it has a head that must seek (scan) to find the location of the data you're attempting to read. Even with multiple threads, you are still limited by the fact that the hard drive can only read one block at a time. Hard drives perform well when reading/writing sequentially, but poorly when reading/writing from random locations on the disk.
A solid state drive, on the other hand, is the opposite: it does better at reading from random places in storage. SSDs have no seek latency, which makes them great at reading from multiple places on the disk.
The optimal structure of your program will look different depending on whether you're using an HDD or an SSD.
Assuming you're using an HDD for storage, your optimal solution looks something like this:
Read a large chunk of data into memory from the main thread. Be sure you read in increments of your block size, which will increase performance. You can find your block/sector size with lsblk or fdisk on Linux. Once your main thread has read a chunk from disk, you can spawn multiple threads to process the data.
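You can also ask the OS for the filesystem's preferred I/O block size from Python instead of running lsblk by hand. A sketch (the function name and chunk multiplier are illustrative; `st_blksize` is a POSIX field of `os.stat`, so this assumes Linux/macOS):

```python
import os

def read_block_aligned(path, blocks_per_chunk=4096):
    # st_blksize is the filesystem's preferred I/O block size (POSIX);
    # reading in multiples of it keeps reads aligned and large.
    blksize = os.stat(path).st_blksize
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(blksize * blocks_per_chunk)
            if not chunk:
                break
            yield chunk
```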
The multiprocessing library is very similar to the threading library. While the child threads/processes are processing the data, have the main thread read in another chunk of data from the disk. Wait until the children have finished before spawning more for processing, and keep repeating this process.
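Putting the pieces together, a minimal sketch of that read-then-dispatch loop with multiprocessing.Pool might look like this (the chunk size, worker count, and helper names are illustrative, not tuned):

```python
from multiprocessing import Pool

def process_chunk(lines):
    # the per-line work: strip, split on whitespace, rejoin with commas
    return [','.join(line.split()) for line in lines]

def read_chunks(path, lines_per_chunk=100_000):
    # generator: the main process reads the next chunk while the
    # worker processes are still busy with earlier ones
    chunk = []
    with open(path, 'r') as fin:
        for line in fin:
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

def convert_parallel(src, dst, workers=4):
    with Pool(workers) as pool, open(dst, 'w') as fout:
        # imap preserves chunk order and overlaps reading with processing
        for processed in pool.imap(process_chunk, read_chunks(src)):
            fout.write('\n'.join(processed) + '\n')
```

Note that if the per-line work really is this trivial, the pickling overhead of shipping chunks to workers may eat the gains, which is another reason to profile first.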