How to parse a large file taking advantage of threading in Python?

I have a huge file that I need to read and process.

with open(source_filename) as source, open(target_filename) as target:
    for line in source:
        target.write(do_something(line))

    do_something_else()

Can this be accelerated with threads? If I spawn a thread per line, will this have a huge overhead cost?

edit: To keep this question from being a discussion, what should the code look like?

with open(source_filename) as source, open(target_filename) as target:
   ?

@Nicoretti: In each iteration I need to read a line of several KB of data.

update 2: the file may be a bz2, so Python may have to wait for decompression:

$ bzip2 -dc country.osm.bz2 | ./my_script.py
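(As a sketch of one alternative, assuming the script is handed the .bz2 file directly rather than reading from a pipe, the standard bz2 module can do the decompression inside the reading loop, so the waiting happens in whichever thread does the reading; do_something and target_filename are the names from the question:)

import bz2

source = bz2.BZ2File("country.osm.bz2")   # decompresses lazily as lines are read
with open(target_filename, "w") as target:
    for line in source:
        target.write(do_something(line))
source.close()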

You could use three threads: for reading, processing and writing. The possible advantage is that the processing can take place while waiting for I/O, but you need to take some timings yourself to see if there is an actual benefit in your situation.

import threading
import Queue

QUEUE_SIZE = 1000            # bound the queues so memory use stays limited
sentinel = object()          # unique marker signalling "no more data"

def read_file(name, queue):
    with open(name) as f:
        for line in f:
            queue.put(line)
    queue.put(sentinel)

def process(inqueue, outqueue):
    for line in iter(inqueue.get, sentinel):
        outqueue.put(do_something(line))
    outqueue.put(sentinel)

def write_file(name, queue):
    with open(name, "w") as f:
        for line in iter(queue.get, sentinel):
            f.write(line)

inq = Queue.Queue(maxsize=QUEUE_SIZE)
outq = Queue.Queue(maxsize=QUEUE_SIZE)

threading.Thread(target=read_file, args=(source_filename, inq)).start()
threading.Thread(target=process, args=(inq, outq)).start()
write_file(target_filename, outq)  # writing runs in the main thread

It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.

Does the processing stage take a relatively long time, i.e., is it CPU-intensive? If not, then no, you don't win much by threading or multiprocessing it. If your processing is expensive, then yes. So, you need to profile to know for sure.

If you spend relatively more time reading the file than processing it, i.e. it is big, then you can't win in performance by using threads; the bottleneck is just the I/O, which threads don't improve.
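(A minimal way to get that timing, sketched here rather than a full profiler; do_something is the processing function from the question, and the file is read twice, once without processing and once with:)

import time

def time_stages(source_filename, do_something):
    # Pass 1: read only, to measure pure I/O time.
    start = time.time()
    with open(source_filename) as source:
        for line in source:
            pass
    read_time = time.time() - start

    # Pass 2: read and process, to measure I/O plus CPU time.
    start = time.time()
    with open(source_filename) as source:
        for line in source:
            do_something(line)
    total_time = time.time() - start

    print("reading: %.2fs, processing: %.2fs" % (read_time, total_time - read_time))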

This is the exact sort of thing which you should not try to analyse a priori, but instead should profile.

Bear in mind that threading will only help if the per-line processing is heavy. An alternative strategy would be to slurp the whole file into memory, and process it in memory, which may well obviate threading.
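(A minimal sketch of that alternative, assuming the file fits in RAM and reusing do_something and the filenames from the question:)

with open(source_filename) as source:
    lines = source.readlines()                 # slurp the whole file into memory

results = [do_something(line) for line in lines]

with open(target_filename, "w") as target:
    target.writelines(results)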

Whether you have a thread per line is, once again, something for fine-tuning, but my guess is that unless parsing the lines is pretty heavy, you may want to use a fixed number of worker threads, as in the sketch below.
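(The following is only a sketch of such a fixed pool, assuming four workers and the do_something function from the question; note that it does not preserve the order of output lines, a caveat picked up again in a later answer.)

import threading
import Queue

NUM_WORKERS = 4                      # fixed-size pool instead of a thread per line
tasks = Queue.Queue(maxsize=1000)
done = object()                      # sentinel telling a worker to stop

def worker(out, lock):
    for line in iter(tasks.get, done):
        result = do_something(line)
        with lock:                   # serialize writes to the shared output file
            out.write(result)
    tasks.put(done)                  # relay the sentinel to the next worker

with open(source_filename) as source, open(target_filename, "w") as target:
    lock = threading.Lock()
    workers = [threading.Thread(target=worker, args=(target, lock))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    for line in source:
        tasks.put(line)
    tasks.put(done)                  # one sentinel; workers pass it along
    for w in workers:
        w.join()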

There is another alternative: spawn sub-processes, and have them do the reading and the processing. Given your description of the problem, I would expect this to give you the greatest speed-up. You could even use some sort of in-memory caching system to speed up the reading, such as memcached (or any of the similar-ish systems out there, or even a relational database).
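(One way to sketch the sub-process route with the standard multiprocessing module; this assumes do_something is defined at module level so it can be pickled, and is not the only possible arrangement:)

import multiprocessing

def run(source_filename, target_filename):
    pool = multiprocessing.Pool()        # one worker process per CPU core by default
    with open(source_filename) as source, open(target_filename, "w") as target:
        # imap distributes lines across the processes but yields results in input order
        for result in pool.imap(do_something, source, chunksize=100):
            target.write(result)
    pool.close()
    pool.join()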

In CPython, threading is limited by the global interpreter lock: only one thread at a time can actually be executing Python code. So threading only benefits you if either:

  1. you are doing processing that doesn't require the global interpreter lock; or

  2. you are spending time blocked on I/O.

Examples of (1) include applying a filter to an image in the Python Imaging Library, or finding the eigenvalues of a matrix in numpy. Examples of (2) include waiting for user input, or waiting for a network connection to finish sending data.

So whether your code can be accelerated using threads in CPython depends on what exactly you are doing in the do_something call. (But if you are parsing the line in Python then it is very unlikely that you can speed this up by launching threads.) You should also note that if you do start launching threads then you will face a synchronization problem when you are writing the results to the target file. There is no guarantee that threads will complete in the same order that they were started, so you will have to take care to ensure that the output comes out in the right order.

Here's a maximally threaded implementation that has threads for reading the input, writing the output, and one thread for processing each line. Only testing will tell you if this is faster or slower than the single-threaded version (or Janne's version with only three threads).

from threading import Thread
from Queue import Queue

def process_file(f, source_filename, target_filename):
    """
    Apply the function `f` to each line of `source_filename` and write
    the results to `target_filename`. Each call to `f` is evaluated in
    a separate thread.
    """
    worker_queue = Queue()
    finished = object()

    def process(queue, line):
        "Process `line` and put the result on `queue`."
        queue.put(f(line))

    def read():
        """
        Read `source_filename`, create an output queue and a worker
        thread for every line, and put that worker's output queue onto
        `worker_queue`.
        """
        with open(source_filename) as source:
            for line in source:
                queue = Queue()
                Thread(target = process, args=(queue, line)).start()
                worker_queue.put(queue)
        worker_queue.put(finished)

    Thread(target = read).start()
    with open(target_filename, 'w') as target:
        for output in iter(worker_queue.get, finished):
            target.write(output.get())
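(A usage sketch, assuming do_something and the filenames from the question:)

process_file(do_something, source_filename, target_filename)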
