Python different operations on the same resources

I'm trying to run several different tasks over a large file in Python. I've already read and preprocessed the file, and it's now in memory. The thing is, each task pretty much has to go through the whole list of records. It looks something like this:

resourceList = [...]  # list of records from the file (say, 2GB)

def taskA():
    for i in resourceList:
        pass  # doSthA()

def taskB():
    for i in resourceList:
        pass  # doSthB()

If I call taskA() and then taskB(), the program goes through the 2GB of data twice, which is really slow. Is there a way for taskA and taskB to do their work at the same time, so that I don't have to go through the list twice?

I read about an approach involving Python threads and Queue. Is that the only (and right) way to do it? If so, what if "resourceList" is a generator instead of a list?

Thanks!

I'd implement this using threading (I find the problem easier to reason about when each task is a separate thread, and I'd choose threading over multiprocessing so that the data can be shared), passing each function a queue that it can iterate over:

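import threading
from queue import Queue  # Python 3; on Python 2 the module is named Queue

class IterableQueue(Queue):
    # Unique marker object that signals the end of the stream
    _sentinel = object()

    def __iter__(self):
        # Call self.get() repeatedly until it returns the sentinel
        return iter(self.get, self._sentinel)

    def close(self):
        # Enqueue the sentinel so the consumer's for-loop ends
        self.put(self._sentinel)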

def taskA(resources):
    for resource in resources:
        do_stuff()

def taskB(resources):
    for resource in resources:
        do_stuff()

def start_thread(task):
    queue = IterableQueue(maxsize=1)
    thread = threading.Thread(target=task, args=(queue, ))
    thread.start()
    return (thread, queue)

threads = [
    start_thread(taskA),
    start_thread(taskB),
]

resource_list = [...]

for resource in resource_list:
    for _, queue in threads:
        queue.put(resource)

# Signal end of input so each task's loop terminates, then wait for it.
for thread, queue in threads:
    queue.close()
    thread.join()
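
On the generator question: the queue-based approach above only ever iterates over the input once, so it should work the same way when the input is a generator. The following is a minimal, self-contained sketch of that idea for Python 3, not a definitive implementation. The names `fan_out`, `consume`, `count_records` and `total_length` are hypothetical, and the two task functions are simple stand-ins for whatever taskA and taskB really do.

```python
import threading
from queue import Queue

_SENTINEL = object()  # marks the end of the stream


def consume(queue, func, results, key):
    """Run func over the items arriving on queue and store its return value."""
    results[key] = func(iter(queue.get, _SENTINEL))


def fan_out(resources, tasks):
    """Feed every item of `resources` to each task in a single pass.

    `tasks` maps a name to a function that accepts an iterable.
    Assumes every task consumes its input to the end.
    """
    results = {}
    workers = []
    for name, func in tasks.items():
        queue = Queue(maxsize=1)
        thread = threading.Thread(target=consume,
                                  args=(queue, func, results, name))
        thread.start()
        workers.append((thread, queue))

    for resource in resources:      # the only pass over the data
        for _, queue in workers:
            queue.put(resource)

    for thread, queue in workers:
        queue.put(_SENTINEL)        # tell the task its input is finished
        thread.join()
    return results


def count_records(records):         # hypothetical stand-in for taskA
    return sum(1 for _ in records)


def total_length(records):          # hypothetical stand-in for taskB
    return sum(len(record) for record in records)


# A generator can only be consumed once, which is all fan_out needs.
record_stream = (f"record-{i}" for i in range(1000))
results = fan_out(record_stream,
                  {"count": count_records, "length": total_length})
```

Note that on standard CPython builds the global interpreter lock keeps the two tasks from running CPU-bound Python code in parallel, so the benefit here is the single pass over the input rather than a speed-up from extra cores.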
