
Dividing and conquering etree.iterparse using multiprocessing

So let's imagine a large XML document (file size > 100 MB) that we want to parse incrementally using cElementTree.iterparse.

But all those cores Intel promised us were supposed to be worthwhile, so how do we put them to use? Here's what I want:

from itertools import islice
from xml.etree import ElementTree as etree

# open in binary mode: iterparse expects a byte stream and handles
# the encoding declared in the XML itself
tree_iter = etree.iterparse(open("large_file.xml", "rb"))

# split the element stream in two; each half would then be parsed
# on a separate processor
first = islice(tree_iter, 0, 10000)
second = islice(tree_iter, 10000)

parse_first()
parse_second()

There seem to be several problems with this, not least that the iterator returned by iterparse() resists slicing.

Is there any way to divide the parsing workload of a large XML document into two or four separate tasks (without loading the entire document into memory), the purpose being to then execute those tasks on separate processors?
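One common workaround for what the question asks: the iterparse loop itself stays sequential in one process, but the per-element work is farmed out to a `multiprocessing.Pool`. This is only a minimal sketch; the XML sample, `process_record`, and `parallel_parse` are made-up placeholders, and it pays off only when the per-element work dominates the parse time.

```python
import io
from multiprocessing import Pool
from xml.etree import ElementTree as etree

def process_record(text):
    # placeholder for the expensive per-element work
    return len(text)

def parallel_parse(xml_bytes, workers=4):
    # sequentially pull elements out of the stream...
    payloads = []
    for _, elem in etree.iterparse(io.BytesIO(xml_bytes)):
        if elem.tag == "record":
            payloads.append(elem.text)
            elem.clear()  # free the element so memory stays bounded
    # ...then process them on separate processors
    with Pool(workers) as pool:
        return pool.map(process_record, payloads)

if __name__ == "__main__":
    xml = b"<root><record>aa</record><record>bbbb</record></root>"
    print(parallel_parse(xml))
```

Collecting all payloads before mapping keeps the sketch short; for a 100 MB file you would likely feed the pool in chunks (e.g. `pool.imap` over the loop) instead of building one big list.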

I think you need a good thread pool with a task queue for this. I found (and use) this very good one (it's in Python 3, but shouldn't be too hard to convert to 2.x):

# http://code.activestate.com/recipes/577187-python-thread-pool/

from queue import Queue
from threading import Thread

class Worker(Thread):
    # consumes tasks from the shared queue until the program exits
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True  # don't keep the process alive on exit
        self.start()

    def run(self):
        while True:
            func, args, kargs = self.tasks.get()
            try:
                func(*args, **kargs)
            except Exception as exception:
                print(exception)
            self.tasks.task_done()

class ThreadPool:
    # pool of num_threads Workers feeding off a bounded task queue
    def __init__(self, num_threads):
        self.tasks = Queue(num_threads)
        for _ in range(num_threads):
            Worker(self.tasks)

    def add_task(self, func, *args, **kargs):
        self.tasks.put((func, args, kargs))

    def wait_completion(self):
        self.tasks.join()

Now you can just run the loop on the iterparse and let the thread pool divide the work for you. Using it is as simple as this:

def executetask(arg):
    print(arg)

# assuming the recipe above is saved as threadpool.py
workers = threadpool.ThreadPool(4)  # 4 is the number of threads
for i in range(100):
    workers.add_task(executetask, i)

workers.wait_completion()  # only needed if you must be certain all work is done before continuing
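Tying it back to the question, the iterparse loop feeds each element to the pool as a task. A minimal sketch of that combination, using the stdlib `concurrent.futures.ThreadPoolExecutor` in the same role as the recipe's ThreadPool (the XML sample and `handle_element` are made up for illustration; note that with CPython's GIL, threads mainly help when the per-element work releases the GIL, e.g. I/O):

```python
import io
from concurrent.futures import ThreadPoolExecutor
from xml.etree import ElementTree as etree

def handle_element(tag, text):
    # placeholder for the real per-element work
    return (tag, len(text or ""))

def parse_with_pool(xml_bytes, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        # the parse loop stays sequential; the pool runs the tasks
        for _, elem in etree.iterparse(io.BytesIO(xml_bytes)):
            if elem.tag == "item":
                futures.append(pool.submit(handle_element, elem.tag, elem.text))
                elem.clear()  # keep memory bounded
        # results come back in submission order
        return [f.result() for f in futures]

if __name__ == "__main__":
    sample = b"<feed><item>one</item><item>three</item></feed>"
    print(parse_with_pool(sample))
```

Submitting the raw tag and text (rather than the Element itself) keeps the worker tasks independent of the tree, so elements can be cleared as soon as they are dispatched.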
