
threading issues using lxml.etree.iterparse

I have a thread which spawns multiple consumer processes that do some heavy processing work on large XML files.

My design for this was to use a simple single thread for parsing the inbound stream on the fly and pushing new objects into a multiprocessing.queues.Queue held by a buffer/process manager. The process manager periodically checks the size of the queue, and if consumption is letting the queue fill up too quickly it kicks off another consumer.
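For reference, a minimal sketch of that arrangement (the class and function names here are assumptions, not my actual ImportBuffer implementation) would look something like this:

import multiprocessing
import os

def consume(queue, results):
    # Hypothetical consumer loop: drain the queue until a None sentinel arrives.
    for item in iter(queue.get, None):
        results.append(item)  # stand-in for the heavy per-item processing

class ImportBuffer:
    '''Sketch of the buffer/process manager described above (names are assumptions).'''
    def __init__(self, backlog_limit=500, max_consumers=None):
        self.queue = multiprocessing.Queue()      # one queue per buffer instance
        self._manager = multiprocessing.Manager()
        self.results_list = self._manager.list()  # shared, managed results
        self.running = []
        self.backlog_limit = backlog_limit
        self.max_consumers = max_consumers or os.cpu_count() or 2

    def add(self, item):
        self.queue.put(item)
        # Crude backlog check; Queue.qsize() is approximate and raises
        # NotImplementedError on macOS, so real code might track counts itself.
        try:
            backlog = self.queue.qsize()
        except NotImplementedError:
            backlog = 0
        if (not self.running or backlog > self.backlog_limit) \
                and len(self.running) < self.max_consumers:
            proc = multiprocessing.Process(target=consume,
                                           args=(self.queue, self.results_list))
            proc.start()
            self.running.append(proc)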

My problem is that the code that joins the closed queue once stream parsing has completed is executing before the XML has finished being parsed!? That doesn't seem to be how the following code is supposed to work. Keep in mind that the following code is completely single-threaded; it is neither called nor used by any SMP code:

clear_ok = False
context = lxml.etree.iterparse(response, events=('end',))
for event, elem in context:
    # Use QName to avoid specifying or stripping the namespace, which we don't need
    if lxml.etree.QName(elem.tag).localname.upper() in obj_elem_map:
        import_buffer.add(obj_elem_map[lxml.etree.QName(elem.tag).localname.upper()](elem=elem))
        clear_ok = True
    if clear_ok:
        elem.clear()  # don't fill up a DOM we don't need
        clear_ok = False
results = import_buffer.finish() if block else import_buffer

When import_buffer.finish() is called, the following happens:

def finish(self):
    '''
    Notifies the buffer that we are done filling it.
    This command binds to any processes still running and lets them
    finish and then copies and flushes the managed results list.
    '''
    # close the queue and wait until it is consumed
    self.queue.close()
    self.queue.join_thread()
    # make sure the consumers are done consuming the queue
    for csmr in self.running:
        csmr.join()
    # turn this into a list instead of a managed list
    result = list(self.results_list)
    del self.results_list[:]
    if self.callback:
        return self.callback(result)
    else:
        return result

However, I'm getting an exception indicating that close() has been called on the queue before I've finished parsing:

Traceback (most recent call last):
  File "./tests/test_smp_framework.py", line 103, in test_kqb_parser_fromfile
    qkbobs = actions.queryQKB(file=fname)
  File "/Users/skyleach/src/smpparser/smpparser/api_actions.py", line 339, in queryQKB
    result = self.parseResponse(source=sourcefile)
  File "/Users/skyleach/src/smpparser/smpparser/smpapi.py", line 535, in parseResponse
    import_buffer.add(obj_elem_map[lxml.etree.QName(elem.tag).localname.upper()](elem=elem))
  File "/Users/skyleach/src/smpparser/smpparser/smpapi.py", line 212, in add
    self.queue.put(item)
  File "/usr/local/Cellar/python3/3.5.0/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/queues.py", line 81, in put
    assert not self._closed
AssertionError

This wasn't lxml's fault or multiprocessing's fault; it was an initialization problem with where the queue was assigned. Basically I coded too fast and made a dumb mistake.

In my buffer class definition I was setting up the queue at the class level rather than in the __init__ function. This means that all instances of the class shared a single queue, defined once when the module was imported by the importing thread.
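In other words (a minimal illustration with made-up class names, not my original code):

import multiprocessing

class BrokenBuffer:
    # Evaluated once at import time: every instance shares this single queue,
    # so closing it through one instance closes it for all of them.
    queue = multiprocessing.Queue()

class FixedBuffer:
    def __init__(self):
        # Evaluated per instance: each buffer gets its own queue.
        self.queue = multiprocessing.Queue()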

That's what I get for writing code too fast.
