Parse file with several xml documents using lxml

So, I assume this is a pretty typical use case, but I can't find anything about support for it in the lxml documentation. I've got an XML file that consists of a number of distinct XML documents (reviews, in particular). The structure is approximately:

<review>
    <!-- A bunch of metadata -->
</review>
<!-- The issue is here -->
<review>
    <!-- A bunch of metadata -->
</review>

I try to read the file like so:

import lxml.etree  # a bare "import lxml" does not load the etree submodule

document = lxml.etree.fromstring(open(xml_file).read())

But I get an error when I do so:

lxml.etree.XMLSyntaxError: Extra content at the end of the document

A totally reasonable error; the input really is malformed XML and should be treated as such. But my question is: how do I get lxml to recognize that this is a list of XML documents and parse it accordingly?

list_of_reviews = lxml.magic(open(xml_file).read())

Is there a real lxml function that does what the made-up magic call above stands for?

So, it's a little hacky, but it should be relatively robust. There are two main drawbacks:

  • Repeated calls to fromstring mean that this code isn't especially fast: about the same speed as parsing each document individually, and much slower than if it were all one document.
  • Error positions are reported relative to the current location in the remaining text, not the original file. It would be easy to add absolute location support (just add an accumulator to keep track of the current location).

Basically, the approach is to catch the thrown error and then parse just the section of the file above the error location. If an error is thrown that isn't related to the end of a root node, it is handled like a typical exception.

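import lxml.etree
from lxml.etree import XMLSyntaxError


def fix_xml_list(test_file):
    """Split a string of concatenated XML documents into parsed root elements."""
    documents = []
    while test_file.strip():
        try:
            # Succeeds only when what remains is a single well-formed document.
            documents.append(lxml.etree.fromstring(test_file))
            break
        except XMLSyntaxError as e:
            # Code 5 is "Extra content at the end of the document"; column 1
            # means another root element starts on line e.position[0].
            if e.code == 5 and e.position[1] == 1:
                doc_end = e.position[0]
                # Index of the newline that ends the previous document.
                end_char = find_nth(test_file, '\n', doc_end - 2)
                documents.append(lxml.etree.fromstring(test_file[:end_char]))
                test_file = test_file[end_char:]
            else:
                # Any other syntax error is a genuine problem: report it and stop.
                print(e)
                break
    return documents


def find_nth(doc, search, n=0):
    """Return the index of the n-th (0-based) occurrence of search, or -1."""
    l = len(search)
    i = -l
    for c in range(n + 1):
        i = doc.find(search, i + l)
        if i < 0:
            break
    return i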
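
The accumulator mentioned in the second drawback could look something like the sketch below. It is only an illustration: split_at_line, advance and absolute_line are hypothetical helper names, not part of lxml or of the code above. split_at_line cuts the text the same way fix_xml_list does, so it runs without lxml. Because the remaining text keeps the newline that ended the previous document, each split shifts line numbers by doc_end - 2.

```python
def split_at_line(text, doc_end):
    """Cut text just before 1-based line doc_end, the way fix_xml_list does.

    The tail keeps the newline that ended line doc_end - 1.
    """
    end_char = -1
    for _ in range(doc_end - 1):
        end_char = text.find("\n", end_char + 1)
    return text[:end_char], text[end_char:]


def advance(offset, doc_end):
    """Update the accumulator after a split reported at relative line doc_end."""
    # Line 1 of the tail is the empty remainder of the old line doc_end - 1,
    # so line 2 of the tail is the old line doc_end.
    return offset + doc_end - 2


def absolute_line(offset, relative_line):
    """Translate a line number in the remaining text to the original file."""
    return offset + relative_line


text = "a\nb\nc\nd\ne\nf"
offset = 0
head, tail = split_at_line(text, 3)   # pretend the parser stopped at line 3
offset = advance(offset, 3)
print(absolute_line(offset, 2))       # line 2 of the tail, in original numbering
```

In fix_xml_list, the accumulator would be updated right after test_file is sliced, and absolute_line would be applied to e.position[0] before reporting an error.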

The find_nth code is shamelessly stolen from this question. There may not be many situations where this code is deeply useful, but for me, with a large number of slightly irregular documents (very common with academic data), it's invaluable.

XML documents must have a single root element; otherwise, they are not well-formed and are, in fact, not XML. Conformant parsers cannot parse non-well-formed "XML".
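
A quick way to see this, as a minimal sketch: the helper is_well_formed and the sample strings are made up for illustration, and the import falls back to the standard library's xml.etree.ElementTree when lxml is not installed (an assumption made only so the snippet runs anywhere; both expose fromstring and ParseError).

```python
try:
    from lxml import etree  # preferred, if installed
except ImportError:
    # Assumption: the standard library parser is an acceptable stand-in here.
    import xml.etree.ElementTree as etree


def is_well_formed(text):
    """Return True if text parses as a single XML document."""
    try:
        etree.fromstring(text)
    except etree.ParseError:
        return False
    return True


ONE_ROOT = "<review><rating>5</rating></review>"
TWO_ROOTS = ONE_ROOT + "\n" + ONE_ROOT  # two root elements in one string

print(is_well_formed(ONE_ROOT))
print(is_well_formed(TWO_ROOTS))
```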

When you construct your single XML document out of multiple documents, simply wrap the disparate root elements in a single root element. Then you'll be able to use standard parsers such as lxml.
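
The wrapping step might be sketched as follows. The function name parse_wrapped and the wrapper element name reviews are invented for the example, and the sketch assumes the individual documents carry no XML declaration or DOCTYPE of their own (either would still be an error inside the wrapper). As a convenience, the import falls back to the standard library if lxml is unavailable.

```python
try:
    from lxml import etree  # preferred, if installed
except ImportError:
    # Assumption: the standard library parser is an acceptable stand-in here.
    import xml.etree.ElementTree as etree


def parse_wrapped(text, wrapper="reviews", child="review"):
    """Wrap concatenated documents in one root and return the child elements."""
    root = etree.fromstring("<{0}>{1}</{0}>".format(wrapper, text))
    # findall selects only direct children with the given tag, so comments
    # sitting between the documents are skipped.
    return root.findall(child)


SAMPLE = """<review>
    <!-- A bunch of metadata -->
</review>
<!-- The issue is here -->
<review>
    <!-- A bunch of metadata -->
</review>
"""

reviews = parse_wrapped(SAMPLE)
print(len(reviews))
```

This parses the input once, which avoids the repeated fromstring calls of the error-driven approach, at the cost of building the wrapped string in memory.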
