
Parse file with several xml documents using lxml

So, I assume this is a pretty typical use case, but I can't really find anything about support for this in the lxml documentation. Basically, I've got an XML file that consists of a number of distinct XML documents (reviews, in particular). The structure is approximately:

<review>
    <!-- A bunch of metadata -->
</review>
<!-- The issue is here -->
<review>
    <!-- A bunch of metadata -->
</review>

Basically, I try to read the file in like so:

import lxml.etree

document = lxml.etree.fromstring(open(xml_file).read())

But I get an error when I do so:

lxml.etree.XMLSyntaxError: Extra content at the end of the document

Totally reasonable error; the file really is not well-formed XML and should be treated as such. But my question is: how do I get lxml to recognize that this is a list of XML documents and to parse it accordingly?

list_of_reviews = lxml.magic(open(xml_file).read())

Is magic a real lxml function?

So, it's a little hacky, but should be relatively robust. There are two main negatives here:

  • Repeated calls to fromstring mean that this code isn't especially fast. It runs at about the same speed as parsing each document individually, and much slower than if it were all one document.
  • Error positions are reported relative to the part of the file that is still unparsed, not to the original file. It would be easy to add support for absolute locations (just add an accumulator that keeps track of the current location).
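The accumulator mentioned in the second point could look roughly like the sketch below. It is only an illustration: `to_absolute` and `lines_consumed` are hypothetical names, and it assumes that every cut falls on a newline, as it does in the approach described here.

```python
def to_absolute(lines_consumed, position):
    """Translate a (line, column) pair reported for the remaining slice
    into a position in the original file."""
    line, column = position
    # The column is only unchanged for lines after the first line of the
    # slice, because the slice starts at a newline.
    return (lines_consumed + line, column)


text = "<a>\n</a>\n<b>\n</b>\n"
lines_consumed = 0               # accumulator: newlines already cut off the front

end_char = text.find("\n", 4)    # the newline that ends the first document
first, rest = text[:end_char], text[end_char:]
lines_consumed += first.count("\n")

# In `rest`, "<b>" sits on line 2; in the original text it is on line 3.
print(to_absolute(lines_consumed, (2, 1)))
```

Adding `first.count("\n")` to the accumulator each time a document is cut off keeps reported line numbers meaningful for the whole file.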

Basically, the approach is to catch the thrown error and then parse just the section of the file above the error location. Any error other than "extra content at the end of the document" (that is, anything not caused by a second root node) is handled like a typical exception.

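from lxml import etree

def fix_xml_list(test_file):
    """Split a string of concatenated XML documents into parsed root elements."""
    documents = []
    while test_file.strip():
        try:
            # This succeeds only once a single document is left.
            documents.append(etree.fromstring(test_file))
            break
        except etree.XMLSyntaxError as e:
            # Code 5 is "Extra content at the end of the document"; column 1
            # means the extra content starts at the beginning of a line.
            if e.code == 5 and e.position[1] == 1:
                doc_end = e.position[0]
                # Index of the newline that precedes the offending line.
                end_char = find_nth(test_file, '\n', doc_end - 2)
                documents.append(etree.fromstring(test_file[:end_char]))
                test_file = test_file[end_char:]
            else:
                print(e)
                break
    return documents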

def find_nth(doc, search, n=0):
    """Return the index of the (n + 1)-th occurrence of search in doc, or -1."""
    step = len(search)
    i = -step
    for _ in range(n + 1):
        i = doc.find(search, i + step)
        if i < 0:
            break
    return i

The find_nth code is shamelessly stolen from this question. It's possible that there aren't many situations where this code is deeply useful, but for me with a large number of slightly irregular documents (very common with academic data) it's invaluable.

XML documents must have a single root element; otherwise, they are not well-formed, and are, in fact, not XML. Conformant parsers cannot parse non-well-formed "XML".

When you construct your single XML document out of multiple documents, simply wrap the disparate root elements in a single root element. Then you'll be able to use standard parsers such as lxml.
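As a minimal sketch of that wrapping step, assuming the individual documents carry no XML declaration or DOCTYPE of their own: `parse_review_list` and the `reviews` wrapper tag are hypothetical names chosen for illustration, and the sample data only mimics the structure shown in the question. The fallback import is there only so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same
    # fromstring/findall/findtext calls used below.
    import xml.etree.ElementTree as etree


def parse_review_list(text, child_tag="review", wrapper_tag="reviews"):
    """Wrap concatenated documents in one root element and parse them."""
    wrapped = "<{0}>{1}</{0}>".format(wrapper_tag, text)
    root = etree.fromstring(wrapped)
    # findall returns only the child elements, skipping comments in between.
    return root.findall(child_tag)


sample = """<review>
    <title>First</title>
</review>
<!-- The issue is here -->
<review>
    <title>Second</title>
</review>
"""

reviews = parse_review_list(sample)
print([r.findtext("title") for r in reviews])
```

Because the whole file is parsed in a single pass, this avoids the repeated fromstring calls of the error-driven approach, at the cost of requiring that the pieces really are bare elements.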
