So, I assume this is a pretty typical use case, but I can't find anything about support for it in the lxml documentation. I've got an XML file that consists of a number of distinct XML documents (reviews, in particular). The structure is approximately:
<review>
<!-- A bunch of metadata -->
</review>
<!-- The issue is here -->
<review>
<!-- A bunch of metadata -->
</review>
Basically, I try to read the file in like so:
import lxml.etree
document = lxml.etree.fromstring(open(xml_file).read())
But I get an error when I do so:
lxml.etree.XMLSyntaxError: Extra content at the end of the document
Totally reasonable error; in fact, it is an XML error and should be treated as such. But my question is: how do I get lxml to recognize that this is a list of XML documents and parse it accordingly?
list_of_reviews = lxml.magic(open(xml_file).read())
Is magic a real lxml function?
So, it's a little hacky, but should be relatively robust. There are two main negatives here:
Basically, the approach is to catch the thrown errors and then parse just the section of the file above the error. If an error unrelated to the lack of a single root node is thrown, it is handled like a typical exception.
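import lxml.etree
from lxml.etree import XMLSyntaxError

def fix_xml_list(test_file):
    """Split a string of concatenated XML documents into parsed root elements."""
    documents = []
    while True:
        try:
            # This succeeds only once a single document remains.
            documents.append(lxml.etree.fromstring(test_file))
            break
        except XMLSyntaxError as e:
            # Code 5 is "Extra content at the end of the document"; column 1
            # means the extra content starts at the beginning of a line.
            if e.code == 5 and e.position[1] == 1:
                doc_end = e.position[0]
                # Cut at the newline that ends the line above the error.
                end_char = find_nth(test_file, '\n', doc_end - 2)
                documents.append(lxml.etree.fromstring(test_file[:end_char]))
                test_file = test_file[end_char:]
            else:
                # Any other syntax error is reported and parsing stops.
                print(e)
                break
    return documents

def find_nth(doc, search, n=0):
    """Return the index of occurrence number n (0-based) of search in doc, or -1."""
    l = len(search)
    i = -l
    for c in range(n + 1):
        i = doc.find(search, i + l)
        if i < 0:
            break
    return i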
The find_nth code is shamelessly stolen from this question. It's possible that there aren't many situations where this code is deeply useful, but for me, with a large number of slightly irregular documents (very common with academic data), it's invaluable.
XML documents must have a single root element; otherwise, they are not well-formed and are, in fact, not XML. Conformant parsers cannot parse non-well-formed "XML".
When you construct your single XML document out of multiple documents, simply wrap the disparate root elements in a single root element. Then you'll be able to use standard parsers such as lxml.
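As a rough illustration of the wrapping idea, here is a minimal sketch. The helper name parse_review_list and the wrapper tag "reviews" are made up for this example, and it assumes the individual documents carry no XML declaration or DOCTYPE of their own (those would not be legal inside a wrapper element). It falls back to the standard library's ElementTree, which offers the same fromstring call, if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # The standard library exposes the same fromstring() call.
    import xml.etree.ElementTree as etree


def parse_review_list(text, wrapper="reviews"):
    """Parse concatenated XML documents by wrapping them in one root element.

    Hypothetical helper; assumes the documents have no XML declaration
    or DOCTYPE of their own.
    """
    root = etree.fromstring("<{0}>{1}</{0}>".format(wrapper, text))
    # lxml also yields comment nodes when iterating; their .tag is not a
    # string, so this keeps only the element children.
    return [child for child in root if isinstance(child.tag, str)]


sample = """<review>
<title>First</title>
</review>
<!-- separator -->
<review>
<title>Second</title>
</review>
"""

reviews = parse_review_list(sample)
print([r.findtext("title") for r in reviews])
```

If the file is too large to read into memory as one string, this sketch will not be a good fit as written; it is meant only to show that a single wrapper element is enough to make the input well-formed.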