Parsing several XML declarations in a single file with lxml.etree.iterparse

Question

I need to parse a file that contains several XML documents one after another, i.e. <xml></xml> <xml></xml> ... and so forth. When I use etree.iterparse, I get the following (correct) error:

lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document

Now, I could preprocess the input file and write each contained XML document to a separate file. This might be the easiest solution, but I wonder whether a proper solution to this 'problem' exists.

Thanks!

Answer 1

The sample data you've provided suggests one problem, while the question and the exception you've provided suggest another. Do you have multiple XML documents concatenated together, each with its own XML declaration, or do you have an XML fragment with multiple top-level elements?

If it's the former, then the solution is going to involve breaking the input stream up into multiple streams and parsing each one individually. This doesn't necessarily mean, as one comment suggests, implementing an XML parser. You can search a string for XML declarations without having to parse anything else in it, so long as your input doesn't include CDATA sections that contain unescaped XML declarations. You can write a file-like object that returns characters from the underlying stream until it hits an XML declaration, and then wrap it in a generator function that keeps returning streams until EOF is reached. It's not trivial, but it's not hugely difficult either.
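The splitting step can be sketched roughly as follows. This is a simplified, in-memory variant rather than the streaming file-like object described above: it assumes the whole input fits in memory as bytes and that no CDATA section contains a literal XML declaration. The names `XML_DECL`, `split_documents` and `parse_all` are made up for illustration, and the standard-library `xml.etree.ElementTree` is used so the sketch has no dependencies; `lxml.etree.fromstring` should work in its place.

```python
import re
import xml.etree.ElementTree as ET

# An XML declaration starts with "<?xml" followed by whitespace, which keeps
# processing instructions such as "<?xml-stylesheet ...?>" from matching.
XML_DECL = re.compile(rb"<\?xml\s")


def split_documents(data):
    """Yield each concatenated XML document in `data` (bytes) separately."""
    starts = [m.start() for m in XML_DECL.finditer(data)]
    # Content before the first declaration (or input with no declaration at
    # all) is treated as a document of its own unless it is only whitespace.
    if not starts or data[:starts[0]].strip():
        starts.insert(0, 0)
    for begin, end in zip(starts, starts[1:] + [len(data)]):
        chunk = data[begin:end].strip()
        if chunk:
            yield chunk


def parse_all(data):
    """Parse every document found in `data` and return the root elements."""
    return [ET.fromstring(chunk) for chunk in split_documents(data)]
```

For very large inputs, the same boundary search could be applied while reading the file in blocks, handing each completed chunk to the parser as soon as the next declaration is seen.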

If you have an XML fragment with multiple top-level elements, you can just wrap them in a single XML element and parse the whole thing.
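A minimal sketch of the wrapping approach, assuming the fragment is a string that contains no XML declaration of its own. The helper name `parse_fragment` and the wrapper tag `root` are arbitrary choices for illustration; the standard-library `xml.etree.ElementTree` is used here, and `lxml.etree.fromstring` should behave the same way for this input.

```python
import xml.etree.ElementTree as ET


def parse_fragment(fragment, wrapper="root"):
    """Parse a fragment with several top-level elements; return them as a list."""
    # Surround the fragment with one synthetic element so it becomes a
    # well-formed document with a single root.
    wrapped = "<{0}>{1}</{0}>".format(wrapper, fragment)
    # Iterating over the synthetic root yields the original top-level elements.
    return list(ET.fromstring(wrapped))


elements = parse_fragment("<item>1</item> <item>2</item>")
print([(element.tag, element.text) for element in elements])
```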

Of course, as with most problems involving bad XML input, the easiest solution may just be to fix the thing that's producing the bad input.

Answer 2

I used a regex to solve this problem. Suppose that data is a string that contains your multiple XML documents, and that handle is a function that does something with each document. After this loop finishes, data will be empty or will contain an incomplete XML document, and handle will have been called zero or more times.

import re

# `data` holds the concatenated documents; `handle` is called once per document.
while True:
    match = re.match(r'''
        \s*                 # ignore leading whitespace
        (                   # start first group
          <(?P<TAG>\S+).*?> # opening tag (with optional attributes)
            .*?             # stuff in the middle
          </(?P=TAG)>       # closing tag
        )                   # end of first xml document
        (?P<REM>.*)         # anything else
        ''',
        data, re.DOTALL | re.VERBOSE)
    if not match:
        break               # nothing left, or only an incomplete document
    document = match.group(1)
    handle(document)
    data = match.group('REM')
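As a usage illustration, the same loop can be wrapped in a function that returns whatever is left over. This is only a sketch: `DOCUMENT` and `consume_documents` are names made up here, and the pattern is the one from the answer, unchanged. Two limitations follow from the pattern as written: it expects each document to begin with its root element, so input carrying XML declarations would need them stripped first, and a root element that contains a nested element of the same name is cut short at the first closing tag.

```python
import re

# Same pattern as in the loop above, compiled once.
DOCUMENT = re.compile(r'''
    \s*                 # ignore leading whitespace
    (                   # start first group
      <(?P<TAG>\S+).*?> # opening tag (with optional attributes)
        .*?             # stuff in the middle
      </(?P=TAG)>       # closing tag
    )                   # end of first xml document
    (?P<REM>.*)         # anything else
    ''', re.DOTALL | re.VERBOSE)


def consume_documents(data, handle):
    """Call handle() on each complete document; return the unconsumed rest."""
    while True:
        match = DOCUMENT.match(data)
        if not match:
            return data
        handle(match.group(1))
        data = match.group('REM')


found = []
leftover = consume_documents('<a>1</a>\n<b x="1">2</b><c>part', found.append)
print(found, repr(leftover))
```

Returning the remainder makes it possible to append more input later and call the function again once the incomplete document has been completed.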

2 answers

Answer 1: score 3, accepted, 2011-04-13 17:16:03

Answer 2: score 0, 2011-05-12 20:03:37
