How to parse an XML file into a tree in Python

Note: I must use ElementTree for this project, so please suggest something that uses ElementTree.

I have a file that looks roughly like this (the chunks are separated by blank lines):

<a>
    <b>
       ....
    </b>
    <c>
       ....
    </c>
</a>
<d><c></c></d>

<a>
    <b>
       ....
    </b>
    <c>
       ....
    </c>
</a>
<d><c></c></d>

<a>
    <b>
       ....
    </b>
    <c>
       ....
    </c>
</a>
<d><c></c></d>

I know this is not valid XML, so what I am trying to do is read the whole thing as a string and add a root element to it, which would end up looking like this for each chunk:

<root>
    <a>
        <b>
           ....
        </b>
        <c>
           ....
        </c>
    </a>
    <d><c></c></d>
</root>

I want to know if there is a simple way to read the XML chunks one by one, wrap each one in a parent node, and then do the same for the next chunk, and so on.

Any help would be appreciated, thank you.

It sounds like what you really want to do is parse a sequence of XML trees, whether they are several trees in the same file or spread over multiple files.

ElementTree can't quite do that out of the box, but you can build something out of it that can.


First, there's the easy way: just put your own parser in front of etree. If your XML documents are really separated by blank lines, and there are no blank lines inside any document, this is trivial:

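import xml.etree.ElementTree as ET

# Each blank-line-separated chunk must be a complete document with one root element.
lines = []
for line in inputFile:
    if not line.strip():
        if lines:  # ignore repeated blank lines
            xml = ET.fromstringlist(lines)
            print(xml)
            lines = []
    else:
        lines.append(line)
if lines:  # last document, when the file does not end with a blank line
    xml = ET.fromstringlist(lines)
    print(xml)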
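The chunks in the question contain two top-level elements each, so they would need a wrapper element before they parse. The following is a minimal sketch of that variation, assuming the chunks contain no XML declaration; the function name `iter_wrapped_trees`, the `root_tag` parameter, and the sample data are made up for illustration.

```python
import io
import itertools
import xml.etree.ElementTree as ET

def iter_wrapped_trees(lines, root_tag="root"):
    """Yield one Element per blank-line-separated chunk, wrapped in root_tag."""
    # groupby splits the input into alternating runs of blank and non-blank lines
    for is_blank, chunk in itertools.groupby(lines, key=lambda l: not l.strip()):
        if is_blank:
            continue
        body = "".join(chunk)
        yield ET.fromstring("<{0}>{1}</{0}>".format(root_tag, body))

# Illustrative input: two chunks separated by a blank line
sample = io.StringIO(
    "<a><b>1</b></a>\n<d><c></c></d>\n\n<a><b>2</b></a>\n<d><c></c></d>\n"
)
trees = list(iter_wrapped_trees(sample))
for tree in trees:
    print(tree.tag, [child.tag for child in tree])
```

Because `groupby` consumes the lines lazily, only one chunk is held in memory at a time.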

If the "outer structure" is more complicated than this (for example, if each document begins immediately after the previous one ends, or if you need stateful information to distinguish blank lines within a tree from blank lines between trees), then this solution won't work (or, at least, it will make things harder rather than easier).

In that case, things get more fun.


Take a look at iterparse. It lets you parse a document on the fly, yielding each element when it gets to the end of the element (and even trimming the tree as you go along, if the tree is too big to fit into memory).
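As a small illustration of that behaviour, here is a sketch that runs iterparse over a single well-formed document and clears each element after use. The document contents and the `entry` tag are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative single document held in memory
source = io.BytesIO(
    b"<log><entry id='1'>first</entry><entry id='2'>second</entry></log>"
)

seen = []
# With no events argument, iterparse reports only 'end' events,
# so each element is complete when it is yielded
for event, elem in ET.iterparse(source):
    if elem.tag == "entry":
        seen.append((elem.get("id"), elem.text))
        elem.clear()  # discard the element's text, attributes and children
print(seen)
```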

The problem is that when iterparse gets to the end of the first document and finds more content after it, it will raise a ParseError and abort, instead of going on to the next document.

You can avoid that by remembering the first start element, then stopping as soon as you reach its end event. It's a bit more complicated, but not too bad. Instead of this:

for _, elem in ET.iterparse(arg):
    print(elem)

You have to do this:

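import xml.etree.ElementTree as ET

parser = ET.iterparse(arg, events=('start', 'end'))
_, start = next(parser)  # the root element of the first document
while True:
    event, elem = next(parser)
    if event == 'end':
        print(elem)
        if elem is start:  # the root has closed, so stop before the next document
            break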

(You can make that a bit more concise with filter and itertools, but I thought the explicit version would be easier to understand for someone who's never used iterparse.)

So, you can just do that in a loop until EOF, right? Well, no. The problem is that iterparse doesn't leave the read pointer at the start of the next document, and there's no way to find out where the next document starts.

So, you will need to control the file yourself and feed the data to iterparse. There are two ways to do this:


First, you can create your own file wrapper object that provides all the file-like methods that ET needs, and pass that to ET.iterparse. That way, you can keep track of how far into the file iterparse has read, and then start the next parse at that offset.

It isn't exactly documented which file-like methods iterparse needs, but as the source shows, all you need is read(size) (and you're allowed to return fewer than size bytes, just as a real file could) and close(), so that's not hard at all.
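A minimal sketch of such a wrapper follows. The class name `CountingReader` and its `bytes_read` attribute are hypothetical, and the example only demonstrates that an object with read() and close() is accepted and that the amount handed to the parser can be tracked; it does not locate the start of the next document.

```python
import io
import xml.etree.ElementTree as ET

class CountingReader:
    """Hypothetical wrapper exposing only read() and close(), counting bytes handed out."""

    def __init__(self, raw):
        self.raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        data = self.raw.read(size)
        self.bytes_read += len(data)  # how much the parser has been given so far
        return data

    def close(self):
        self.raw.close()

# Illustrative single document
payload = b"<a><b>text</b></a>"
reader = CountingReader(io.BytesIO(payload))
tags = [elem.tag for _, elem in ET.iterparse(reader)]
print(tags, reader.bytes_read)
```

Note that iterparse asks for data in large chunks, so the count tells you how much was consumed, which may be past the end of the first document. Returning smaller pieces from read() would give a finer-grained offset.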


Alternatively, you can drop down a level and use an ET.XMLParser directly. That sounds scary, but it's not that bad: look how short iterparse's source is, and how little of what it does you actually need.

Anyway, it comes down to something like this (a sketch, not tested):

import xml.etree.ElementTree as ET

class Target(object):
    """Parser target that records the tree once the top-level element closes."""
    def __init__(self):
        self.depth = 0  # nesting depth, so a nested tag with the root's name is not mistaken for the root
        self.builder = ET.TreeBuilder()
        self.tree = None
    def start(self, tag, attrib):
        self.depth += 1
        return self.builder.start(tag, attrib)
    def end(self, tag):
        ret = self.builder.end(tag)
        self.depth -= 1
        if self.depth == 0:  # the top-level element has just closed
            self.tree = self.builder.close()
            return self.tree
        return ret
    def data(self, data):
        return self.builder.data(data)
    def close(self):
        if self.tree is None:
            self.tree = self.builder.close()
        return self.tree

parser = None
for line in inputFile:
    if parser is None:
        # start a fresh parser for each tree
        target = Target()
        parser = ET.XMLParser(target=target)
    parser.feed(line)
    if target.tree is not None:  # an Element with no children is falsy, so compare with None
        do_stuff_with(target.tree)  # placeholder for your own processing
        parser = None

Just build a string that wraps the file contents in an opening and closing root tag:

with open('yourfile') as fin:
    xml_data = '<{0}>{1}</{0}>'.format('rootnode', fin.read())

Then use ET.fromstring(xml_data).
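Put together, a self-contained sketch of this approach might look like the following. An in-memory `io.StringIO` stands in for the real file, and its contents are illustrative. Note that this produces one big tree holding every chunk, not one tree per chunk.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for open('yourfile'); the contents are illustrative
fin = io.StringIO(
    "<a><b>x</b></a>\n<d><c></c></d>\n\n<a><b>y</b></a>\n<d><c></c></d>\n"
)
xml_data = "<{0}>{1}</{0}>".format("rootnode", fin.read())
root = ET.fromstring(xml_data)

# Every top-level element from the file becomes a child of the new root
print(root.tag, [child.tag for child in root])
```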

The problem here is pretty simple.

ET.parse takes a filename (or file object). But you're passing it a list of lines. That's not a filename. The reason you get this error:

TypeError: coercing to Unicode: need string or buffer, list found

… is that it's trying to use your list as if it were a string, which doesn't work.

When you've already read the file in, you can use ET.fromstring. However, you have to read it into a string, not a list of strings. For example:

import xml.etree.ElementTree as ET

def readXML(inputFile):  # inputFile is sys.stdin
    f = '<XML>' + inputFile.read() + '</XML>'
    newXML = ET.fromstring(f)  # returns the root Element, not an ElementTree
    print(newXML.tag)

Or, if you're using Python 3.2 or later, you can use ET.fromstringlist, which takes a sequence of strings, which is exactly what you have.
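As a rough sketch of the fromstringlist variant, assuming the lines have already been read into a list (the sample lines and the `root` wrapper tag are illustrative):

```python
import xml.etree.ElementTree as ET

# Illustrative lines, as they might come from reading a file line by line
lines = ["<a><b>x</b></a>\n", "<d><c></c></d>\n"]

# Add the wrapper tags as extra list items instead of joining the strings
root = ET.fromstringlist(["<root>"] + lines + ["</root>"])
print(root.tag, len(root))
```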


From your side issue: 从您的角度来看:

Another problem that I just realized while typing this is that my input file has multiple inputs. Say, at least 10 copies of the first XML that I wrote. If I do readlines(), isn't that going to read the whole file?

Yes, it will. There's never any good reason to use readlines().

But I'm not sure why that's a problem here.

If you're trying to combine a forest of 10 trees into one big tree, you pretty much have to read the whole thing in, right?

Unless you change the way you do things. The easy way to do this is to put your own trivial parser (something that splits the file on blank lines) in front of ET. For example:

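import xml.etree.ElementTree as ET
while True:
    lines = []
    for line in iter(inputFile.readline, ''):  # readline returns '' at EOF
        if line.strip():
            lines.append(line)
        elif lines:  # a blank line after some content ends this document
            break
    if not lines:  # nothing left to read
        break
    xml = ET.fromstringlist(lines)
    # do stuff with this tree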

You have multiple XML fragments that are separated by blank lines. To make each fragment a well-formed XML document, you need at least to wrap it in a root element. Building on the fromstringlist code example from @abarnert's answer:

from xml.etree.ElementTree import XMLParser  # cElementTree was removed in Python 3.9

def parse_multiple(lines):
    lines = iter(lines)            # accept lists as well as file objects
    for line in lines:
        parser = XMLParser()
        parser.feed("<root>")      # start of xml document
        while line.strip():        # while non-blank line
            parser.feed(line)      # continue xml document
            line = next(lines, "") # get next line
        parser.feed("</root>")     # end of xml document
        yield parser.close() # yield root Element of the xml tree

It yields XML trees (their root elements).

Example:

import sys
import xml.etree.ElementTree as etree

for root in parse_multiple(sys.stdin):
    etree.dump(root)
