Parse large python xml using xmltree

I have a Python script that parses huge XML files (the largest one is 446 MB).

    try:
        parser = etree.XMLParser(encoding='utf-8')
        tree = etree.parse(os.path.join(srcDir, fileName), parser)
        root = tree.getroot()
    except Exception as e:
        print("Error parsing file " + str(fileName) + " Reason " + str(e))

    for child in root:
        if "PersonName" in child.tag:
            personName = child.text

This is what my XML looks like:

<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
  <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
     <Description>myData</Description>
     <Identifier>43hhjh87n4nm</Identifier>
  </Aliases>
  <RollNo uom="kPa">39979172.201167159</RollNo>
  <PersonName>Miracle Smith</PersonName>
  <Date>2017-06-02T01:10:32-05:00</Date>
....

All I want to do is get the contents of the PersonName tag, that's all. I don't care about the other tags.

Sadly, my files are huge and I keep getting these errors when I use the code above:

Error parsing file 2eb6d894-0775-e611.xml Reason unknown error, line 1, column 310915857
Error parsing file 2ecc18b5-ef41-e711-80f.xml Reason Extra content at the end of the document, line 1, column 3428182
Error parsing file 2f0d6926-b602-e711-80f4-005.xml Reason Extra content at the end of the document, line 1, column 6162118
Error parsing file 2f12636b-b2f5-e611-80f3-00.xml Reason Extra content at the end of the document, line 1, column 8014679
Error parsing file 2f14e35a-d22b-4504-8866-.xml Reason Extra content at the end of the document, line 1, column 8411238
Error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml Reason Extra content at the end of the document, line 1, column 7636614
Error parsing file 3a1a3806-b6af-e611-80ef-00505.xml Reason Extra content at the end of the document, line 1, column 11032486

My XML is perfectly fine and has no extra content. It seems that parsing the large files causes the error. I have looked at iterparse(), but it seems too complex for what I want to achieve, as it provides parsing of the whole DOM while I just want that one tag under the root. Also, it does not give me a good sample of how to get the correct value by tag name.

Should I use a regex parse or a grep/awk approach to do this? Or is there a tweak to my code that will let me get the person name from these huge files?

UPDATE: I tried this sample and it seems to print everything in the XML except my tag.

Does iterparse read from the bottom to the top of the file? In that case it will take a long time to get to the top, i.e. my PersonName tag. I tried changing the line below to read end to start, events=("end", "start"), and it does the same thing!

path = []
for event, elem in ET.iterparse('D:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        print(elem.text)  # prints whole world
        if elem.tag == 'PersonName':
            print(elem.text)
        path.pop()

iterparse is not that difficult to use in this case.

temp.xml is the file presented in your question with a </MyRoot> added as a line at the end.

Think of the source = line as boilerplate, if you will, that parses the XML file and returns chunks of it element by element, indicating whether the chunk is the 'start' of an element or the 'end' and supplying information about the element.
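To make that event stream concrete, here is a small sketch (not part of the original answer) that feeds a cut-down copy of the document to iterparse and records each event. The helper name `list_events` and the shortened XML are illustrative assumptions. It suggests two things: events arrive in document order starting from the top of the file, and each tag carries its namespace in braces, which is why comparing against the bare string 'PersonName' never matches.

```python
from io import BytesIO
from xml.etree import ElementTree

# Cut-down, hypothetical version of the document from the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2">
  <RollNo uom="kPa">39979172.201167159</RollNo>
  <PersonName>Miracle Smith</PersonName>
</MyRoot>"""


def list_events(xml_bytes):
    """Return (event, tag) pairs in the order iterparse yields them."""
    return [
        (event, elem.tag)
        for event, elem in ElementTree.iterparse(
            BytesIO(xml_bytes), events=("start", "end")
        )
    ]


events = list_events(SAMPLE)
for event, tag in events:
    # Each tag prints as '{namespace}LocalName'.
    print(event, tag)
```

The first pair is the 'start' of MyRoot and the last is its 'end', with RollNo and PersonName reported in between in the order they appear in the file.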

In this case we need to consider only the 'start' events. We watch for the 'PersonName' tags and pick up their text. Having found the one and only such item in the XML file, we abandon the processing.

>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
... 
'Miracle Smith'
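One caveat about the session above: the ElementTree documentation says that on a 'start' event only the tag and attributes are guaranteed to be available, and `text` may not have been read yet. A more defensive variant, sketched below, waits for the 'end' event instead. The function name `first_text` and its parameters are hypothetical, not from the original answer. It accepts a file path or a binary file object and stops reading as soon as the first match is found.

```python
from io import BytesIO
from xml.etree import ElementTree


def first_text(source, local_name):
    """Return the text of the first element whose local tag name matches.

    source may be a file path or a binary file object. Returns None when
    no such element exists.
    """
    for _event, elem in ElementTree.iterparse(source, events=("end",)):
        # Tags look like '{namespace}PersonName'; keep the part after '}'.
        if elem.tag.rpartition("}")[2] == local_name:
            return elem.text
        # Drop the finished element's content to keep memory use low.
        elem.clear()
    return None


# Usage with an in-memory document; a path such as 'temp.xml' works the same way.
sample = BytesIO(
    b'<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2">'
    b"<PersonName>Miracle Smith</PersonName></MyRoot>"
)
print(first_text(sample, "PersonName"))
```

Because the function returns on the first match, only the part of the file up to the PersonName element is read, which matters for files of several hundred megabytes.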

Edit, in response to a question in a comment:

Normally you wouldn't do this, since iterparse is intended for use with large chunks of XML. However, by wrapping a string in a StringIO object, it can be processed with iterparse.

>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
...   <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
...      <Description>myData</Description>
...      <Identifier>43hhjh87n4nm</Identifier>
...   </Aliases>
...   <RollNo uom="kPa">39979172.201167159</RollNo>
...   <PersonName>Miracle Smith</PersonName>
...   <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...     
'Miracle Smith'
