Parsing nested XML data in Python using ElementTree

Question

Updates: Towards a Solution

This code:

import pprint
import xml.etree.ElementTree as ET

tree = ET.parse(assetsfilename)  # assetsfilename: path to the XML file
root = tree.getroot()
assets = {}

def find_rows(rowset, container):
    for row in rowset.findall("row"):
        singleton = int(row.get('singleton'))
        flag = int(row.get('flag'))
        quantity = int(row.get('quantity'))
        typeID = int(row.get('typeID'))
        # Nested "contents" rows carry no locationID, so default it to 0.
        locationID = int(row.get('locationID', '0'))
        itemID = int(row.get('itemID'))
        dkey = (singleton, flag, quantity, typeID, locationID, itemID)

        # Each row maps to a dict of its child rows, keyed the same way.
        container[dkey] = {}
        child_rowset = row.find("rowset")
        if child_rowset is not None:
            find_rows(child_rowset, container[dkey])

first_rowset = root.find('.//rowset[@name="assets"]')
find_rows(first_rowset, assets)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(assets)

Gives this output:

{   (0, 4, 1, 3317, 61000419, 1000913922710L): {   },
    (0, 4, 1, 6159, 60003463, 1007025519384L): {   },
    (0, 4, 1, 7669, 60000361, 1007215573625L): {   },
    (0, 4, 1, 23566, 61000419, 1000992661686L): {   },
    (1, 4, 1, 51, 60001345, 1004073218074L): {   },
    (1, 4, 1, 51, 60001345, 1004073218075L): {   },
    (1, 4, 1, 596, 60003337, 1007908184113L): {   (0, 5, 1, 34, 0, 1007908184132L): {   },
                                                  (1, 27, 1, 3634, 0, 1007908184129L): {   },
                                                  (1, 28, 1, 3651, 0, 1007908184130L): {   }},
    (1, 4, 1, 3766, 61000419, 1000973178550L): {   (0, 5, 25, 16273, 0, 1000973188870L): {   },
                                                   (1, 27, 1, 21096, 0, 1000687293796L): {   }}}

This basically adds a nested dict as the value of each entry in the dict I already had and fills it with the data from that row's children, if any. Ideally, though, both the parent and the child data would sit in the main dict, and an extra field at the end of each entry would hold the parent's itemID (if the row is a child) or be empty (if the row is a parent or has no children).
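One way to get from the nested dict to the flat layout described here is to post-process it. The following is only a sketch under stated assumptions: the helper name flatten and the small hand-written nested_assets excerpt are illustrative and not part of the original code, and it is written for Python 3 (so there are no L suffixes on the integers).

```python
def flatten(nested, parent_item_id=None):
    """Yield one flat tuple per row, with the parent's itemID appended."""
    for key, children in nested.items():
        # key is (singleton, flag, quantity, typeID, locationID, itemID)
        yield key + (parent_item_id,)
        # key[-1] is this row's itemID; it becomes the parent of its children
        for child in flatten(children, key[-1]):
            yield child

# Hand-written excerpt (an assumption) in the shape find_rows() produces.
nested_assets = {
    (1, 4, 1, 596, 60003337, 1007908184113): {
        (0, 5, 1, 34, 0, 1007908184132): {},
        (1, 27, 1, 3634, 0, 1007908184129): {},
    },
    (0, 4, 1, 7669, 60000361, 1007215573625): {},
}

# Sort by itemID (second-to-last field) only to make the printout stable.
flat_rows = sorted(flatten(nested_assets), key=lambda r: r[-2])
for flat_row in flat_rows:
    print(flat_row)
```

Top-level rows end in None, while child rows end in the itemID of the row that contains them.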

The Question

I am trying to read the data from a nested .xml file into some sort of dictionary so that I can output it in other formats (my current goal is sqlite3 and an sqlite .db file, but that isn't the point of my question). I can read all of the top level of the data, but I can't figure out how to also read the nested data (if present).

The Data

Here is a sample .xml file:

<?xml version='1.0' encoding='UTF-8'?>
<eveapi version="2">
  <currentTime>2012-11-14 03:26:35</currentTime>
  <result>
    <rowset name="assets" key="itemID" columns="itemID,locationID,typeID,quantity,flag,singleton">
      <row itemID="1007215573625" locationID="60000361" typeID="7669" quantity="1" flag="4" singleton="0" />
      <row itemID="1004073218074" locationID="60001345" typeID="51" quantity="1" flag="4" singleton="1" rawQuantity="-1" />
      <row itemID="1004073218075" locationID="60001345" typeID="51" quantity="1" flag="4" singleton="1" rawQuantity="-1" />
      <row itemID="1007908184113" locationID="60003337" typeID="596" quantity="1" flag="4" singleton="1" rawQuantity="-1">
        <rowset name="contents" key="itemID" columns="itemID,typeID,quantity,flag,singleton">
          <row itemID="1007908184129" typeID="3634" quantity="1" flag="27" singleton="1" rawQuantity="-1" />
          <row itemID="1007908184130" typeID="3651" quantity="1" flag="28" singleton="1" rawQuantity="-1" />
          <row itemID="1007908184132" typeID="34" quantity="1" flag="5" singleton="0" />
        </rowset>
      </row>
      <row itemID="1007025519384" locationID="60003463" typeID="6159" quantity="1" flag="4" singleton="0" />
      <row itemID="1000913922710" locationID="61000419" typeID="3317" quantity="1" flag="4" singleton="0" />
      <row itemID="1000973178550" locationID="61000419" typeID="3766" quantity="1" flag="4" singleton="1" rawQuantity="-1">
        <rowset name="contents" key="itemID" columns="itemID,typeID,quantity,flag,singleton">
          <row itemID="1000687293796" typeID="21096" quantity="1" flag="27" singleton="1" rawQuantity="-1" />
          <row itemID="1000973188870" typeID="16273" quantity="25" flag="5" singleton="0" />
        </rowset>
      </row>
      <row itemID="1000992661686" locationID="61000419" typeID="23566" quantity="1" flag="4" singleton="0" />
    </rowset>
  </result>
  <cachedUntil>2012-11-14 07:05:29</cachedUntil>
</eveapi>

Note that some items have child items nested under them and some don't, and that the number of children (if any) is not fixed: one item can have 3 children and another 2, while many others have none at all.
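The irregular nesting can be checked directly with ElementTree. This is a minimal sketch, assuming a trimmed two-item excerpt of the sample file held in a string named xml_data (both the trimming and the variable names are illustrative); it counts the direct children of each top-level row.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the sample file (assumption: same layout as the full data).
xml_data = """<eveapi version="2">
  <result>
    <rowset name="assets" key="itemID" columns="itemID,locationID,typeID,quantity,flag,singleton">
      <row itemID="1007215573625" locationID="60000361" typeID="7669" quantity="1" flag="4" singleton="0" />
      <row itemID="1007908184113" locationID="60003337" typeID="596" quantity="1" flag="4" singleton="1" rawQuantity="-1">
        <rowset name="contents" key="itemID" columns="itemID,typeID,quantity,flag,singleton">
          <row itemID="1007908184129" typeID="3634" quantity="1" flag="27" singleton="1" rawQuantity="-1" />
          <row itemID="1007908184130" typeID="3651" quantity="1" flag="28" singleton="1" rawQuantity="-1" />
          <row itemID="1007908184132" typeID="34" quantity="1" flag="5" singleton="0" />
        </rowset>
      </row>
    </rowset>
  </result>
</eveapi>"""

root = ET.fromstring(xml_data)

child_counts = {}
for row in root.findall('.//rowset[@name="assets"]/row'):
    # 'rowset/row' matches only rows inside this row's own nested rowset.
    children = row.findall('rowset/row')
    child_counts[int(row.get('itemID'))] = len(children)

print(child_counts)
```

A row without a nested rowset simply yields an empty list, so no special case is needed for childless items.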

(For those curious, this data comes from the full id key Asset List API pull from the online game EVE Online.)

What I Can Get

I can get this code:

import xml.etree.ElementTree as ET

tree = ET.parse(assetsfilename)
root = tree.getroot()

singleton = []
flag = []
quantity = []
typeID = []
locationID = []
itemID = []
assets = {}
for row in root.findall(".//*[@name='assets']/row"):
    singleton.append(int(row.get('singleton')))
    flag.append(int(row.get('flag')))
    quantity.append(int(row.get('quantity')))
    typeID.append(int(row.get('typeID')))
    locationID.append(int(row.get('locationID')))
    itemID.append(int(row.get('itemID')))
assets = zip(singleton, flag, quantity, typeID, locationID, itemID)
print singleton, flag, quantity, typeID, locationID, itemID
print assets

To output this on the screen:

[0, 1, 1, 1, 0, 0, 1, 0] [4, 4, 4, 4, 4, 4, 4, 4] [1, 1, 1, 1, 1, 1, 1, 1] [7669, 51, 51, 596, 6159, 3317, 3766, 23566] [60000361, 60001345, 60001345, 60003337, 60003463, 61000419, 61000419, 61000419] [1007215573625L, 1004073218074L, 1004073218075L, 1007908184113L, 1007025519384L, 1000913922710L, 1000973178550L, 1000992661686L]
[(0, 4, 1, 7669, 60000361, 1007215573625L), (1, 4, 1, 51, 60001345, 1004073218074L), (1, 4, 1, 51, 60001345, 1004073218075L), (1, 4, 1, 596, 60003337, 1007908184113L), (0, 4, 1, 6159, 60003463, 1007025519384L), (0, 4, 1, 3317, 61000419, 1000913922710L), (1, 4, 1, 3766, 61000419, 1000973178550L), (0, 4, 1, 23566, 61000419, 1000992661686L)]

Note that this reads all the top-level lines that start with <row itemID= but doesn't get the nested lines (which I would also like to show, ideally tied somehow to the parent itemID above them).
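The reason only top-level rows come back is that the path ".//*[@name='assets']/row" selects just the direct row children of the assets rowset. As a rough sketch (assuming a trimmed excerpt of the sample in a string named xml_data, and Python 2.7 or later), root.iter('row') reaches rows at every depth, and a child-to-parent map can recover which row each nested row belongs to. The names parent_of and parent_ids are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the sample file (assumption: same layout as the full data).
xml_data = """<eveapi version="2">
  <result>
    <rowset name="assets" key="itemID">
      <row itemID="1007215573625" locationID="60000361" typeID="7669" quantity="1" flag="4" singleton="0" />
      <row itemID="1007908184113" locationID="60003337" typeID="596" quantity="1" flag="4" singleton="1">
        <rowset name="contents" key="itemID">
          <row itemID="1007908184129" typeID="3634" quantity="1" flag="27" singleton="1" />
          <row itemID="1007908184130" typeID="3651" quantity="1" flag="28" singleton="1" />
          <row itemID="1007908184132" typeID="34" quantity="1" flag="5" singleton="0" />
        </rowset>
      </row>
    </rowset>
  </result>
</eveapi>"""

root = ET.fromstring(xml_data)

# Direct children of the "assets" rowset only.
top_level = root.findall(".//*[@name='assets']/row")
# Every <row> element in the document, at any depth.
all_rows = list(root.iter('row'))

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

parent_ids = {}
for row in all_rows:
    # A row's parent is its rowset; the rowset's parent is either the
    # containing <row> (nested case) or <result> (top-level case).
    owner = parent_of[parent_of[row]]
    is_nested = owner.tag == 'row'
    parent_ids[int(row.get('itemID'))] = int(owner.get('itemID')) if is_nested else None

print(len(top_level), len(all_rows))
print(parent_ids)
```

This keeps the original flat loop style; the recursive approach in the answer below avoids the extra parent map.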

Desired Output

I'm somewhat open to suggestions here, but this is one option. I could parse the top-level rows into a dict (as I already do) and then create another dict that holds the data from the sub-level rows plus an extra field noting which itemID each one is a child of. Another option would be to add the data from the sub-level rows to the main dict that I can already build, and add an extra field that is something like Null or None for items without a parent and holds the parent's itemID for items that have one.
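The first option (two separate dicts) might look roughly like the sketch below. It assumes a trimmed excerpt of the sample in xml_data and only one level of nesting, as in the sample; deeper nesting would need recursion. The names int_attrs, top_level, and contents are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the sample file (assumption: same layout as the full data).
xml_data = """<eveapi version="2">
  <result>
    <rowset name="assets" key="itemID">
      <row itemID="1007215573625" locationID="60000361" typeID="7669" quantity="1" flag="4" singleton="0" />
      <row itemID="1007908184113" locationID="60003337" typeID="596" quantity="1" flag="4" singleton="1">
        <rowset name="contents" key="itemID">
          <row itemID="1007908184129" typeID="3634" quantity="1" flag="27" singleton="1" />
          <row itemID="1007908184130" typeID="3651" quantity="1" flag="28" singleton="1" />
          <row itemID="1007908184132" typeID="34" quantity="1" flag="5" singleton="0" />
        </rowset>
      </row>
    </rowset>
  </result>
</eveapi>"""

def int_attrs(row):
    """Return all XML attributes of a row as a dict of ints."""
    return {name: int(value) for name, value in row.attrib.items()}

root = ET.fromstring(xml_data)

top_level = {}  # itemID -> attributes of top-level rows
contents = {}   # itemID -> attributes of nested rows, plus 'parentID'

for row in root.findall('.//rowset[@name="assets"]/row'):
    attrs = int_attrs(row)
    item_id = attrs.pop('itemID')
    top_level[item_id] = attrs
    # Assumes a single level of nesting, which is all the sample shows.
    for child in row.findall('rowset/row'):
        child_attrs = int_attrs(child)
        child_id = child_attrs.pop('itemID')
        child_attrs['parentID'] = item_id
        contents[child_id] = child_attrs

print(top_level)
print(contents)
```

Keeping the nested rows in their own dict means the top-level dict is unchanged from what the flat loop already produces, at the cost of two lookups when an itemID could be in either.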

Answer 1

This snippet (somewhat larger) recursively parses the XML structure into nested dictionaries, as in the possible solution you described. It works with the sample you provided, and I think it will work with live data as well. If nothing else, you can use the idea.

UPDATE: OK, this updated version stores itemID as the key and adds parentID as an additional dict entry. Check whether that's the desired behavior:

import xml.etree.ElementTree as ET

# xml_data is assumed to hold the sample XML above as a string.
root = ET.fromstring(xml_data)

assets = {}

def find_rows(rowset, parent_id):
    for row in rowset.findall("row"):
        singleton = int(row.get('singleton'))
        flag = int(row.get('flag'))
        quantity = int(row.get('quantity'))
        typeID = int(row.get('typeID'))
        # Nested "contents" rows carry no locationID, so default it to 0.
        locationID = int(row.get('locationID', '0'))
        itemID = int(row.get('itemID'))

        assets[itemID] = {'singleton': singleton,
                          'flag': flag,
                          'quantity': quantity,
                          'typeID': typeID,
                          'locationID': locationID,
                          'parentID': parent_id}
        # Recurse into this row's nested rowset, passing its itemID as the parent.
        child_rowset = row.find("rowset")
        if child_rowset is not None:
            find_rows(child_rowset, itemID)

first_rowset = root.find('.//rowset[@name="assets"]')
find_rows(first_rowset, None)
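Since the question mentions sqlite3 as the eventual target, here is a hedged sketch of how a flat dict of this shape could be loaded into a table and queried by parent. The table name, column layout, and the small hand-written assets excerpt are assumptions for illustration only.

```python
import sqlite3

# Hand-written excerpt (an assumption) in the shape find_rows() produces.
assets = {
    1007908184113: {'singleton': 1, 'flag': 4, 'quantity': 1, 'typeID': 596,
                    'locationID': 60003337, 'parentID': None},
    1007908184129: {'singleton': 1, 'flag': 27, 'quantity': 1, 'typeID': 3634,
                    'locationID': 0, 'parentID': 1007908184113},
    1007908184132: {'singleton': 0, 'flag': 5, 'quantity': 1, 'typeID': 34,
                    'locationID': 0, 'parentID': 1007908184113},
}

conn = sqlite3.connect(':memory:')  # use a file path for a real .db file
conn.execute("""
    CREATE TABLE assets (
        itemID     INTEGER PRIMARY KEY,
        parentID   INTEGER,
        locationID INTEGER,
        typeID     INTEGER,
        quantity   INTEGER,
        flag       INTEGER,
        singleton  INTEGER
    )
""")

# Named placeholders pull each value from a per-row dict; None becomes NULL.
conn.executemany(
    "INSERT INTO assets VALUES "
    "(:itemID, :parentID, :locationID, :typeID, :quantity, :flag, :singleton)",
    [dict(row, itemID=item_id) for item_id, row in assets.items()],
)
conn.commit()

# All items contained in a given parent.
cursor = conn.execute(
    "SELECT itemID FROM assets WHERE parentID = ? ORDER BY itemID",
    (1007908184113,),
)
child_ids = [record[0] for record in cursor]
print(child_ids)
```

Storing parentID as a nullable column keeps parents and children in one table, which matches the second option described in the question.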

Solution 1
1 Accepted 2012-11-14 07:14:02
