Parsing multiple child elements in Python

Question

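I am trying to parse XML and store the element values in an object. The problem is that the child elements repeat, so I am not sure what the best practice is for iterating over them and storing the values.

My idea is to look at a child element and add a counter. The counter would then be used to create an undetermined number of object containers to hold the values. Would this work, or is there a better way?

Here is an example of my class:

class SODOCUMENTITEMS:

    def __init__(self):
        self.recordno = ''
        self.dochdrno = ''
        self.docid = ''

Here is a sample of my XML: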

`<sotransitems>
  <sotransitem>
    <recordno>40562</recordno>
    <dochdrno>16987</dochdrno>
    <docid/>
    <bundlenumber/>
    <itemid>13</itemid>
    <itemdesc>Winter Lager</itemdesc>
    <line_no>0</line_no>
    <warehouseid>Main</warehouseid>
    <quantity>1</quantity>
    <unit>Each</unit>
    <price>4.99</price>
    <retailprice>4.99</retailprice>
    <totalamount>4.99</totalamount>
    <taxrate/>
    <tax/>
    <grossamount/>
    <locationid/>
    <departmentid/>
    <memo/>
    <discsurchargememo/>
    <revrectemplate/>
    <revrecstartdate>
      <year></year>
      <month></month>
      <day></day>
    </revrecstartdate>
    <revrecenddate>
      <year></year>
      <month></month>
      <day></day>
    </revrecenddate>
    <renewalmacro/>
    <currency>USD</currency>
    <exchratedate>
      <year></year>
      <month></month>
      <day></day>
    </exchratedate>
    <exchratetype/>
    <exchrate>1</exchrate>
    <trx_price>4.99</trx_price>
    <trx_value>4.99</trx_value>
    <projectid/>
    <customerid>2--2</customerid>
    <vendorid/>
    <employeeid/>
    <classid/>
    <contractid/>
    <taskno/>
    <billingtemplate/>
    <sourcedocumentid/>
    <sourcedocumentkey/>
    <sourcedocumententrytkey/>
    <discountpercent/>
    <linesubtotals/>
    <customfields>
      <customfield>
        <customfieldname>TESTCUSTOM</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_NUMBER</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>NUMBER1</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_DATE</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>MYTESTFIELD</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TESTBOX1</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMUSERDEFINEDDEMTSS</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMA</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMA777</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMAA5678</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMSITE</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
    </customfields>
  </sotransitem>
  <sotransitem>
    <recordno>40563</recordno>
    <dochdrno>16987</dochdrno>
    <docid/>
    <bundlenumber/>
    <itemid>12</itemid>
    <itemdesc>Loktar</itemdesc>
    <line_no>1</line_no>
    <warehouseid>Main</warehouseid>
    <quantity>1</quantity>
    <unit>Each</unit>
    <price>90</price>
    <retailprice>90</retailprice>
    <totalamount>90</totalamount>
    <taxrate/>
    <tax/>
    <grossamount/>
    <locationid/>
    <departmentid>fail</departmentid>
    <memo/>
    <discsurchargememo/>
    <revrectemplate/>
    <revrecstartdate>
      <year></year>
      <month></month>
      <day></day>
    </revrecstartdate>
    <revrecenddate>
      <year></year>
      <month></month>
      <day></day>
    </revrecenddate>
    <renewalmacro/>
    <currency>USD</currency>
    <exchratedate>
      <year></year>
      <month></month>
      <day></day>
    </exchratedate>
    <exchratetype/>
    <exchrate>1</exchrate>
    <trx_price>90</trx_price>
    <trx_value>90</trx_value>
    <projectid/>
    <customerid>2--2</customerid>
    <vendorid/>
    <employeeid/>
    <classid/>
    <contractid/>
    <taskno/>
    <billingtemplate/>
    <sourcedocumentid/>
    <sourcedocumentkey/>
    <sourcedocumententrytkey/>
    <discountpercent/>
    <linesubtotals/>
    <customfields>
      <customfield>
        <customfieldname>TESTCUSTOM</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_NUMBER</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>NUMBER1</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_DATE</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>MYTESTFIELD</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TESTBOX1</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMUSERDEFINEDDEMTSS</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMA</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMA777</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMAA5678</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>GLDIMSITE</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
    </customfields>
  </sotransitem>
</sotransitems>`

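I am just looking for a small sample or advice on how best to parse this and store each set in an object. Any information would help, and I will do further research based on your feedback.

Thanks!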

Answer 1

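The main approaches to parsing XML data are:

DOM parsers.
They load the complete XML file into memory and build a DOM (Document Object Model). This lets the programmer navigate the document or retrieve data from it using many convenient techniques (e.g. XPath, XSLT transformations, XML Schema to class conversion). The drawback is that it can require a lot of memory and can be slow (depending on the parser, the DOM model, indexing within the DOM, ...).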
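As a small illustration of the kind of navigation a fully loaded tree allows, here is a sketch using the limited XPath subset supported by the standard library's ElementTree. The inline SAMPLE document is a trimmed stand-in for the question's XML, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the XML in the question (most fields omitted).
SAMPLE = """
<sotransitems>
  <sotransitem>
    <recordno>40562</recordno>
    <itemid>13</itemid>
    <itemdesc>Winter Lager</itemdesc>
  </sotransitem>
  <sotransitem>
    <recordno>40563</recordno>
    <itemid>12</itemid>
    <itemdesc>Loktar</itemdesc>
  </sotransitem>
</sotransitems>
"""

root = ET.fromstring(SAMPLE)

# Path query: every <itemdesc> under every <sotransitem>, in document order.
descriptions = [node.text for node in root.findall("./sotransitem/itemdesc")]

# Predicate query: the <sotransitem> whose <itemid> child has the text "13".
lager = root.find("./sotransitem[itemid='13']")
lager_recordno = lager.findtext("recordno")

print(descriptions)
print(lager_recordno)
```

Full XPath, XSLT and schema support need a library such as lxml; ElementTree only covers simple paths and predicates like the ones above.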

In the example I removed some fields from sotransitem and customfields for simplicity.

Example:

Class definition:

class Sotransitem:

    def __init__( self ):
        # instance attributes; customfields maps custom field name -> value
        self.recordno = None
        self.unit = None
        self.customfields = {}

    def __repr__( self ):
        return "Item( rec_no: {rec}, fields: {fields} )".format( rec=self.recordno,
                                                                 fields=str( self.customfields ) )

Here I will use the standard Python library, but you should also look at other libraries. As far as I know, the most popular are lxml and BeautifulSoup.

The actual parser:

import xml.etree.ElementTree as ET

tree = ET.parse( 'test.xml' )
root = tree.getroot()

all_items = []

for node in root.findall( 'sotransitem' ):
    item = Sotransitem()
    item.recordno = int( node.find( 'recordno' ).text )
    item.unit = node.find('unit').text

    for custom_node in node.findall('./customfields/customfield'):
        value = custom_node.find('customfieldvalue').text
        name = custom_node.find('customfieldname').text
        item.customfields[ name ] = value

    all_items.append( item )

print( all_items )
# With the trimmed-down XML this prints something like:
# [Item( rec_no: 40562, fields: {'TEST_NUMBER': None, 'TESTCUSTOM': 'true'} ), Item( rec_no: 40563, fields: {'TESTCUSTOM': 'true', 'NUMBER1': None} )]

This covers most of my needs, but with an XML schema it would be even simpler. See lxml's schema validation examples.
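Short of a schema-driven binding, one way to avoid writing an assignment for every field is to walk the children of each sotransitem generically. The sketch below is not the schema approach mentioned above, only a plain ElementTree stand-in. The item_to_dict helper and the inline SAMPLE document are hypothetical, and it assumes that only customfields needs special handling.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the XML in the question.
SAMPLE = """
<sotransitems>
  <sotransitem>
    <recordno>40562</recordno>
    <docid/>
    <unit>Each</unit>
    <revrecstartdate><year></year><month></month><day></day></revrecstartdate>
    <customfields>
      <customfield>
        <customfieldname>TESTCUSTOM</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_NUMBER</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
    </customfields>
  </sotransitem>
</sotransitems>
"""


def item_to_dict(node):
    """Hypothetical helper: flatten one <sotransitem> into a plain dict."""
    record = {}
    for child in node:
        if child.tag == "customfields":
            # name/value pairs become a nested dict; empty values become None
            record["customfields"] = {
                cf.findtext("customfieldname"): cf.findtext("customfieldvalue") or None
                for cf in child.findall("customfield")
            }
        elif len(child):
            # nested element such as <revrecstartdate> with year/month/day
            record[child.tag] = {sub.tag: sub.text for sub in child}
        else:
            # leaf element; .text is None for empty elements like <docid/>
            record[child.tag] = child.text
    return record


root = ET.fromstring(SAMPLE)
records = [item_to_dict(node) for node in root.findall("sotransitem")]
print(records[0])
```

All values stay strings (or None), so any conversion to int, float or date would still have to be done by the caller.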

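SAX parsers. They read through the XML sequentially, and whenever a tag is found (a start or end tag) they fire an event carrying the tag and, for a closing tag, its data. Once an event has been reported, a SAX parser usually discards almost all of that information (it does keep a few things, e.g. the list of elements that have not been closed yet).
Pros: a SAX parser needs a roughly constant amount of RAM, far less than DOM.
Cons: most XML technologies cannot be used with it.

Example: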

import xml.etree.ElementTree as ET

all_items = []

# iterparse yields (event, element) pairs; the first pair is the start of the root element
nodes_parser = ET.iterparse( 'test.xml', events=["start", "end"] )
event, root = next( nodes_parser )

item = None

for event, node in nodes_parser:
    if event == "start" and node.tag == "sotransitem":
        # a new item begins: store the previous one, if any
        if item is not None:
            all_items.append( item )
        item = Sotransitem()
        sotrans_node = node

    elif event == "end":
        tag = node.tag
        if tag == "recordno":
            item.recordno = int( node.text )
        elif tag == "unit":
            item.unit = node.text

        elif tag == 'customfield':
            value = node.find('customfieldvalue').text
            name = node.find('customfieldname').text
            item.customfields[ name ] = value

        sotrans_node.clear() # otherwise finished children are kept in memory until the "end" event of "sotransitem"
    else:
        sotrans_node.clear()
    root.clear() # same as above, but for the root

if item is not None:
    all_items.append( item )

print( all_items )
# same result as before
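The loop above relies on ElementTree's iterparse, which is event-driven but still builds element objects. For comparison, a handler-based version using the standard library's xml.sax module might look like the sketch below. The inline SAMPLE document is a trimmed stand-in for the question's XML, items are collected as plain dicts rather than Sotransitem objects, and the ItemHandler class and its attribute names are made up for illustration.

```python
import xml.sax

# Trimmed-down stand-in for the XML in the question.
SAMPLE = b"""<sotransitems>
  <sotransitem>
    <recordno>40562</recordno>
    <unit>Each</unit>
    <customfields>
      <customfield>
        <customfieldname>TESTCUSTOM</customfieldname>
        <customfieldvalue>true</customfieldvalue>
      </customfield>
      <customfield>
        <customfieldname>TEST_NUMBER</customfieldname>
        <customfieldvalue></customfieldvalue>
      </customfield>
    </customfields>
  </sotransitem>
  <sotransitem>
    <recordno>40563</recordno>
    <unit>Each</unit>
    <customfields/>
  </sotransitem>
</sotransitems>"""


class ItemHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects one dict per <sotransitem>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._item = None        # item currently being built
        self._text = []          # character chunks of the current element
        self._field_name = None  # last <customfieldname> seen

    def startElement(self, name, attrs):
        self._text = []
        if name == "sotransitem":
            self._item = {"recordno": None, "unit": None, "customfields": {}}

    def characters(self, content):
        # may be called several times for a single text node
        self._text.append(content)

    def endElement(self, name):
        text = "".join(self._text).strip() or None
        self._text = []
        if self._item is None:
            return
        if name == "recordno":
            self._item["recordno"] = int(text)
        elif name == "unit":
            self._item["unit"] = text
        elif name == "customfieldname":
            self._field_name = text
        elif name == "customfieldvalue":
            self._item["customfields"][self._field_name] = text
        elif name == "sotransitem":
            self.items.append(self._item)
            self._item = None


handler = ItemHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.items)
```

No tree is built here at all, so the handler has to track its own state (the current item and the last field name), which is the price of the lower memory use.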

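Which approach to choose depends on the amount of data stored in the XML file.

If it is just a simple script that retrieves some data from a small file (written once and used only for a short while), use DOM.

If it is a configuration file, or small messages of up to a few megabytes exchanged between servers, DOM with automatic XML-to-class conversion is probably best.

If your data is too large to keep in server memory (e.g. OpenStreetMap's world.xml), or too many messages are parsed at once, you should choose SAX.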
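For the large-file case, a common middle ground is to stream with iterparse and hand items to the caller one at a time, so that no list of all items is ever held. The following is a minimal sketch under that assumption. The iter_items generator and the inline SAMPLE data are hypothetical, and only two fields are extracted.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for a (potentially huge) input file.
SAMPLE = b"""<sotransitems>
  <sotransitem><recordno>40562</recordno><unit>Each</unit></sotransitem>
  <sotransitem><recordno>40563</recordno><unit>Each</unit></sotransitem>
</sotransitems>"""


def iter_items(source):
    """Yield one dict per <sotransitem>; source is a file name or file object."""
    for event, node in ET.iterparse(source, events=("end",)):
        if node.tag == "sotransitem":
            yield {
                "recordno": int(node.findtext("recordno")),
                "unit": node.findtext("unit"),
            }
            # The item is fully read at its "end" event, so its children
            # can be released before the parser moves on.
            node.clear()


items = list(iter_items(io.BytesIO(SAMPLE)))
print(items)
```

In real use the caller would loop over iter_items('test.xml') and process each item immediately instead of building a list. The root element still keeps the emptied sotransitem shells, so for extremely large inputs those would need to be cleared as well.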

Solution 1
0 · accepted · 2016-06-24 01:37:51
