Parsing a large XML file with lxml SAX

Question

I have a huge XML file that looks like this:

<environment>
    <category name='category1'>
        <peoples>
            <people>
                <name>Mary</name>
                <city>NY</city>
                <age>10</age>
            </people>
            <people>
                <name>Jane</name>
                <city>NY</city>
                <age>19</age>
            </people>
            <people>
                <name>John</name>
                <city>NY</city>
                <age>20</age>
            </people>
            <people>
                <name>Carl</name>
                <city>DC</city>
                <age>11</age>
            </people>
            ...
        </people>
    </category>
    <category name='category2'>
    ...
    </category
</environment>

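I want to parse the XML file and output a dictionary whose keys are the category names (category1 and category2 in the example) and whose values are dictionaries that may differ per category. For now I'm only interested in category 1, where I want to build a dictionary whose keys are names and whose values are ages, containing only the people who live in city = NY.

So the final output would look something like this:

{ 'category1': { 'Mary': 10, 'Jane': 19, 'John': 20 }, 'category2': {} }

I first tried using iterparse, but got a memory error: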

from lxml import etree

# The original snippet ended in a bare "return", so it is shown here inside
# a function; the wrapper name is illustrative.
def parse_categories(filename):
    result = {}
    for _, element in etree.iterparse( filename, tag = 'category' ):
        result[element.get('name')] = {}
        if element.get('name') == 'category1':
            persons = {}
            for person in element.findall('peoples/people'):
                name, city, age = person.getchildren()
                if city.text == 'NY':
                    persons[ name.text ] = age.text
            result[element.get('name')] = persons
        element.clear()

    return result

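So my second attempt was to use SAX, but I'm not familiar with it. I started from a script I found here, but couldn't find a way to associate the name with a person's city and age: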

import lxml.etree

class CategoryParser(object):
    def __init__( self, d ):
        self.d = d
    def start( self, tag, attrib ):
        if tag == 'category':
            self.group = self.d[attrib['name']] = {}
        elif tag == 'people':
            # don't know how to access name, city and age for this person
            pass
    def close( self ):
        pass

result = {}
parser = lxml.etree.XMLParser( target=CategoryParser(result) )
lxml.etree.parse( 'file.xml', parser )

What is the best way to achieve the expected result? I'm open to other approaches.

Answer 1

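Your lxml approach looks very close, but I'm not sure why it raises a MemoryError. That said, you can do this quite easily with the built-in xml.etree.ElementTree.

Using this XML (slightly modified from your example):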

xml = '''<environment>
    <category name='category1'>
        <peoples>
            <people>
                <name>Mary</name>
                <city>NY</city>
                <age>10</age>
            </people>
            <people>
                <name>Jane</name>
                <city>NY</city>
                <age>19</age>
            </people>
            <people>
                <name>John</name>
                <city>NY</city>
                <age>20</age>
            </people>
            <people>
                <name>Carl</name>
                <city>DC</city>
                <age>11</age>
            </people>
        </peoples>
    </category>
    <category name='category2'>
        <peoples>
            <people>
                <name>Mike</name>
                <city>NY</city>
                <age>200</age>
            </people>
            <people>
                <name>Jimmy</name>
                <city>HW</city>
                <age>94</age>
            </people>
        </peoples>
    </category>
</environment>'''

I do this:

import xml.etree.ElementTree as ET

root = ET.fromstring(xml)

x = dict()

# Iterate all "category" nodes
for c in root.findall('./category'):

    # Store "name" attribute
    name = c.attrib['name']

    # Insert empty dictionary for current category
    x[name] = {}

    # Iterate all people nodes contained in this category that have
    # a child "city" node matching "NY"
    for p in c.findall('./peoples/people[city="NY"]'):

        # Get text of "name" child node
        # (accessed by iterating parent node)
        # i.e. "list(p)" -> [<Element 'name' at 0x04BB2750>, <Element 'city' at 0x04BB2900>, <Element 'age' at 0x04BB2A50>])
        person_name = next(e for e in p if e.tag == 'name').text

        # Same for "age" node, and convert to int
        person_age = int(next(e for e in p if e.tag == 'age').text)

        # Add entry to current category dictionary
        x[name][person_name] = person_age

This gives me the following dictionary:

{'category1': {'Mary': 10, 'Jane': 19, 'John': 20}, 'category2': {'Mike': 200}}
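If the file is too large to load in one go, the same filtering can be done while streaming. The following is only a sketch using the standard library's iterparse; the function name ny_ages_by_category and the inline SAMPLE are made up for illustration, and it assumes every people node has name, city and age children as in the example. Clearing each people node as soon as it closes keeps its children from piling up, which may be what exhausts memory when you wait for a whole category to finish.

```python
import io
import xml.etree.ElementTree as ET


def ny_ages_by_category(source):
    """Map category name -> {person name: age} for people whose city is NY."""
    result = {}
    current = None  # dictionary of the category currently being read
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "category":
            # Attributes are already populated when the start event fires.
            current = result.setdefault(elem.get("name"), {})
        elif event == "end" and elem.tag == "people":
            if current is not None and elem.findtext("city") == "NY":
                current[elem.findtext("name")] = int(elem.findtext("age"))
            # Drop the children of this person; only an empty shell remains.
            elem.clear()
        elif event == "end" and elem.tag == "category":
            current = None
            elem.clear()  # also drops the empty people shells
    return result


# Small stand-in document (illustrative data only).
SAMPLE = b"""<environment>
  <category name='category1'>
    <peoples>
      <people><name>Mary</name><city>NY</city><age>10</age></people>
      <people><name>Carl</name><city>DC</city><age>11</age></people>
    </peoples>
  </category>
  <category name='category2'>
    <peoples>
      <people><name>Mike</name><city>NY</city><age>200</age></people>
    </peoples>
  </category>
</environment>"""

result = ny_ages_by_category(io.BytesIO(SAMPLE))
print(result)
```

For a real file you would pass the path (or an open binary file) instead of the BytesIO object. Within one category the emptied people shells still accumulate until the category ends, so this bounds memory per person rather than making it strictly constant.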

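Also, a few notes on your example XML (possibly just copy/paste artifacts, but just in case):

Your closing /peoples tag is missing the "s"
Your last closing /category tag is missing its closing ">"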
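If you would rather stay with the SAX-style target interface from your second attempt, the missing piece is the data() callback plus a little state that remembers which person and which child node you are inside. The sketch below uses the standard library's XMLParser, whose target protocol (start, end, data, close) has the same shape as the one lxml accepts. The names CategoryTarget and parse_stream, the chunk size and the SAMPLE data are all illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET


class CategoryTarget:
    """Parser target that collects {category: {name: age}} for city == NY."""

    FIELDS = ("name", "city", "age")

    def __init__(self):
        self.result = {}
        self.group = None   # dictionary of the current category
        self.person = None  # fields collected for the current people node
        self.field = None   # child node currently being read, if any

    def start(self, tag, attrib):
        if tag == "category":
            self.group = self.result.setdefault(attrib["name"], {})
        elif tag == "people":
            self.person = {}
        elif self.person is not None and tag in self.FIELDS:
            self.field = tag
            self.person[tag] = ""

    def data(self, text):
        # Text can arrive in several pieces, so append instead of assigning.
        if self.field is not None:
            self.person[self.field] += text

    def end(self, tag):
        if tag == self.field:
            self.field = None
        elif tag == "people" and self.person is not None:
            if self.group is not None and self.person.get("city") == "NY":
                self.group[self.person["name"]] = int(self.person["age"])
            self.person = None

    def close(self):
        return self.result


def parse_stream(stream, chunk_size=65536):
    """Feed a binary stream to the parser chunk by chunk; no tree is built."""
    parser = ET.XMLParser(target=CategoryTarget())
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        parser.feed(chunk)
    return parser.close()  # returns whatever CategoryTarget.close() returns


# Small stand-in document (illustrative data only).
SAMPLE = b"""<environment>
  <category name='category1'>
    <peoples>
      <people><name>Mary</name><city>NY</city><age>10</age></people>
      <people><name>Carl</name><city>DC</city><age>11</age></people>
    </peoples>
  </category>
  <category name='category2'>
    <peoples>
      <people><name>Jimmy</name><city>HW</city><age>94</age></people>
    </peoples>
  </category>
</environment>"""

print(parse_stream(io.BytesIO(SAMPLE)))
```

Because no element tree is ever built, memory use depends only on the size of the result dictionary. The same class should also work as the target of lxml.etree.XMLParser, though that variant is not exercised here.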

Answer 2

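Since you are using lxml and say you are open to other approaches, consider XSLT, a special-purpose language designed to transform XML documents into various formats, including text files.

Specifically, walk down the tree and build the required braces and quotes around the node values. Because the dictionary you need can also be expressed as valid JSON, you can export the result as a .json file too!

XSLT (save as an .xsl file, which is itself a special kind of .xml file)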

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:variable name="pst">&apos;</xsl:variable>

  <xsl:template match="/environment">
      <xsl:text>{&#xa;</xsl:text>
      <xsl:apply-templates select="category"/>
      <xsl:text>&#xa;}</xsl:text>
  </xsl:template>

  <xsl:template match="category">
      <xsl:value-of select="concat('  ', $pst, @name, $pst, ': {')"/>
      <xsl:apply-templates select="peoples/people[city='NY']"/>
      <xsl:text>}</xsl:text>
      <xsl:if test="position() != last()">
          <xsl:text>,&#xa;</xsl:text>
      </xsl:if>
  </xsl:template>

  <xsl:template match="people">
      <xsl:value-of select="concat($pst, name, $pst, ': ', age)"/>
      <xsl:if test="position() != last()">
          <xsl:text>, </xsl:text>
      </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Python (no for loops, if logic, or def definitions)

import ast
import json
import lxml.etree as et

# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('Script.xsl')

# TRANSFORM INPUT
transformer = et.XSLT(xsl)
output_str = transformer(xml)

# BUILD DICT LITERALLY
new_dict = ast.literal_eval(str(output_str))

print(new_dict)
# {'category1': {'Mary': 10, 'Jane': 19, 'John': 20}}

# OUTPUT JSON
# The stylesheet quotes keys with apostrophes, which JSON does not allow,
# so serialize the parsed dict instead of writing the raw XSLT text.
with open('Output.json', 'w') as f:
    json.dump(new_dict, f)

# {"category1": {"Mary": 10, "Jane": 19, "John": 20}}

Online demo (nodes expanded for demonstration)
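One detail worth checking before treating the transformed text as JSON: the stylesheet wraps keys in apostrophes, which Python's ast.literal_eval accepts but a JSON parser rejects. The sketch below illustrates the difference with a hand-written string standing in for the stylesheet's output (an assumption, not captured output), then re-serializes the parsed dictionary with the json module.

```python
import ast
import json

# Hand-written stand-in for the text the stylesheet emits (assumed shape).
xslt_text = "{\n  'category1': {'Mary': 10, 'Jane': 19, 'John': 20}\n}"

# A Python literal with single-quoted keys parses fine.
as_dict = ast.literal_eval(xslt_text)

# The same text is not valid JSON, because JSON requires double quotes.
try:
    json.loads(xslt_text)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# Re-serializing the dictionary yields text any JSON parser will accept.
json_text = json.dumps(as_dict)

print(is_valid_json)
print(json_text)
```

An alternative would be to emit double quotes from the stylesheet itself, in which case json.loads could replace ast.literal_eval.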

Solution 1
0 2019-12-03 20:16:21

Solution 2
0 2019-12-03 20:47:25
