Parsing a large XML file with lxml SAX

I have a huge XML file that looks like this:

<environment>
    <category name='category1'>
        <peoples>
            <people>
                <name>Mary</name>
                <city>NY</city>
                <age>10</age>
            </people>
            <people>
                <name>Jane</name>
                <city>NY</city>
                <age>19</age>
            </people>
            <people>
                <name>John</name>
                <city>NY</city>
                <age>20</age>
            </people>
            <people>
                <name>Carl</name>
                <city>DC</city>
                <age>11</age>
            </people>
            ...
        </people>
    </category>
    <category name='category2'>
    ...
    </category
</environment>

I want to parse the XML file into a dictionary whose keys are the category names (category1, category2 in the example) and whose values are dictionaries that may differ for each category. For now I'm only interested in category1, where I want a dictionary whose keys are names and whose values are ages, containing only the people who live in city = NY.

So the final output will be something like this:

{ 'category1': { 'Mary': 10, 'Jane': 19, 'John': 20 }, 'category2': {} }

I tried first with iterparse but got a memory error:

from lxml import etree

def parse_categories(path='file.xml'):
    result = {}
    # Only 'end' events for <category> are reported, so each category
    # subtree is fully built before it is processed and cleared
    for _, element in etree.iterparse(path, tag='category'):
        result[element.get('name')] = {}
        if element.get('name') == 'category1':
            persons = {}
            for person in element.findall('peoples/people'):
                name, city, age = person.getchildren()
                if city.text == 'NY':
                    persons[name.text] = age.text
            result[element.get('name')] = persons
        element.clear()

    return result

So my second attempt was to use SAX, but I'm not familiar with it. I started by taking a script from here but couldn't find a way to associate the name with the city and age of a person:

import lxml.etree

class CategoryParser(object):
    def __init__( self, d ):
        self.d = d
    def start( self, tag, attrib ):
        if tag == 'category':
            # New category: create its dictionary and remember it
            self.group = self.d[attrib['name']] = {}
        elif tag == 'people':
            #don't know how to access name, city and age for this person
            pass
    def close( self ):
        pass

result = {}
parser = lxml.etree.XMLParser( target=CategoryParser(result) )
lxml.etree.parse( 'file.xml', parser )

What would be the best way of achieving the desired result? I'm open to using other approaches.

Your lxml approach looked pretty close, but I'm not sure why it's giving the MemoryError. You can do this pretty easily with the built-in xml.etree.ElementTree, though.

Using this XML (slightly modified from your sample):

xml = '''<environment>
    <category name='category1'>
        <peoples>
            <people>
                <name>Mary</name>
                <city>NY</city>
                <age>10</age>
            </people>
            <people>
                <name>Jane</name>
                <city>NY</city>
                <age>19</age>
            </people>
            <people>
                <name>John</name>
                <city>NY</city>
                <age>20</age>
            </people>
            <people>
                <name>Carl</name>
                <city>DC</city>
                <age>11</age>
            </people>
        </peoples>
    </category>
    <category name='category2'>
        <peoples>
            <people>
                <name>Mike</name>
                <city>NY</city>
                <age>200</age>
            </people>
            <people>
                <name>Jimmy</name>
                <city>HW</city>
                <age>94</age>
            </people>
        </peoples>
    </category>
</environment>'''

I do this:

import xml.etree.ElementTree as ET

root = ET.fromstring(xml)

x = dict()

# Iterate all "category" nodes
for c in root.findall('./category'):

    # Store "name" attribute
    name = c.attrib['name']

    # Insert empty dictionary for current category
    x[name] = {}

    # Iterate all people nodes contained in this category that have
    # a child "city" node matching "NY"
    for p in c.findall('./peoples/people[city="NY"]'):

        # Get text of "name" child node
        # (accessed by iterating parent node)
        # i.e. "list(p)" -> [<Element 'name' at 0x04BB2750>, <Element 'city' at 0x04BB2900>, <Element 'age' at 0x04BB2A50>])
        person_name = next(e for e in p if e.tag == 'name').text

        # Same for "age" node, and convert to int
        person_age = int(next(e for e in p if e.tag == 'age').text)

        # Add entry to current category dictionary
        x[name][person_name] = person_age

Which gives me the following dictionary:

{'category1': {'Mary': 10, 'Jane': 19, 'John': 20}, 'category2': {'Mike': 200}}
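If the real file is too large to hold as one tree, the same filter can be applied while streaming. The following is a minimal sketch using the standard library's `ET.iterparse`; the function name `ny_people_by_category` and the shortened `sample` document are made up for illustration. It handles each `<people>` element as soon as its end tag arrives and clears it right away, instead of waiting for the whole `<category>` to finish. Whether that is what caused the MemoryError in the question is an assumption, not something the question confirms.

```python
import io
import xml.etree.ElementTree as ET

def ny_people_by_category(source, city="NY"):
    """Stream `source` and return {category name: {person name: age}}."""
    result = {}
    current = None  # dictionary of the category currently being read
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "category":
            # Attributes are already available on the start event
            current = result.setdefault(elem.get("name"), {})
        elif event == "end" and elem.tag == "people":
            if current is not None and elem.findtext("city") == city:
                current[elem.findtext("name")] = int(elem.findtext("age"))
            elem.clear()  # free the children of the finished <people>
        elif event == "end" and elem.tag == "peoples":
            elem.clear()  # drop the emptied <people> shells as well
    return result

# Hypothetical, shortened sample document
sample = b"""<environment>
  <category name='category1'><peoples>
    <people><name>Mary</name><city>NY</city><age>10</age></people>
    <people><name>Carl</name><city>DC</city><age>11</age></people>
  </peoples></category>
  <category name='category2'><peoples/></category>
</environment>"""

print(ny_people_by_category(io.BytesIO(sample)))
```

For a file on disk, a path such as 'file.xml' can be passed in place of the `io.BytesIO` object. Empty `<people>` shells still accumulate until their enclosing `<peoples>` closes, so memory use is reduced rather than strictly constant.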

Also, a few notes on your sample XML (which may have just been copy/paste artifacts, but just in case):

  • Your closing /peoples node was missing the "s"
  • Your last closing /category node was missing a closing ">"
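As for the SAX-style attempt in the question, the missing piece is that a parser target also receives `data` and `end` callbacks, so the text of name, city and age can be buffered and the person evaluated when `</people>` arrives. Below is a minimal sketch with the standard library's `ET.XMLParser(target=...)`; lxml documents the same `start`/`end`/`data`/`close` target methods, but the class name `CategoryTarget` and its attributes are hypothetical and this has not been run against the original file.

```python
import xml.etree.ElementTree as ET

class CategoryTarget:
    """Parser target collecting {category: {name: age}} for one city."""

    def __init__(self, city="NY"):
        self.city = city
        self.result = {}
        self.group = None    # dictionary of the category being read
        self.person = None   # fields of the <people> being read
        self.text = []       # character data of the current leaf element

    def start(self, tag, attrib):
        if tag == "category":
            self.group = self.result.setdefault(attrib["name"], {})
        elif tag == "people":
            self.person = {}
        self.text = []

    def data(self, data):
        self.text.append(data)  # text may arrive in several chunks

    def end(self, tag):
        if tag in ("name", "city", "age") and self.person is not None:
            self.person[tag] = "".join(self.text).strip()
        elif tag == "people":
            # All three fields are known now, so the filter can be applied
            if self.person.get("city") == self.city:
                self.group[self.person["name"]] = int(self.person["age"])
            self.person = None
        self.text = []

    def close(self):
        return self.result

parser = ET.XMLParser(target=CategoryTarget())
parser.feed(
    "<environment><category name='category1'><peoples>"
    "<people><name>Mary</name><city>NY</city><age>10</age></people>"
    "<people><name>Carl</name><city>DC</city><age>11</age></people>"
    "</peoples></category></environment>"
)
result = parser.close()  # returns whatever the target's close() returns
print(result)
```

Because no tree is built, memory use stays small; a large file can be passed to `parser.feed` in chunks read from a file opened in binary mode.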

Since you use lxml and indicated that you are open to other approaches, consider XSLT, the special-purpose language designed to transform XML documents into various formats, including text files.

Specifically, walk down your tree and build the needed braces and quotes around the node values. And because the dictionary you need maps directly onto JSON, you can also export the result as a .json file.

XSLT (save as an .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:variable name="pst">&apos;</xsl:variable>

  <xsl:template match="/environment">
      <xsl:text>{&#xa;</xsl:text>
      <xsl:apply-templates select="category"/>
      <xsl:text>&#xa;}</xsl:text>
  </xsl:template>

  <xsl:template match="category">
      <xsl:value-of select="concat('  ', $pst, @name, $pst, ': {')"/>
      <xsl:apply-templates select="peoples/people[city='NY']"/>
      <xsl:text>}</xsl:text>
      <xsl:if test="position() != last()">
          <xsl:text>,&#xa;</xsl:text>
      </xsl:if>
  </xsl:template>

  <xsl:template match="people">
      <xsl:value-of select="concat($pst, name, $pst, ': ', age)"/>
      <xsl:if test="position() != last()">
          <xsl:text>, </xsl:text>
      </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Python (no for loops, if logic, or def builds)

import ast
import lxml.etree as et

# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('Script.xsl')

# TRANSFORM INPUT
transformer = et.XSLT(xsl)
output_str = transformer(xml)

# BUILD DICT LITERALLY
new_dict = ast.literal_eval(str(output_str))

print(new_dict)
# {'category1': {'Mary': 10, 'Jane': 19, 'John': 20} }

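# OUTPUT JSON
# The stylesheet emits single-quoted keys, which strict JSON parsers reject,
# so serialize the parsed dict instead of writing the raw XSLT text
import json

with open('Output.json', 'w') as f:
    json.dump(new_dict, f)

# {"category1": {"Mary": 10, "Jane": 19, "John": 20}}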

Online Demo (with expanded nodes for demonstration)
