Parsing large xml file with lxml SAX

I have a huge XML file that looks like this:

<environment>
    <category name='category1'>
        <peoples>
            <people>
                <name>Mary</name>
                <city>NY</city>
                <age>10</age>
            </people>
            <people>
                <name>Jane</name>
                <city>NY</city>
                <age>19</age>
            </people>
            <people>
                <name>John</name>
                <city>NY</city>
                <age>20</age>
            </people>
            <people>
                <name>Carl</name>
                <city>DC</city>
                <age>11</age>
            </people>
            ...
        </people>
    </category>
    <category name='category2'>
    ...
    </category
</environment>

I want to parse the XML file into a dictionary whose keys are the category names (category1, category2 in the example) and whose values are dictionaries that may differ per category. For now I'm only interested in category1, where I want a dictionary whose keys are names and whose values are ages, containing only the people who live in city = NY.

So final output will be something like this:

{ 'category1': { 'Mary': 10, 'Jane': 19, 'John': 20 }, 'category2': {} }

I tried first with iterparse but got a memory error:

from lxml import etree

result = {}
for _, element in etree.iterparse('file.xml', tag='category'):
    result[element.get('name')] = {}
    if element.get('name') == 'category1':
        persons = {}
        for person in element.findall('peoples/people'):
            # Each <people> element has exactly three children, in this order
            name, city, age = list(person)
            if city.text == 'NY':
                persons[name.text] = int(age.text)
        result[element.get('name')] = persons
    element.clear()

print(result)

So my second attempt was to use SAX but I'm not familiar with it. I started by taking a script from here but couldn't find a way to associate the name with the city and age of a person:

import lxml.etree

class CategoryParser(object):
    def __init__(self, d):
        self.d = d
    def start(self, tag, attrib):
        if tag == 'category':
            self.group = self.d[attrib['name']] = {}
        elif tag == 'people':
            # don't know how to access name, city and age for this person
            pass
    def close(self):
        pass

result = {}
parser = lxml.etree.XMLParser(target=CategoryParser(result))
lxml.etree.parse('file.xml', parser)

What would be the best way of achieving the desired result? I'm open to other approaches.

Your lxml approach looks pretty close, and I'm not sure why it raises a MemoryError. You can do this pretty easily with the built-in xml.etree.ElementTree, though.

Using this xml (slightly modified from your sample):

xml = '''<environment>
    <category name='category1'>
        <peoples>
            <people>
                <name>Mary</name>
                <city>NY</city>
                <age>10</age>
            </people>
            <people>
                <name>Jane</name>
                <city>NY</city>
                <age>19</age>
            </people>
            <people>
                <name>John</name>
                <city>NY</city>
                <age>20</age>
            </people>
            <people>
                <name>Carl</name>
                <city>DC</city>
                <age>11</age>
            </people>
        </peoples>
    </category>
    <category name='category2'>
        <peoples>
            <people>
                <name>Mike</name>
                <city>NY</city>
                <age>200</age>
            </people>
            <people>
                <name>Jimmy</name>
                <city>HW</city>
                <age>94</age>
            </people>
        </peoples>
    </category>
</environment>'''

I do this:

import xml.etree.ElementTree as ET

root = ET.fromstring(xml)

x = dict()

# Iterate all "category" nodes
for c in root.findall('./category'):

    # Store "name" attribute
    name = c.attrib['name']

    # Insert empty dictionary for current category
    x[name] = {}

    # Iterate all people nodes contained in this category that have
    # a child "city" node matching "NY"
    for p in c.findall('./peoples/people[city="NY"]'):

        # Get text of the "name" child node
        person_name = p.findtext('name')

        # Same for the "age" node, converted to int
        person_age = int(p.findtext('age'))

        # Add entry to current category dictionary
        x[name][person_name] = person_age

Which gives me the following dictionary:

{'category1': {'Mary': 10, 'Jane': 19, 'John': 20}, 'category2': {'Mike': 200}}
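Note that ET.fromstring builds the whole tree in memory, which may not be an option for a truly huge file. A minimal streaming sketch with the standard library's ET.iterparse follows, assuming the same element layout as the sample; the function name collect_people and the small SAMPLE document are illustrative only. Each people element is handled as soon as its end tag is seen and then cleared, so a whole category never has to be held at once.

```python
import io
import xml.etree.ElementTree as ET

# Small stand-in document (illustrative); a real run would pass a file path instead
SAMPLE = b"""<environment>
    <category name='category1'>
        <peoples>
            <people><name>Mary</name><city>NY</city><age>10</age></people>
            <people><name>Carl</name><city>DC</city><age>11</age></people>
        </peoples>
    </category>
    <category name='category2'>
        <peoples>
            <people><name>Jimmy</name><city>HW</city><age>94</age></people>
        </peoples>
    </category>
</environment>"""


def collect_people(source, city='NY'):
    """Stream the document and return {category: {name: age}} for one city."""
    result = {}
    current = None  # dictionary of the category currently being read
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start' and elem.tag == 'category':
            # Attributes are already available when the start tag is reported
            current = result[elem.get('name')] = {}
        elif event == 'end' and elem.tag == 'people':
            if current is not None and elem.findtext('city') == city:
                current[elem.findtext('name')] = int(elem.findtext('age'))
            # Drop this person's children and text; the emptied element
            # itself stays attached to its parent
            elem.clear()
    return result


print(collect_people(io.BytesIO(SAMPLE)))
# {'category1': {'Mary': 10}, 'category2': {}}
```

The emptied people elements still accumulate under their parent, so memory grows slowly with the number of records; whether that matters depends on the file.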

Also, a few notes on your sample xml (which may have just been copy/paste artifacts, but just in case):

  • Your closing /peoples node was missing the "s"
  • Your last closing /category node was missing a closing ">"
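As for the SAX-style attempt in the question: one way to tie a name to its city and age is to buffer the character data of each child element and decide what to keep when the closing people tag arrives. The sketch below uses the standard library's xml.etree.ElementTree.XMLParser with a target object; lxml's target parser interface uses the same start/end/data/close method names, so the class should carry over, though that is not tested here. CategoryTarget, parse_stream and SAMPLE are illustrative names, and the element layout is assumed to match the sample.

```python
import io
import xml.etree.ElementTree as ET

# Small stand-in document (illustrative)
SAMPLE = b"""<environment>
    <category name='category1'>
        <peoples>
            <people><name>Mary</name><city>NY</city><age>10</age></people>
            <people><name>Carl</name><city>DC</city><age>11</age></people>
        </peoples>
    </category>
    <category name='category2'>
        <peoples>
            <people><name>Jimmy</name><city>HW</city><age>94</age></people>
        </peoples>
    </category>
</environment>"""


class CategoryTarget:
    """Parser target collecting {category: {name: age}} for one city."""

    def __init__(self, city='NY'):
        self.city = city
        self.result = {}
        self.group = None   # dictionary of the current category
        self.person = None  # fields of the <people> element being read
        self.text = []      # character data of the current leaf element

    def start(self, tag, attrib):
        if tag == 'category':
            self.group = self.result[attrib['name']] = {}
        elif tag == 'people':
            self.person = {}
        self.text = []

    def data(self, data):
        # Text may arrive in several chunks, so collect and join later
        self.text.append(data)

    def end(self, tag):
        if tag in ('name', 'city', 'age') and self.person is not None:
            self.person[tag] = ''.join(self.text).strip()
        elif tag == 'people':
            # All three fields are known once the person is closed
            if self.person.get('city') == self.city:
                self.group[self.person['name']] = int(self.person['age'])
            self.person = None
        self.text = []

    def close(self):
        return self.result


def parse_stream(stream, chunk_size=65536):
    """Feed a binary stream to the parser in chunks; no tree is ever built."""
    parser = ET.XMLParser(target=CategoryTarget())
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        parser.feed(chunk)
    return parser.close()  # returns whatever the target's close() returns


print(parse_stream(io.BytesIO(SAMPLE)))
# {'category1': {'Mary': 10}, 'category2': {}}
```

For a real file, open('file.xml', 'rb') could be passed in place of the BytesIO object.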

Since you use lxml and said you are open to other approaches, consider XSLT, the special-purpose language designed to transform XML documents into various formats, including text files.

Specifically, walk down the tree and build the needed braces and quotes from the node values. Because the dictionary you need can also be expressed as JSON, you can save the result as a .json file as well.

XSLT (save as an .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:variable name="pst">&apos;</xsl:variable>

  <xsl:template match="/environment">
      <xsl:text>{&#xa;</xsl:text>
      <xsl:apply-templates select="category"/>
      <xsl:text>&#xa;}</xsl:text>
  </xsl:template>

  <xsl:template match="category">
      <xsl:value-of select="concat('  ', $pst, @name, $pst, ': {')"/>
      <xsl:apply-templates select="peoples/people[city='NY']"/>
      <xsl:text>}</xsl:text>
      <xsl:if test="position() != last()">
          <xsl:text>,&#xa;</xsl:text>
      </xsl:if>
  </xsl:template>

  <xsl:template match="people">
      <xsl:value-of select="concat($pst, name, $pst, ': ', age)"/>
      <xsl:if test="position() != last()">
          <xsl:text>, </xsl:text>
      </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Python (no for loops, if logic, or def builds)

import ast
import json
import lxml.etree as et

# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('Script.xsl')

# TRANSFORM INPUT
transformer = et.XSLT(xsl)
output_str = transformer(xml)

# BUILD DICT LITERALLY
new_dict = ast.literal_eval(str(output_str))

print(new_dict)
# {'category1': {'Mary': 10, 'Jane': 19, 'John': 20}}

# OUTPUT JSON
# (the stylesheet emits single-quoted keys, so serialize the dict with json
# to get the double quotes that strict JSON requires)
with open('Output.json', 'w') as f:
    json.dump(new_dict, f)

# {"category1": {"Mary": 10, "Jane": 19, "John": 20}}
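One detail worth checking: the text the stylesheet produces uses single-quoted keys, so it parses as a Python literal but not as strict JSON. The short sketch below illustrates the difference with the standard library only; xslt_text is a hand-written stand-in for the transformer output, not something produced by running the stylesheet.

```python
import ast
import json

# Assumed stand-in for the text emitted by the stylesheet (single-quoted keys)
xslt_text = "{\n  'category1': {'Mary': 10, 'Jane': 19, 'John': 20},\n  'category2': {}\n}"

# Strict JSON requires double-quoted strings, so this text is rejected
try:
    json.loads(xslt_text)
    is_json = True
except json.JSONDecodeError:
    is_json = False

# It is a valid Python literal, though, and the resulting dict serializes cleanly
new_dict = ast.literal_eval(xslt_text)
json_text = json.dumps(new_dict)

print(is_json)    # False
print(json_text)  # {"category1": {"Mary": 10, "Jane": 19, "John": 20}, "category2": {}}
```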

Online Demo (with expanded nodes for demonstration)
