
Cleaning up large XML files in Python (stream parse)

I tried to use Python to clean up some messy XML files. The cleanup needs to do three things:

  1. Convert tag names (roughly 40%-50% of them) from upper case to lower case
  2. Remove NULL text between tags
  3. Remove empty rows between tags

I did this using BeautifulSoup; however, I ran into memory issues since some of my XML files are over 1 GB. Instead, I looked into streaming methods like xml.sax, but I did not quite get the approach. Can anyone give me some suggestions?

import re
from bs4 import BeautifulSoup

xml_str = """
<DATA>

    <ROW>
        <assmtid>1</assmtid>
        <Year>1988</Year>
    </ROW>

    <ROW>
        <assmtid>2</assmtid>
        <Year>NULL</Year>
    </ROW>

    <ROW>
        <assmtid>2</assmtid>
        <Year>1990</Year>
    </ROW>

</DATA>
"""

# strip the NULL placeholders; the "lxml" parser also lowercases every tag name
xml_str_update = re.sub(r">NULL<", "><", xml_str)
soup = BeautifulSoup(xml_str_update, "lxml")
print(soup.data.prettify().strip())

Update

After some testing and taking suggestions from Jarrod Roberson, below is one possible solution.

import xml.etree.ElementTree as etree
from io import StringIO

def getelements(xml_str):
    context = etree.iterparse(StringIO(xml_str), events=('start', 'end'))
    event, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag == "ROW":
            elem.tag = elem.tag.lower()
            elem.text = "\n\t\t"
            elem.tail = "\n\t"

            for child in elem:
                child.tag = child.tag.lower()
                # NULL values and missing text both become empty strings;
                # if you do not like self-closing tags, use &#x200B;
                # (a zero-width space) instead of ""
                if child.text == "NULL" or child.text is None:
                    child.text = ""
            yield elem
            root.clear()  # drop rows already written, keeping memory usage flat

with open(pth_to_output_xml, 'w', encoding='utf-8') as f:
    # open the root element
    f.write('<data>\n\t')
    for page in getelements(xml_str):
        f.write(etree.tostring(page, encoding='unicode'))
    # close the root element
    f.write('</data>')
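For files that are actually gigabytes in size, building the whole string in memory and wrapping it in a StringIO defeats the purpose of streaming. The same generator idea can read directly from an open file object instead. A minimal sketch (the function name and file names are my own, not from the original post; pretty-printing is omitted for brevity):

```python
import xml.etree.ElementTree as etree

def clean_rows(source):
    """Yield cleaned ROW elements one at a time from a file object or path."""
    context = etree.iterparse(source, events=('start', 'end'))
    event, root = next(context)  # keep the root so processed rows can be freed
    for event, elem in context:
        if event == 'end' and elem.tag == 'ROW':
            elem.tag = elem.tag.lower()
            for child in elem:
                child.tag = child.tag.lower()
                if child.text in (None, 'NULL'):
                    child.text = ''
            yield elem
            root.clear()  # memory stays flat regardless of file size

# usage: stream from disk to disk without loading the whole file
# with open('input.xml', 'rb') as src, open('output.xml', 'w', encoding='utf-8') as dst:
#     dst.write('<data>')
#     for row in clean_rows(src):
#         dst.write(etree.tostring(row, encoding='unicode'))
#     dst.write('</data>')
```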

Iterative parsing

When building an in-memory tree is not desired or practical, use an iterative parsing technique that does not rely on reading the entire source file. lxml offers two approaches:

  1. Supplying a target parser class
  2. Using the iterparse method
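The first approach, a target parser class, works in the standard library as well as in lxml: the parser calls methods on your object as it reads, and no tree is ever built. A minimal sketch using the stdlib `xml.etree.ElementTree.XMLParser` (the `TagCollector` class is made up for illustration):

```python
import xml.etree.ElementTree as etree

class TagCollector:
    """Target object: the parser calls these methods as it reads the input."""
    def __init__(self):
        self.tags = []
    def start(self, tag, attrib):   # called at every opening tag
        self.tags.append(tag)
    def end(self, tag):             # called at every closing tag
        pass
    def data(self, text):           # called for text between tags
        pass
    def close(self):                # called at end of input; its value is returned
        return self.tags

parser = etree.XMLParser(target=TagCollector())
parser.feed("<DATA><ROW><Year>1988</Year></ROW></DATA>")
print(parser.close())  # ['DATA', 'ROW', 'Year']
```

For the cleanup task above, the `start`/`data`/`end` methods could lowercase tags and drop NULL text, writing output as they go.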

import xml.etree.ElementTree as etree

for event, elem in etree.iterparse(xml_path, events=('start', 'end', 'start-ns', 'end-ns')):
    print(event, elem)

Here is a very complete tutorial on how to do this.

This parses the XML file in chunks and hands you each event as it is encountered. start triggers when an opening tag is first read; at that point elem is empty except for elem.attrib, which holds the tag's attributes. end triggers when the closing tag is read and everything in between has been parsed.

Then in your event handlers you just write out the transformed information as it is encountered.
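To make the event order concrete, here is a small self-contained run over a made-up two-element document, recording what is visible at each event:

```python
import xml.etree.ElementTree as etree
from io import BytesIO

xml_bytes = b"<DATA><ROW id='1'><Year>1988</Year></ROW></DATA>"

events = []
for event, elem in etree.iterparse(BytesIO(xml_bytes), events=('start', 'end')):
    if event == 'start':
        # at 'start' only the tag name and attributes are reliable
        events.append(('start', elem.tag, dict(elem.attrib)))
    else:
        # at 'end' the element's text and children are fully read
        events.append(('end', elem.tag, elem.text))

for e in events:
    print(e)
```

The run shows three start events (DATA, ROW, Year) followed by three end events in the reverse nesting order, with `1988` only available at Year's end event.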
