I tried to use Python to clean up some messy XML files, which involves three things:
I did this using BeautifulSoup; however, I ran into memory issues because some of my XML files are over 1 GB. I then looked into streaming approaches such as xml.sax, but I did not quite understand how to use them. Can anyone give me some suggestions?
import re
from bs4 import BeautifulSoup

xml_str = """
<DATA>
<ROW>
<assmtid>1</assmtid>
<Year>1988</Year>
</ROW>
<ROW>
<assmtid>2</assmtid>
<Year>NULL</Year>
</ROW>
<ROW>
<assmtid>2</assmtid>
<Year>1990</Year>
</ROW>
</DATA>
"""
xml_str_update = re.sub(r">NULL", ">", xml_str)
soup = BeautifulSoup(xml_str_update, "lxml")
print soup.data.prettify().encode('utf-8').strip()
After some testing and taking suggestions from Jarrod Roberson, below is one possible solution.
import os
import xml.etree.cElementTree as etree
from cStringIO import StringIO
def getelements(xml_str):
    context = iter(etree.iterparse(StringIO(xml_str), events=('start', 'end')))
    # the first event is the 'start' of the root element
    event, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == "ROW":
            elem.tag = elem.tag.lower()
            elem.text = "\n\t\t"
            elem.tail = "\n\t"
            for child in elem:
                child.tag = child.tag.lower()
                if child.text == "NULL" or child.text is None:
                    # an empty string is serialized as a self-closing tag;
                    # if you do not like that, use u"\u200b" (a zero-width space) instead
                    child.text = ""
            # print event, elem.tag
            yield elem
            # free the rows that have already been yielded
            root.clear()

# pth_to_output_xml is the path of the output file, defined elsewhere
with open(pth_to_output_xml, 'wb') as file:
    # start root
    file.write('<data>\n\t')
    for page in getelements(xml_str):
        file.write(etree.tostring(page, encoding='utf-8'))
    # close root
    file.write('</data>')
When building an in-memory tree is not desired or practical, use an iterative parsing technique that does not rely on reading the entire source file. lxml offers two approaches: supplying a target parser class, and using the iterparse method.
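As a rough illustration of the first approach, here is a minimal sketch of a target parser class. It assumes Python 3 and uses the standard library's xml.etree.ElementTree rather than lxml (both accept a target object with start, data, end and close methods). The RowCounter class and the count_rows helper are hypothetical names invented for this example; they only count rows and NULL values instead of rewriting anything.

```python
import xml.etree.ElementTree as ET


class RowCounter:
    """Hypothetical parser target: counts <ROW> elements and NULL values."""

    def __init__(self):
        self.rows = 0
        self.nulls = 0
        self._text = []

    def start(self, tag, attrib):
        # Called for every opening tag; start a fresh text buffer.
        self._text = []
        if tag == "ROW":
            self.rows += 1

    def data(self, data):
        # Text can arrive in several pieces, so collect it.
        self._text.append(data)

    def end(self, tag):
        # Called for every closing tag; the buffered text is now complete.
        if "".join(self._text).strip() == "NULL":
            self.nulls += 1
        self._text = []

    def close(self):
        # Whatever close() returns is what parser.close() returns.
        return self.rows, self.nulls


def count_rows(chunks):
    """Feed an iterable of str/bytes chunks to the parser, one at a time."""
    parser = ET.XMLParser(target=RowCounter())
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()
```

Because the target never builds a tree, memory use stays flat no matter how large the input is; the chunks could come from reading a file in fixed-size blocks. The second approach, iterparse, is the one shown in the rest of this answer.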
import xml.etree.ElementTree as etree
# xmL is the path to, or a file object for, the XML source
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
    print event, elem
Here is a very complete tutorial on how to do this.
This parses the XML file incrementally and hands you each element as it is encountered. A start event fires when an opening tag is first seen; at that point elem is empty except for elem.attrib, which holds the attributes of the tag. An end event fires when the closing tag is seen and everything in between has been read.
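To make the event order concrete, here is a small sketch, assuming Python 3 and a tiny in-memory sample in place of a real file. The trace_events helper and the SAMPLE constant are made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"<DATA><ROW><assmtid>1</assmtid></ROW></DATA>"


def trace_events(source):
    """Return the (event, tag) pairs in the order iterparse reports them."""
    return [(event, elem.tag)
            for event, elem in ET.iterparse(source, events=("start", "end"))]


for event, tag in trace_events(io.BytesIO(SAMPLE)):
    print(event, tag)
# start DATA
# start ROW
# start assmtid
# end assmtid
# end ROW
# end DATA
```

Each element gets its start event before any of its children and its end event after all of them, so the end event is the safe place to read elem.text or the child elements.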
Then in your event handlers you just write out the transformed information as it is encountered.
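One way this could look for the cleanup in the question is sketched below, assuming Python 3.4+ and the standard library only. The clean_rows function and its row_tag and root_tag parameters are hypothetical names chosen for this example, and the output formatting is simplified to one row per line.

```python
import io
import xml.etree.ElementTree as ET


def clean_rows(source, sink, row_tag="ROW", root_tag="data"):
    """Stream rows from source to sink, lower-casing tags and blanking NULLs.

    source: a path or a binary file object; sink: a binary file object.
    """
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the root's "start"
    sink.write("<{}>\n".format(root_tag).encode("utf-8"))
    for event, elem in context:
        if event == "end" and elem.tag == row_tag:
            elem.tag = elem.tag.lower()
            elem.tail = "\n"
            for child in elem:
                child.tag = child.tag.lower()
                if child.text is None or child.text.strip() == "NULL":
                    child.text = ""
            # short_empty_elements=False writes <year></year>, not <year />
            sink.write(ET.tostring(elem, encoding="utf-8",
                                   short_empty_elements=False))
            root.clear()  # drop the rows that were already written
    sink.write("</{}>\n".format(root_tag).encode("utf-8"))
```

For a large file, this could be called as clean_rows("input.xml", out) inside a with open("output.xml", "wb") as out: block, where both file names are placeholders. Only one row is held in memory at a time, because root.clear() discards each row once it has been written.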