
Parsing a large XML file with iterparse in Python

This has been driving me nuts all day, and I would appreciate some help with parsing a large XML file.

The file contains over 900,000 lines and is downloaded in gzip format. I had something working with minidom on an extract of the data for testing, but that is not going to cut it for the full file, so I'm looking at iterparse. I can't get any of the examples to work, even to the point of getting import errors. The only import I can get to work is import xml.etree.cElementTree, and that barely seems to work with most of the code examples I have found.

I did get close with iterparse and cElementTree:

def buildit(file):
        print file
        #with open(file) as line:
        #print line
        for  event, elem in et.iterparse(file):
                with open(file, "r") as line:
                        for event, elem in et.iterparse(file):
                                print elem.tag
                                if event =='end' and elem.tag=='Journey':
                                        print elem.tag
                                        time.sleep(0.5)
                                        elm.clear

But this prints out the following:

{http://www.website.com/ixid/xmlfile/v8}Journey
{http://www.website.com/ixid/xmlfile/v8}OR
{http://www.website.com/ixid/xmlfile/v8}PP
{http://www.website.com/ixid/xmlfile/v8}IP
{http://www.website.com/ixid/xmlfile/v8}PP
{http://www.website.com/ixid/xmlfile/v8}IP

Notice how it is putting something from the top-level element in front of every tag. The sample XML is below. Thanks in advance for any help.

<?xml version="1.0" encoding="utf-8"?>
<PportTimetable xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" timetableID="20160421020832" xmlns="http://www.website.com/ixid/xmlfile/v8">
  <Journey rid="201604211191598" uid="G61365" trainId="1T02" ssd="2016-04-21" toc="SR" trainCat="XX">
    <OR tpl="PERTH" act="TBK " plat="3" ptd="05:18" wtd="05:18" />
    <PP tpl="HILTONJ" wtp="05:22" />
    <IP tpl="GLNEGLS" act="T " plat="1" pta="05:33" ptd="05:33" wta="05:32:30" wtd="05:33:30" />
    <PP tpl="BLFD" wtp="05:37:30" />
    <IP tpl="DUNANE" act="T " plat="1" pta="05:45" ptd="05:46" wta="05:45" wtd="05:46" />
    <IP tpl="BGOALAN" act="T " plat="1" pta="05:49" ptd="05:49" wta="05:49" wtd="05:49:30" />
    <IP tpl="STIRLNG" act="T K " plat="3" pta="05:53" ptd="05:54" wta="05:53" wtd="05:54" />
    <IP tpl="LARBERT" act="T " plat="1" pta="06:03" ptd="06:03" wta="06:02:30" wtd="06:03" />
    <PP tpl="LARBERJ" wtp="06:04:30" />
    <PP tpl="CRMRSWJ" wtp="06:05" />
    <PP tpl="GNHLLJN" wtp="06:09" />
    <OPIP tpl="CMBRNLD" act="C N " plat="1" wta="06:22" wtd="06:24" />
    <PP tpl="GRNQNNJ" wtp="06:30" />
    <PP tpl="GSHRSJN" wtp="06:33" />
    <PP tpl="COATBDC" wtp="06:36:30" />
    <PP tpl="LGLNJN" wtp="06:38" />
    <PP tpl="CARMYLE" plat="1" wtp="06:49" />
    <PP tpl="RTHGNEJ" wtp="06:53:30" />
    <PP tpl="SHFD" wtp="06:56" />
    <PP tpl="LRKFLDJ" wtp="06:59" />
    <PP tpl="EGLNSTJ" wtp="07:01:30" />
    <PP tpl="GLGCBSJ" wtp="07:02:30" />
    <DT tpl="GLGC" act="TF" pta="07:05" wta="07:05" />
  </Journey>
  <Journey rid="201604211192476" uid="G64015" trainId="2N41" ssd="2016-04-21" toc="SR">
    <OR tpl="GLGQLL" act="TB" plat="8" ptd="06:20" wtd="06:20" />
    <PP tpl="FNSTNEJ" wtp="06:23:30" />
    <PP tpl="HYNDLEJ" wtp="06:28:30" />
    <OPIP tpl="ANSL" act="A N " plat="2" wta="06:30" wtd="06:30:30" />
    <PP tpl="MRYHILL" wtp="06:33" />
    <PP tpl="CWLRSNJ" wtp="06:48" />
    <PP tpl="CWLRSEJ" wtp="06:49" />
    <IP tpl="BSHB" act="T " plat="1" pta="06:52" ptd="06:54" wta="06:52" wtd="06:54" />
    <IP tpl="LENZIE" act="T " plat="1" pta="06:59" ptd="06:59" wta="06:58:30" wtd="06:59:30" />
    <IP tpl="CROY" act="T " plat="1" pta="07:06" ptd="07:06" wta="07:05:30" wtd="07:06:30" />
    <PP tpl="GNHLUJN" wtp="07:12:30" />
    <PP tpl="GNHLLJN" wtp="07:15" />
    <PP tpl="CRMRSWJ" wtp="07:17" />
    <PP tpl="LARBERJ" wtp="07:19:30" />
    <IP tpl="LARBERT" act="T " plat="2" pta="07:21" ptd="07:21" wta="07:20:30" wtd="07:21" />
    <IP tpl="STIRLNG" act="T " plat="6" pta="07:30" ptd="07:41" wta="07:29:30" wtd="07:41" />
    <IP tpl="BGOALAN" act="T " plat="2" pta="07:45" ptd="07:45" wta="07:45" wtd="07:45:30" />
    <DT tpl="DUNANE" act="TF" plat="DPV" pta="07:52" wta="07:52" />
  </Journey>
</PportTimetable>

Here is a working program that illustrates how to use .iterparse() from cElementTree, storing the results in a database. Note that this program is aware of the namespace used in the input XML.
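As background for the namespace handling, here is a minimal sketch of why every tag in the question's output starts with {http://www.website.com/ixid/xmlfile/v8}. ElementTree reports each element name as {namespace-uri}local-name whenever the document declares a default xmlns. The sketch assumes Python 3 and the plain xml.etree.ElementTree module; SAMPLE is a cut-down stand-in for the question's XML and local_name is a hypothetical helper, not part of the library.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the sample XML.
IXID_URI = 'http://www.website.com/ixid/xmlfile/v8'

# Assumption: a cut-down version of the sample document with two stops.
SAMPLE = (
    b'<PportTimetable xmlns="http://www.website.com/ixid/xmlfile/v8">'
    b'<Journey uid="G61365">'
    b'<OR tpl="PERTH" /><DT tpl="GLGC" />'
    b'</Journey>'
    b'</PportTimetable>'
)


def local_name(tag):
    """Strip a leading '{uri}' from a qualified tag, if there is one."""
    return tag.rsplit('}', 1)[-1]


# With no events argument, iterparse reports only 'end' events,
# so children are seen before the element that contains them.
tags = [elem.tag for _event, elem in ET.iterparse(io.BytesIO(SAMPLE))]
local_tags = [local_name(tag) for tag in tags]

# The qualified name to compare against when looking for <Journey>.
journey_tag = '{%s}Journey' % IXID_URI

print(tags[0])
print(local_tags)
print(journey_tag in tags)
```

Comparing elem.tag against the bare string 'Journey' would never match here, which is why the comparison has to use the qualified name.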

The i.xml is identical to the example XML given in the question.

# Tested on Python 2.7.6, Ubuntu 14.04.4
import xml.etree.cElementTree as et
import sqlite3

# Tools to deal with namespaces
ixid_uri = 'http://www.website.com/ixid/xmlfile/v8'
def extract_local_tag(qname):
    return qname.split('}')[-1]

# A db connection to illustrate the example
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table foo (journey_uid text, tag text, tpl text)")
conn.commit()

# The main part of the code: iterate over the XML,
# storing DB stuff at the end of every <Journey>
with open('i.xml') as xml_file:
    for event, elem in et.iterparse(xml_file):
        # Must compare tag to qualified name
        if elem.tag == et.QName(ixid_uri, 'Journey'):
            c.executemany('insert into foo values(?, ?, ?)',
                [
                    (elem.attrib['uid'],
                    extract_local_tag(child.tag),
                    child.attrib.get('tpl', None))
                    for child in elem
                ])
            conn.commit()
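            # Note: this only clears <Journey> elements and their children.
            # The emptied <Journey> elements stay attached to the root, and any
            # elements that are not children of <Journey> are never cleared,
            # so memory use still grows slowly over a very large file.
            elem.clear()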
for row in c.execute('select * from foo'):
    print row

Result:

(u'G61365', u'OR', u'PERTH')
(u'G61365', u'PP', u'HILTONJ')
...
(u'G61365', u'DT', u'GLGC')
(u'G64015', u'OR', u'GLGQLL')
(u'G64015', u'PP', u'FNSTNEJ')
...
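
Since the question mentions that the real file arrives gzipped, and the program above notes that some elements are never cleared, here is a hedged sketch of one way to handle both. It assumes Python 3 and the plain xml.etree.ElementTree module. It reads the compressed file directly through gzip.open and clears the root element after each <Journey>, a common idiom for keeping the tree small. The function name iter_journeys and the file name timetable.xml.gz are made up for illustration, and the tiny sample written to disk stands in for the real download.

```python
import gzip
import os
import shutil
import tempfile
import xml.etree.ElementTree as ET

IXID_URI = 'http://www.website.com/ixid/xmlfile/v8'
JOURNEY_TAG = '{%s}Journey' % IXID_URI


def iter_journeys(path):
    """Yield (uid, [(local_tag, tpl), ...]) for each <Journey> in a gzipped file."""
    with gzip.open(path, 'rb') as xml_file:
        context = ET.iterparse(xml_file, events=('start', 'end'))
        # The first 'start' event delivers the root element.
        _event, root = next(context)
        for event, elem in context:
            if event == 'end' and elem.tag == JOURNEY_TAG:
                stops = [(child.tag.rsplit('}', 1)[-1], child.attrib.get('tpl'))
                         for child in elem]
                yield elem.attrib['uid'], stops
                # Clearing the root detaches the finished <Journey>,
                # so processed journeys do not pile up in memory.
                root.clear()


# Assumption: a small stand-in for the real download, written to a temp dir.
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<PportTimetable xmlns="http://www.website.com/ixid/xmlfile/v8">'
    b'<Journey uid="G61365"><OR tpl="PERTH" /><DT tpl="GLGC" /></Journey>'
    b'<Journey uid="G64015"><OR tpl="GLGQLL" /><DT tpl="DUNANE" /></Journey>'
    b'</PportTimetable>'
)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'timetable.xml.gz')
with gzip.open(path, 'wb') as gz_file:
    gz_file.write(sample)

journeys = list(iter_journeys(path))
for uid, stops in journeys:
    print(uid, stops)

shutil.rmtree(tmp_dir)
```

Because iter_journeys is a generator, each journey can be written to the database as it is produced, without decompressing the file to disk first.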
