
iterparse large XML using Python

This has been driving me nuts all day, and I would appreciate a bit of help with parsing a large XML file.

The file contains over 900,000 lines and is downloaded in gzip format. I did have something working, using an extract of the data for testing and parsing it with minidom, but that's just not going to cut it for the full file. So I'm looking at iterparse, but I just can't get any of the examples to work, even to the point where I'm getting "unable to import" errors. The only import I can get to work is import xml.etree.cElementTree, but that barely seems to work with most of the code examples I have found.

I did get one thing close to working with iterparse and cElementTree:

def buildit(file):
        print file
        #with open(file) as line:
        #print line
        for  event, elem in et.iterparse(file):
                with open(file, "r") as line:
                        for event, elem in et.iterparse(file):
                                print elem.tag
                                if event =='end' and elem.tag=='Journey':
                                        print elem.tag
                                        time.sleep(0.5)
                                        elm.clear

but this prints out the following:

{http://www.website.com/ixid/xmlfile/v8}Journey
{http://www.website.com/ixid/xmlfile/v8}OR
{http://www.website.com/ixid/xmlfile/v8}PP
{http://www.website.com/ixid/xmlfile/v8}IP
{http://www.website.com/ixid/xmlfile/v8}PP
{http://www.website.com/ixid/xmlfile/v8}IP

Notice how it's putting something from the top element into each item? Anyway, sample XML is below, and thanks in advance for any help.

<?xml version="1.0" encoding="utf-8"?>
<PportTimetable xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" timetableID="20160421020832" xmlns="http://www.website.com/ixid/xmlfile/v8">
  <Journey rid="201604211191598" uid="G61365" trainId="1T02" ssd="2016-04-21" toc="SR" trainCat="XX">
    <OR tpl="PERTH" act="TBK " plat="3" ptd="05:18" wtd="05:18" />
    <PP tpl="HILTONJ" wtp="05:22" />
    <IP tpl="GLNEGLS" act="T " plat="1" pta="05:33" ptd="05:33" wta="05:32:30" wtd="05:33:30" />
    <PP tpl="BLFD" wtp="05:37:30" />
    <IP tpl="DUNANE" act="T " plat="1" pta="05:45" ptd="05:46" wta="05:45" wtd="05:46" />
    <IP tpl="BGOALAN" act="T " plat="1" pta="05:49" ptd="05:49" wta="05:49" wtd="05:49:30" />
    <IP tpl="STIRLNG" act="T K " plat="3" pta="05:53" ptd="05:54" wta="05:53" wtd="05:54" />
    <IP tpl="LARBERT" act="T " plat="1" pta="06:03" ptd="06:03" wta="06:02:30" wtd="06:03" />
    <PP tpl="LARBERJ" wtp="06:04:30" />
    <PP tpl="CRMRSWJ" wtp="06:05" />
    <PP tpl="GNHLLJN" wtp="06:09" />
    <OPIP tpl="CMBRNLD" act="C N " plat="1" wta="06:22" wtd="06:24" />
    <PP tpl="GRNQNNJ" wtp="06:30" />
    <PP tpl="GSHRSJN" wtp="06:33" />
    <PP tpl="COATBDC" wtp="06:36:30" />
    <PP tpl="LGLNJN" wtp="06:38" />
    <PP tpl="CARMYLE" plat="1" wtp="06:49" />
    <PP tpl="RTHGNEJ" wtp="06:53:30" />
    <PP tpl="SHFD" wtp="06:56" />
    <PP tpl="LRKFLDJ" wtp="06:59" />
    <PP tpl="EGLNSTJ" wtp="07:01:30" />
    <PP tpl="GLGCBSJ" wtp="07:02:30" />
    <DT tpl="GLGC" act="TF" pta="07:05" wta="07:05" />
  </Journey>
  <Journey rid="201604211192476" uid="G64015" trainId="2N41" ssd="2016-04-21" toc="SR">
    <OR tpl="GLGQLL" act="TB" plat="8" ptd="06:20" wtd="06:20" />
    <PP tpl="FNSTNEJ" wtp="06:23:30" />
    <PP tpl="HYNDLEJ" wtp="06:28:30" />
    <OPIP tpl="ANSL" act="A N " plat="2" wta="06:30" wtd="06:30:30" />
    <PP tpl="MRYHILL" wtp="06:33" />
    <PP tpl="CWLRSNJ" wtp="06:48" />
    <PP tpl="CWLRSEJ" wtp="06:49" />
    <IP tpl="BSHB" act="T " plat="1" pta="06:52" ptd="06:54" wta="06:52" wtd="06:54" />
    <IP tpl="LENZIE" act="T " plat="1" pta="06:59" ptd="06:59" wta="06:58:30" wtd="06:59:30" />
    <IP tpl="CROY" act="T " plat="1" pta="07:06" ptd="07:06" wta="07:05:30" wtd="07:06:30" />
    <PP tpl="GNHLUJN" wtp="07:12:30" />
    <PP tpl="GNHLLJN" wtp="07:15" />
    <PP tpl="CRMRSWJ" wtp="07:17" />
    <PP tpl="LARBERJ" wtp="07:19:30" />
    <IP tpl="LARBERT" act="T " plat="2" pta="07:21" ptd="07:21" wta="07:20:30" wtd="07:21" />
    <IP tpl="STIRLNG" act="T " plat="6" pta="07:30" ptd="07:41" wta="07:29:30" wtd="07:41" />
    <IP tpl="BGOALAN" act="T " plat="2" pta="07:45" ptd="07:45" wta="07:45" wtd="07:45:30" />
    <DT tpl="DUNANE" act="TF" plat="DPV" pta="07:52" wta="07:52" />
  </Journey>
</PportTimetable>

Here is a working program that illustrates how to use .iterparse() from cElementTree, storing the results in a database. Note that this program is aware of the namespace used in the input XML.

The i.xml is identical to the example XML given in the question.

# Tested on Python 2.6.7, Ubuntu 14.04.4
import xml.etree.cElementTree as et
import sqlite3

# Tools to deal with namespaces
ixid_uri = 'http://www.website.com/ixid/xmlfile/v8'
def extract_local_tag(qname):
    return qname.split('}')[-1]

# A db connection to illustrate the example
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table foo (journey_uid text, tag text, tpl text)")
conn.commit()

# The main part of the code: iterate over the XML,
# storing DB stuff at the end of every <Journey>
with open('i.xml') as xml_file:
    for event, elem in et.iterparse(xml_file):
        # Must compare tag to qualified name
        if elem.tag == et.QName(ixid_uri, 'Journey'):
            c.executemany('insert into foo values(?, ?, ?)',
                [
                    (elem.attrib['uid'],
                    extract_local_tag(child.tag),
                    child.attrib.get('tpl', None))
                    for child in elem
                ])
            conn.commit()
            # Note: only clears <Journey> elements and their children.
            # There is a memory leak of any elements not children of <Journey>
            elem.clear()    
for row in c.execute('select * from foo'):
    print row

Result:

(u'G61365', u'OR', u'PERTH')
(u'G61365', u'PP', u'HILTONJ')
...
(u'G61365', u'DT', u'GLGC')
(u'G64015', u'OR', u'GLGQLL')
(u'G64015', u'PP', u'FNSTNEJ')
...
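
The program above targets Python 2. On current Python 3 the xml.etree.cElementTree module no longer exists (it was removed in Python 3.9), and the question mentions that the real file arrives gzip-compressed. The following is a minimal sketch, not part of the tested program above, of the same idea adapted to those two points. It uses xml.etree.ElementTree, can read the .gz file directly through gzip.open, and clears the root element after each <Journey> so that finished elements do not pile up under it. The names local_name, iter_journeys and iter_journeys_gz are hypothetical helpers introduced only for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

IXID_URI = 'http://www.website.com/ixid/xmlfile/v8'
# ElementTree reports namespaced tags as '{uri}localname'.
JOURNEY_TAG = '{%s}Journey' % IXID_URI


def local_name(tag):
    """Strip the '{namespace}' prefix from a qualified tag name."""
    return tag.rsplit('}', 1)[-1]


def iter_journeys(source):
    """Yield (uid, [(child_tag, tpl), ...]) for each <Journey>.

    `source` is a binary file object (hypothetical helper).
    """
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the 'start' of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == JOURNEY_TAG:
            stops = [(local_name(child.tag), child.get('tpl'))
                     for child in elem]
            yield elem.get('uid'), stops
            # Detach everything parsed so far from the root, so that
            # processed <Journey> elements can be garbage collected.
            root.clear()


def iter_journeys_gz(path):
    """Same as iter_journeys, reading a gzip-compressed file (assumed helper)."""
    with gzip.open(path, 'rb') as fh:
        yield from iter_journeys(fh)
```

Each yielded pair can then be passed to c.executemany() in the same way as in the program above. Clearing the root rather than only the <Journey> element is one way to address the memory note in the original code comments; whether it matters in practice depends on how many elements sit outside <Journey> in the real file.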

