
Parsing an XML file using parseString from xml.dom.minidom has poor efficiency?

I am trying to parse an XML file with Python 2.7. The file is 370+ MB and contains 6,541,000 lines.

The XML file is composed of roughly 300,000 blocks like the following:

<Tag:Member>
    <fileID id = '123456789'>
    <miscTag> 123 </miscTag>
    <miscTag2> 456 </miscTag2>
    <DateTag> 2008-02-02 </DateTag>
    <Tag2:descriptiveTerm>Keyword_1</Tag2:descriptiveTerm>
    <miscTag3>6.330016</miscTag3>
    <historyTag>
        <DateTag>2001-04-16</DateTag>
        <reasonTag>Refresh</reasonTag>
    </historyTag>
    <Tag3:make>Keyword_2</Tag3:make>
    <miscTag4>
        <miscTag5>
            <Tag4:coordinates>6.090,6.000 5.490,4.300 6.090,6.000 </Tag4:coordinates>
        </miscTag5>
    </miscTag4>
</Tag:Member>

I used the following code:

from xml.dom.minidom import parseString

def XMLParser(filePath):    
    """ ===== Load XML File into Memory ===== """
    datafile = open(filePath)
    data = datafile.read()
    datafile.close()
    dom = parseString(data)    

    length = len(dom.getElementsByTagName("Tag:Member"))

    counter = 0
    while counter < length:
        """ ===== Extract Descriptive Term ===== """
        contentString = dom.getElementsByTagName("Tag2:descriptiveTerm")[counter].toxml()

        laterpart = contentString.split("<Tag2:descriptiveTerm>", 1)[1]

        descriptiveTerm = laterpart.split("</Tag2:descriptiveTerm>", 1)[0]

        if descriptiveTerm == "Keyword_1":
            """ ===== Extract Make ===== """
            contentString = dom.getElementsByTagName("Tag3:make")[counter].toxml()

            laterpart = contentString.split("<Tag3:make>", 1)[1]

            make = laterpart.split("</Tag3:make>", 1)[0]

            if descriptiveTerm == "Keyword_1" and make == "Keyword_2":
                """ ===== Extract ID ===== """        
                contentString = dom.getElementsByTagName("Tag:Member")[counter].toxml()

                laterpart = contentString.split("id=\"", 1)[1]

                laterpart = laterpart.split("Tag", 1)[1]

                IDString = laterpart.split("\">", 1)[0]

                """ ===== Extract Coordinates ===== """
                contentString = dom.getElementsByTagName("Tag:Member")[counter].toxml()

                laterpart = contentString.split("coordinates>", 1)[1]

                coordString = laterpart.split(" </Tag4:coordinates>", 1)[0]            


        counter += 1

So I ran this and found that it takes about 27 GB of memory, and parsing each of the above blocks takes more than 20 seconds. At that rate it would take two months to parse the whole file!

I guess I've written some inefficient code. Can anyone help me improve it?

Many thanks.

For a file of this size, the correct approach is a streaming parser (SAX-style, not DOM-style, so minidom is entirely inappropriate). See this answer for notes on using lxml.iterparse (a modern streaming parser that uses libxml2, a fast and efficient XML-parsing library written in C, as its backend) in a memory-efficient way, or the article on which that answer is based.
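As a rough illustration, here is a minimal sketch of that approach. The `stream_members` name and the namespace URIs are placeholders of mine: your real file must bind the Tag/Tag2/Tag3/Tag4 prefixes to actual URIs, and you would substitute those below.

    from lxml import etree

    # Placeholder namespace URIs -- replace with the ones declared in your file.
    TAG  = '{http://example.com/tag}'
    TAG2 = '{http://example.com/tag2}'
    TAG3 = '{http://example.com/tag3}'
    TAG4 = '{http://example.com/tag4}'

    def stream_members(file_path):
        # Build one Tag:Member subtree at a time, never the whole document.
        context = etree.iterparse(file_path, events=('end',), tag=TAG + 'Member')
        for event, member in context:
            descriptive = member.findtext(TAG2 + 'descriptiveTerm')
            make = member.findtext(TAG3 + 'make')
            if descriptive == 'Keyword_1' and make == 'Keyword_2':
                # fileID has no prefix in your sample, so no namespace here.
                file_id = member.find('fileID').get('id')
                coords = member.findtext('.//' + TAG4 + 'coordinates')
                print file_id, coords
            # Free this element and any already-processed siblings so memory
            # stays flat instead of growing with the file.
            member.clear()
            while member.getprevious() is not None:
                del member.getparent()[0]
        del context

The clear()/del bookkeeping after each member is what keeps memory bounded at roughly one block at a time; without it, iterparse still ends up holding the whole tree.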

In general: as you see elements associated with a member, build that member up in memory, and when you see the event for the end of the member tag, emit or process the built-up content and start a fresh one.
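If you want to stay in the standard library instead, the same build-up/emit pattern looks roughly like this with xml.sax. This is an untested sketch: the tag names are taken from your example block, the file name is a placeholder, and the default (non-namespace-aware) parser reports element names with their prefixes intact.

    import xml.sax

    class MemberHandler(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.member = None   # fields of the member being built up
            self.field = None    # tag whose text we are currently collecting
            self.buf = []

        def startElement(self, name, attrs):
            if name == 'Tag:Member':
                self.member = {}                     # start a fresh member
            elif self.member is not None:
                if name == 'fileID':
                    self.member['id'] = attrs.get('id')
                elif name in ('Tag2:descriptiveTerm', 'Tag3:make',
                              'Tag4:coordinates'):
                    self.field, self.buf = name, []

        def characters(self, content):
            if self.field is not None:
                self.buf.append(content)

        def endElement(self, name):
            if name == self.field:
                self.member[name] = ''.join(self.buf).strip()
                self.field = None
            elif name == 'Tag:Member':
                # End of the block: emit/process it, then discard.
                m, self.member = self.member, None
                if (m.get('Tag2:descriptiveTerm') == 'Keyword_1'
                        and m.get('Tag3:make') == 'Keyword_2'):
                    print m.get('id'), m.get('Tag4:coordinates')

    xml.sax.parse('data.xml', MemberHandler())  # placeholder file name

Either way, memory use is bounded by a single member block rather than by the whole 370 MB document.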
