
Parsing a huge XML file with Python's `etree.iterparse()` does not work correctly. Is there a logic error in the code?

I want to parse a huge XML file. An illustrative record is sketched below; in general, the file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
    record_1
    ...
    record_n
</dblp>
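
For illustration only, a single record might look roughly like this (a hypothetical example constructed from the tags and attributes the code below reads; real dblp records can differ in detail):

<article mdate="2011-01-11" key="journals/example/Doe11">
    <author>John Doe</author>
    <author>Jane Roe</author>
    <title>An Example Title</title>
    <journal>Example Journal</journal>
    <year>2011</year>
</article>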

I wrote some code that should extract a selection of records from this file.

If I let the code run (it takes nearly 50 minutes, including storage in the MySQL database), I notice that there is a record which seems to have nearly a million authors. This must be wrong. I even checked it by looking into the file to make sure it contains no errors. The paper has only 5 or 6 authors, so everything is fine with dblp.xml. So I assume there is a logic error in my code, but I can't figure out where it could be. Perhaps someone can tell me where the error is?

The code stops at the line if len(auth) > 2000 .

import sys
import MySQLdb
from lxml import etree


elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
tags = ["author", "title", "booktitle", "year", "journal"]


def fast_iter(context, cursor):
    mydict = {} # represents a paper with all its tags.
    auth = [] # a list of authors who have written the paper "together".
    counter = 0 # counts the papers

    for event, elem in context:
        if elem.tag in elements and event == "start":
            mydict["element"] = elem.tag
            mydict["mdate"] = elem.get("mdate")
            mydict["key"] = elem.get("key")

        elif elem.tag == "title" and elem.text != None:
            mydict["title"] = elem.text
        elif elem.tag == "booktitle" and elem.text != None:
            mydict["booktitle"] = elem.text
        elif elem.tag == "year" and elem.text != None:
            mydict["year"] = elem.text
        elif elem.tag == "journal" and elem.text != None:
            mydict["journal"] = elem.text
        elif elem.tag == "author" and elem.text != None:
            auth.append(elem.text)
        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            if mydict or auth:
                sys.exit("Program aborted because auth or mydict was not deleted properly!")
        if len(auth) > 200: # There are up to ~150 authors per paper. 
            sys.exit("auth: It seams there is a paper which has too many authors.!")
        if len(mydict) > 50: # A paper can have much metadata.
            sys.exit("mydict: It seams there is a paper which has too many tags.")

        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def main():
        cursor = connectToDatabase()
        cursor.execute("""SET NAMES utf8""")

        context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("start", "end"))
        fast_iter(context, cursor)

        cursor.close()


if __name__ == '__main__':
    main()

EDIT:

I was totally misguided when I wrote this function. I made a huge mistake by overlooking that, while trying to skip some unwanted records, I messed up some wanted records as well. And at a certain point in the file, where I skipped nearly a million records in a row, the following wanted record got blown up.

With the help of John and Paul I managed to rewrite my code. It is parsing right now and seems to do it well. I'll report back if any unexpected errors remain unsolved. Otherwise, thank you all for your help! I really appreciate it!

def fast_iter2(context, cursor):
    elements = set([
        'article', 'inproceedings', 'proceedings', 'book', 'incollection',
        'phdthesis', "mastersthesis", "www"
        ])
    childElements = set(["title", "booktitle", "year", "journal", "ee"])

    paper = {} # represents a paper with all its tags.
    authors = []   # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
                pass
            else:
                populate_database(paper, authors, cursor)
            paperCounter += 1
            print paperCounter
            paper = {}
            authors = []
            # if paperCounter == 100:
            #     break
            element.clear()
            while element.getprevious() is not None:
                del element.getparent()[0]
    del context

Add print statements in the blocks of code where you detect start and stop of a tag in elements, to make sure you are detecting these properly. I suspect that for some reason you aren't getting to the code that clears the authors list.
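
For instance, a minimal sketch of such diagnostics, using the question's variable names, might look like this:

    if elem.tag in elements and event == "start":
        print "START", elem.tag, elem.get("key")          # confirm every record start is seen
        # ... existing start handling ...
    elif event == "end" and elem.tag in elements:
        print "END", elem.tag, "authors collected:", len(auth)  # confirm the clearing branch runs
        # ... existing end handling ...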

Try commenting out this code (or at least, move it into the "end" handling block):

    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

Python should take care of clearing these elements for you as you traverse the XML. The "del context" is also superfluous. Let the reference counters do the work for you here.
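
A minimal sketch of that relocation, keeping the question's variable names, could look like this:

        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            # only now, after the whole record has been processed, release the element
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]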

Please eliminate one source of confusion: you haven't actually said whether the code that you showed does in fact trip over one of your "count of things > 2000" tests. If not, then the problem lies in the database update code (which you haven't shown us).

If it does so trip over:

(1) Reduce the limits from 2000 to reasonable values (about 20 for auth and exactly 7 for mydict)

(2) When the trip happens, print repr(mydict); print; print repr(auth) and analyse the contents in comparison with your file (a sketch follows below).
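
A sketch of those two steps dropped into the question's code (the limits 20 and 7 are the values suggested in point (1); variable names are from the question):

        if len(auth) > 20:     # debugging limit from (1); raise it again once the bug is found
            print repr(mydict)
            print
            print repr(auth)
            sys.exit("auth: too many authors collected for one record")
        if len(mydict) > 7:    # element, mdate, key plus up to four text tags = at most 7 entries
            print repr(mydict)
            print
            print repr(auth)
            sys.exit("mydict: too many tags collected for one record")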

Aside: with iterparse(), elem.text is NOT guaranteed to have been parsed when the "start" event happens. To save some running time, you should access elem.text only when the "end" event happens. In fact, there seems to be no reason why you want "start" events at all. Also you define a list tags but never use it. The start of your function could be written much more concisely:

def fast_iter(context, cursor):
    mydict = {} # represents a paper with all its tags.
    auth = [] # a list of authors who have written the paper "together".
    counter = 0 # counts the papers
    tagset1 = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection'])
    tagset2 = set(["title", "booktitle", "year", "journal"])
    for event, elem in context:
        tag = elem.tag
        if tag in tagset2:
            if elem.text:
                mydict[tag] = elem.text
        elif tag == "author":
            if elem.text:
                auth.append(elem.text)
        elif tag in tagset1:
            counter += 1
            print counter
            mydict["element"] = tag
            mydict["mdate"] = elem.get("mdate")
            mydict["dblpkey"] = elem.get("key")
            #populate_database(mydict, auth, cursor)
            mydict.clear() # Why not just do mydict = {} ??
            auth = []
            # etc etc

Don't forget to fix the call to iterparse() to remove the events arg.
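
For example, the call in main() would then read (same arguments as before, minus the events keyword):

    context = etree.iterparse(PATH_TO_XML, dtd_validation=True)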

Also I'm reasonably certain that the elem.clear() should be done only when event is "end", and needs to be done only when tag in tagset1. Read the relevant docs carefully. Doing the cleanup in a "start" event could very well be damaging your tree.
