
Parsing of a huge xml file with `pythons etree.iterparse()` does not work right. Is there a logic error in the code?

I want to parse a huge XML file. The records in this huge file look, for example, like this. And in general the file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
    record_1
    ...
    record_n
</dblp>

I wrote some code that shall get me a selection of records from this file.

If I let the code run (it takes nearly 50 minutes, including storage in the MySQL database), I notice that there is a record which seems to have nearly a million authors. This must be wrong. I even checked up on it by looking into the file to make sure that the file has no errors in it. The paper has only 5 or 6 authors, so all is fine with dblp.xml. So I assume there is a logic error in my code, but I can't figure out where it could be. Perhaps someone can tell me where the error is?

The code stops at the line if len(auth) > 2000.

import sys
import MySQLdb
from lxml import etree


elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
tags = ["author", "title", "booktitle", "year", "journal"]


def fast_iter(context, cursor):
    mydict = {} # represents a paper with all its tags.
    auth = [] # a list of authors who have written the paper "together".
    counter = 0 # counts the papers

    for event, elem in context:
        if elem.tag in elements and event == "start":
            mydict["element"] = elem.tag
            mydict["mdate"] = elem.get("mdate")
            mydict["key"] = elem.get("key")

        elif elem.tag == "title" and elem.text != None:
            mydict["title"] = elem.text
        elif elem.tag == "booktitle" and elem.text != None:
            mydict["booktitle"] = elem.text
        elif elem.tag == "year" and elem.text != None:
            mydict["year"] = elem.text
        elif elem.tag == "journal" and elem.text != None:
            mydict["journal"] = elem.text
        elif elem.tag == "author" and elem.text != None:
            auth.append(elem.text)
        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            if mydict or auth:
                sys.exit("Program aborted because auth or mydict was not deleted properly!")
        if len(auth) > 200: # There are up to ~150 authors per paper.
            sys.exit("auth: It seems there is a paper which has too many authors!")
        if len(mydict) > 50: # A paper can have much metadata.
            sys.exit("mydict: It seems there is a paper which has too many tags.")

        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def main():
        cursor = connectToDatabase()
        cursor.execute("""SET NAMES utf8""")

        context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("start", "end"))
        fast_iter(context, cursor)

        cursor.close()


if __name__ == '__main__':
    main()

EDIT:

I was totally misguided when I wrote this function. I made a huge mistake by overlooking that, while trying to skip some unwanted records, they got mixed up with some wanted records. And at a certain point in the file, where I had skipped nearly a million records in a row, the following wanted record got blown up.

With the help of John and Paul I managed to rewrite my code. It is parsing right now, and seems to do it well. I'll report back if some unexpected errors remain unsolved. Otherwise, thank you all for your help! I really appreciated it!

def fast_iter2(context, cursor):
    elements = set([
        'article', 'inproceedings', 'proceedings', 'book', 'incollection',
        'phdthesis', "mastersthesis", "www"
        ])
    childElements = set(["title", "booktitle", "year", "journal", "ee"])

    paper = {} # represents a paper with all its tags.
    authors = []   # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
                pass
            else:
                populate_database(paper, authors, cursor)
            paperCounter += 1
            print paperCounter
            paper = {}
            authors = []
            # if paperCounter == 100:
            #     break
            element.clear()
            while element.getprevious() is not None:
                del element.getparent()[0]
    del context

Add print statements in the blocks of code where you detect the start and end of a tag in elements, to make sure you are detecting these properly. I suspect that for some reason you aren't getting to the code that clears the authors list.
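For example, here is a minimal, self-contained sketch of the kind of diagnostics meant, using hypothetical in-memory sample data rather than the real dblp.xml; the exact messages and tag set are only illustrative:

from io import BytesIO
from lxml import etree

sample = BytesIO(b"""<dblp>
  <article mdate="2011-01-11" key="journals/x/A1">
    <author>A. One</author><author>B. Two</author><title>T1</title>
  </article>
  <article mdate="2011-01-12" key="journals/x/A2">
    <author>C. Three</author><title>T2</title>
  </article>
</dblp>""")

elements = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection'])
auth = []
for event, elem in etree.iterparse(sample, events=("start", "end")):
    if event == "start" and elem.tag in elements:
        print "START", elem.tag, elem.get("key")        # should fire exactly once per record
    elif event == "end" and elem.tag == "author" and elem.text:
        auth.append(elem.text)
    elif event == "end" and elem.tag in elements:
        print "END  ", elem.tag, "authors:", len(auth)  # should also fire exactly once per record
        auth = []                                       # reset here, as in the question's code

If each record prints exactly one START and one END line with a plausible author count, the detection itself is fine and the problem lies elsewhere.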

Try commenting out this code (or at least, move it into the "end" handling block):

    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

Python should take care of clearing these elements for you as you traverse the XML. The "del context" is also superfluous. Let the reference counters do the work for you here.

Please eliminate one source of confusion: you haven't actually said that the code you showed does actually trip over on one of your "count of things > 2000" tests. If not, then the problem lies in the database update code (which you haven't shown us).

If it does trip over:

(1) Reduce the limits from 2000 to reasonable values (about 20 for auth and exactly 7 for mydict).

(2) When the trip happens, print repr(mydict); print; print repr(auth) and analyse the contents in comparison with your file; a sketch of wiring this in is shown below.
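One possible way to combine points (1) and (2) is a small helper, to be called inside fast_iter() in place of the bare sys.exit() guards; the helper name and the exact limits are illustrative, not prescribed:

import sys

def dump_and_abort(mydict, auth, auth_limit=20, dict_limit=7):
    # Hypothetical helper: print the collected data before aborting,
    # so the offending record can be compared against the file.
    if len(auth) > auth_limit or len(mydict) > dict_limit:
        print repr(mydict)
        print
        print repr(auth)
        sys.exit("Tripped: a record collected far more data than expected.")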

Aside: with iterparse(), elem.text is NOT guaranteed to have been parsed when the "start" event happens. To save some running time, you should access elem.text only when the "end" event happens. In fact, there seems to be no reason why you want "start" events at all. Also, you define a list tags but never use it. The start of your function could be written much more concisely:

def fast_iter(context, cursor):
    mydict = {} # represents a paper with all its tags.
    auth = [] # a list of authors who have written the paper "together".
    counter = 0 # counts the papers
    tagset1 = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection'])
    tagset2 = set(["title", "booktitle", "year", "journal"])
    for event, elem in context:
        tag = elem.tag
        if tag in tagset2:
            if elem.text:
                mydict[tag] = elem.text
        elif tag == "author":
            if elem.text:
                auth.append(elem.text)
        elif tag in tagset1:
            counter += 1
            print counter
            mydict["element"] = tag
            mydict["mdate"] = elem.get("mdate")
            mydict["dblpkey"] = elem.get("key")
            #populate_database(mydict, auth, cursor)
            mydict.clear() # Why not just do mydict = {} ??
            auth = []
            # etc etc

Don't forget to fix the call to iterparse() to remove the events arg.
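For instance, a minimal sketch of the adjusted call in main(), reusing PATH_TO_XML and the dtd_validation setting from the question; with no events argument, lxml's iterparse() reports only "end" events, which is all the shortened loop needs:

context = etree.iterparse(PATH_TO_XML, dtd_validation=True)
fast_iter(context, cursor)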

Also, I'm reasonably certain that the elem.clear() should be done only when the event is "end", and needs to be done only when tag in tagset1. Read the relevant docs carefully. Doing the cleanup in a "start" event could very well be damaging your tree.
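To illustrate, a sketch of where that cleanup could sit in the shortened fast_iter above, once only "end" events are delivered (this mirrors what the rewritten fast_iter2 in the edit already does):

        elif tag in tagset1:
            counter += 1
            print counter
            mydict["element"] = tag
            mydict["mdate"] = elem.get("mdate")
            mydict["dblpkey"] = elem.get("key")
            #populate_database(mydict, auth, cursor)
            mydict = {}
            auth = []
            # Clean up only here, once a complete record element has ended:
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]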
