
Parse multiple large XML files and write to CSV

I have around 1000 XML files, each about 250 MB in size. I need to extract some data from them and write it to CSV. There cannot be any duplicate entries.

I have a system with 4 GB RAM and an AMD A8 processor.

I have already gone through some previous posts here, but they don't seem to answer my problem.

I have already written the code in Python and tested it on a sample XML file, where it worked well.

However, it was very slow (almost 15 minutes per file) when I ran it on all the files, and I had to terminate the process midway.

What would be an optimal way to speed up the process?

Here's the code:

import glob
import xml.etree.ElementTree as ET

path = 'data/*.xml'
t = []
for fname in glob.glob(path):
    print('Parsing ', fname)
    tree = ET.parse(fname)
    root = tree.getroot()
    # relative search from the root element
    x = root.findall('.//Article/AuthorList//Author')
    for child in x:
        try:
            lastName = child.find('LastName').text
        except AttributeError:
            lastName = ''
        try:
            foreName = child.find('ForeName').text
        except AttributeError:
            foreName = ''
        t.append((lastName, foreName))
    print('Parsed ', fname)

t = set(t)

I want the fastest method to get the entries without any duplicate values. (Maybe storing them in some DB instead of the variable t? Would writing each entry to a DB speed things up because of the extra free RAM? Whatever the method, I need some direction towards it.)

Instead of writing the results to a Python list, create a database table with a UNIQUE constraint, and write all the results to that table. Once all the writing has been done, dump the DB table as a CSV.

If you don't want to have any additional dependencies for writing to the DB, I suggest you use sqlite3, as it comes right out of the box with any recent Python installation.

Here's some code to get started:

import glob
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect('large_xml.db')  # db will be created if it doesn't exist
cur = conn.cursor()
crt = "CREATE TABLE foo(fname VARCHAR(20), lname VARCHAR(20), UNIQUE(fname, lname))"
cur.execute(crt)
conn.commit()

path = 'data/*.xml'
for fname in glob.glob(path):
    print('Parsing ', fname)
    tree = ET.parse(fname)
    root = tree.getroot()
    # relative search from the root element
    x = root.findall('.//Article/AuthorList//Author')
    count = 0
    for child in x:
        try:
            lastName = child.find('LastName').text
        except AttributeError:
            lastName = ''
        try:
            foreName = child.find('ForeName').text
        except AttributeError:
            foreName = ''
        # duplicates are silently skipped thanks to the UNIQUE constraint
        cur.execute("INSERT OR IGNORE INTO foo(fname, lname) VALUES(?, ?)", (foreName, lastName))
        count += 1
        if count > 3000:  # commit every 3000 entries, you can tune this
            count = 0
            conn.commit()

    conn.commit()  # flush whatever is left from the last batch
    print('Parsed ', fname)

After the database is populated, dump it to CSV as follows:

sqlite3 -header -csv large_xml.db "select * from foo;" > dump.csv
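If you would rather stay inside Python (or don't have the sqlite3 command-line tool installed), a minimal sketch of the same dump using the standard csv module could look like this, reusing the table and file names from the example above:

import csv
import sqlite3

conn = sqlite3.connect('large_xml.db')
cur = conn.cursor()

with open('dump.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['fname', 'lname'])  # header row
    for row in cur.execute("SELECT fname, lname FROM foo"):
        writer.writerow(row)  # one author per line

conn.close()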

Also, experiment with faster parsing methods. Furthermore, if the .text attribute is available most of the time, the following will probably be faster than exception handling:

lastName = getattr(child.find('LastName'), 'text', '')
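On the "faster parsing" point: since each file is around 250 MB, building the whole tree with ET.parse is itself slow and memory-hungry on a 4 GB machine. One option worth trying is incremental parsing with ElementTree's iterparse. The sketch below assumes the same Article/AuthorList/Author structure and the foo table created above; note that it matches any Author element by tag name rather than by full path:

import glob
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect('large_xml.db')
cur = conn.cursor()

for fname in glob.glob('data/*.xml'):
    # only 'end' events are needed: the element is fully built at that point
    for event, elem in ET.iterparse(fname, events=('end',)):
        if elem.tag == 'Author':
            lastName = getattr(elem.find('LastName'), 'text', '') or ''
            foreName = getattr(elem.find('ForeName'), 'text', '') or ''
            cur.execute("INSERT OR IGNORE INTO foo(fname, lname) VALUES(?, ?)",
                        (foreName, lastName))
        elif elem.tag == 'Article':
            elem.clear()  # drop the processed subtree to keep memory low
    conn.commit()

If parsing is still the bottleneck, lxml's etree.iterparse has essentially the same interface and is usually faster, at the cost of an extra dependency.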
