简体   繁体   English

是否有更有效的方法将 XML 解析为具有 python 的数据库?

[英]Is there a more efficient way to parse XML to a database with python?

I am using the ElementTree and pyodbc libraries to parse a rather large (400MB) XML file with python.我正在使用 ElementTree 和 pyodbc 库来解析带有 python 的相当大的 (400MB) XML 文件。 Currently the code works and the correct information is sent to the correct SQL Server tables.目前代码有效,并且正确的信息被发送到正确的 SQL 服务器表。 The problem is that it's very slow.问题是它非常慢。 I expect parsing of any file this size to be relatively slow, but its about 100,000 rows an hour.我希望解析这种大小的任何文件都相对较慢,但每小时大约 100,000 行。

Do I a.) need to suck up the runtime and make use of strong dedicated VMs to process the file.我是否 a.) 需要占用运行时并利用强大的专用 VM 来处理文件。 or b.) need to make the code more efficient或 b.) 需要使代码更高效

import xml.etree.ElementTree as ET
import pyodbc
import time

source_file_artist = 'discogs_20120101_artists.xml'


conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=my server;'
                      'Database=DiscogsDatabase;'
                      'Trusted_Connection=yes;')

def main():
    cursor = conn.cursor()
    with open(source_file_artist, 'r') as sf:
        xml_iter = ET.iterparse(source_file_artist, events=('start', 'end'))
        nameCheck = 0
        key = ''
        for event, elem in xml_iter:
            if event =='start':
                if elem.tag =='namevariations':
                    nameCheck = 2
                elif elem.tag == 'aliases':
                    nameCheck = 3 
                elif elem.tag =='members':
                    nameCheck = 4

            elif event == 'end':
                if elem.tag == 'id':
                    elemtext = '%s' % elem.text
                    #print(elemtext)
                    cursor.execute('INSERT INTO [DiscogsDatabase].[dbo].[ARTIST](ArtistID) VALUES(?)', int(elemtext))
                    cursor.commit()
                    key = elemtext
                    nameCheck = 1

                elif elem.tag == 'name':
                    if nameCheck == 1:      
                        elemtext = '%s' % elem.text
                        #print(elemtext)
                        cursor.execute('UPDATE [DiscogsDatabase].[dbo].[ARTIST] SET ArtistName = ? WHERE ArtistID = ?', elemtext, key)
                        cursor.commit()
                        nameCheck = 0

                    elif nameCheck == 2:
                        elemtext =  '%s' % elem.text
                        #print(elemtext)
                        cursor.execute('INSERT INTO [DiscogsDatabase].[dbo].[NAMEVARIATION](ArtistID, VariationName) VALUES(?, ?)', key, elemtext)
                    
                    elif nameCheck == 3:
                        elemtext = '%s' % elem.text
                        #print(elemtext)
                        cursor.execute('INSERT INTO [DiscogsDatabase].[dbo].[ALIAS](ArtistID, AliasName) VALUES(?, ?)', key, elem.text)
                        cursor.commit()

                    elif nameCheck == 4:
                        elemtext = '%s' % elem.text
                        #print(elemtext)
                        cursor.execute('INSERT INTO [DiscogsDatabase].[dbo].[MEMBERS](ArtistID, MemberName) VALUES(?, ?)', key, elemtext)
                        cursor.commit()

                elif elem.tag == 'realname':
                    elemtext = '%s' % elem.text
                    #print(elemtext)
                    cursor.execute('UPDATE [DiscogsDatabase].[dbo].[ARTIST] SET ArtistRealName = ? WHERE ArtistID = ?', elemtext, key)
                    cursor.commit()
                
                elif elem.tag == 'profile':
                    elemtext = '%s' % elem.text
                    #print(elemtext)
                    cursor.execute('UPDATE [DiscogsDatabase].[dbo].[ARTIST] SET Profile = ? WHERE ArtistID = ?', elemtext.strip(), key)
                    cursor.commit()
                elif elem.tag == 'artists':
                    elem.clear()
        conn.close()
main()
                

There are many XML parsing solutions available in Python, beyond the one you now use. Python 中提供了许多XML 解析解决方案,超出了您现在使用的那个。 They can be generally categorized into " libxml2 based solutions," which rely on converting the XML into a data-structure, and " SAX based solutions," which troll through the (arbitrarily large...) XML structure and call your subroutines at strategic points.它们通常可以分为“基于libxml2的解决方案”,它依赖于将 XML 转换为数据结构,以及“基于SAX的解决方案”,它遍历(任意大......)XML 结构并在战略上调用您的子例程点。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM