Is there a more efficient way to parse XML to a database with python?

I am using the ElementTree and pyodbc libraries to parse a rather large (400 MB) XML file with Python. The code works, and the correct information is sent to the correct SQL Server tables. The problem is that it is very slow. I expect parsing a file of this size to be relatively slow, but it only manages about 100,000 rows an hour.

Do I (a) need to accept the runtime and use powerful dedicated VMs to process the file, or (b) need to make the code more efficient?

import xml.etree.ElementTree as ET
import pyodbc
import time

source_file_artist = 'discogs_20120101_artists.xml'


conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=my server;'
                      'Database=DiscogsDatabase;'
                      'Trusted_Connection=yes;')

def main():
    cursor = conn.cursor()
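    # Open in binary mode and pass the file object to iterparse, so the
    # parser reads from this handle and detects the encoding itself.
    with open(source_file_artist, 'rb') as sf:
        xml_iter = ET.iterparse(sf, events=('start', 'end'))
        # nameCheck records which parent the next <name> belongs to:
        # 1 = the artist itself, 2 = namevariations, 3 = aliases, 4 = members
        nameCheck = 0
        key = ''  # ArtistID of the artist currently being processed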
        for event, elem in xml_iter:
            if event =='start':
                if elem.tag =='namevariations':
                    nameCheck = 2
                elif elem.tag == 'aliases':
                    nameCheck = 3 
                elif elem.tag =='members':
                    nameCheck = 4

            elif event == 'end':
                if elem.tag == 'id':
                    elemtext = '%s' % elem.text
                    #print(elemtext)
                    cursor.execute('INSERT INTO [DiscogsDatabase].[dbo].[ARTIST](ArtistID) VALUES(?)', int(elemtext))
                    cursor.commit()
                    key = elemtext
                    nameCheck = 1

                elif elem.tag == 'name':
                    if nameCheck == 1:      
                        elemtext = '%s' % elem.text
                        #print(elemtext)
                        cursor.execute('UPDATE [DiscogsDatabase].[dbo].[ARTIST] SET ArtistName = ? WHERE ArtistID = ?', elemtext, key)
                        cursor.commit()
                        nameCheck = 0

                    elif nameCheck == 2:
                        elemtext =  '%s' % elem.text
                        #print(elemtext)
                        cursor.execute('INSERT INTO [DiscogsDatabase].[dbo].[NAMEVARIATION](ArtistID, VariationName) VALUES(?, ?)', key, elemtext)
                    
                    elif nameCheck == 3:
                        elemtext = '%s' % elem.text
                        #print(elemtext)
                        cursor.execute('INSERT INTO [DiscogsDatabase].[dbo].[ALIAS](ArtistID, AliasName) VALUES(?, ?)', key, elem.text)
                        cursor.commit()

                    elif nameCheck == 4:
                        elemtext = '%s' % elem.text
                        #print(elemtext)
                        cursor.execute('INSERT INTO [DiscogsDatabase].[dbo].[MEMBERS](ArtistID, MemberName) VALUES(?, ?)', key, elemtext)
                        cursor.commit()

                elif elem.tag == 'realname':
                    elemtext = '%s' % elem.text
                    #print(elemtext)
                    cursor.execute('UPDATE [DiscogsDatabase].[dbo].[ARTIST] SET ArtistRealName = ? WHERE ArtistID = ?', elemtext, key)
                    cursor.commit()
                
                elif elem.tag == 'profile':
                    elemtext = '%s' % elem.text
                    #print(elemtext)
                    cursor.execute('UPDATE [DiscogsDatabase].[dbo].[ARTIST] SET Profile = ? WHERE ArtistID = ?', elemtext.strip(), key)
                    cursor.commit()
                elif elem.tag == 'artists':
                    elem.clear()
        conn.close()
main()
                

There are many XML parsing solutions available in Python beyond the one you use now. They can be broadly grouped into tree-building solutions (for example, those based on libxml2), which convert the XML into an in-memory data structure, and SAX-based solutions, which trawl through the (arbitrarily large) XML stream and call your subroutines at strategic points.
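
To make the SAX style concrete, here is a minimal sketch using the standard library's xml.sax module. The tag names id and name come from the code in the question; the wrapping artist element, the ArtistHandler class, and the SAMPLE document are assumptions made up for illustration, so adjust them to the real file. The handler only collects rows in a list, which keeps the parsing step separate from whatever database step follows.

```python
import xml.sax


class ArtistHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects (id, name) for each <artist> element."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (artist_id, artist_name) tuples
        self._path = []      # stack of currently open tag names
        self._text = []      # character chunks of the current element
        self._current = {}   # fields gathered for the artist being read

    def startElement(self, name, attrs):
        self._path.append(name)
        self._text = []
        if name == 'artist':
            self._current = {}

    def characters(self, content):
        # SAX may deliver a single text node in several chunks.
        self._text.append(content)

    def endElement(self, name):
        text = ''.join(self._text).strip()
        parent = self._path[-2] if len(self._path) > 1 else None
        # Only direct children of <artist> are stored, so <name> elements
        # nested in <aliases> or <members> are ignored in this sketch.
        if parent == 'artist' and name in ('id', 'name'):
            self._current[name] = text
        elif name == 'artist':
            self.rows.append((int(self._current['id']),
                              self._current.get('name')))
        self._path.pop()
        self._text = []


# Made-up sample data; the real file would be passed to xml.sax.parse().
SAMPLE = b"""<artists>
  <artist><id>1</id><name>Example Artist</name>
    <aliases><name>Other Alias</name></aliases>
  </artist>
  <artist><id>2</id><name>Second &amp; Co</name></artist>
</artists>"""

handler = ArtistHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.rows)
```

For a file on disk, xml.sax.parse(filename, handler) drives the same handler without loading the whole document into memory. Whether this is faster than the current approach depends on where the time is actually spent, so it is worth measuring before switching.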
