简体   繁体   中英

XML to Postgres via python/psycopg2

I have an existing python script that loops through a directory of XML files parsing each file using etree, and inserting data at different points into a Postgres database schema using psycopg2 module. This hacked together script worked just fine but now the amount of data (number and size of XML files) is growing rapidly, and the number of INSERT statements is just not scaling. The largest table in my final database has grown to about ~50 million records from about 200,000 XML files. So my question is, what is the most efficient way to:

  1. Parse data out of XMLs
  2. Assemble row(s)
  3. Insert row(s) to Postgres

Would it be faster to write all the data to a CSV in the correct format and then bulk load the final CSV tables to Postgres using COPY_FROM command?

Otherwise I was thinking about populating some sort of temporary data structure in memory that I could insert into the DB once it reaches a certain size? I am just having trouble arriving at the specifics of how this would work.

Thanks for any insight on this topic, and please let me know if more information is needed to answer my question.

copy_from is the fastest way I found to do bulk inserts. You might be able to get away with streaming the data through a generator to stay away from writing temporary files while keeping memory usage low.

A generator function could assemble rows out of the XML data, then consume that generator with copy_from. You may even want multiple levels of generators such that you can have one which yields records from a single file and another which composes those from all 200,000 files. You'd end up with a single query which will be much faster than 50,000,000.

I wrote an answer here with links to example and benchmark code for setting something similar up.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM