Python with postgres using named variables and bulk inserts

I need some help understanding how Python and Postgres handle transactions and bulk inserts, specifically when inserting several data sets in a single transaction. Environment:

  • Windows 7 64bit
  • Python 3.2
  • Postgresql 9.1
  • psycopg2

Here is my scenario: I am converting data from one database (Oracle) into XML strings and inserting that data into a new database (Postgres). This is a large dataset, so I'm trying to optimize some of my inserts. A lot of this data I'm treating as library-type objects, so I have a library table and then tables for my XML metadata and XML content; the fields for this data are text types in the database. I pull the data out of Oracle and then create dictionaries of the data I need to insert. I have 3 insert statements: the first insert creates a record in the library table using a serial id, and that id is necessary for the relationship in the next two queries that insert the XML into the metadata and content tables. Here is an example of what I'm talking about:

for inputKey in libDataDict.keys():
  metaString = libDataDict[inputKey][0]
  contentString = libDataDict[inputKey][1]
  insertLibDataList.append({'objIdent':"%s" % inputKey, 'objName':"%s" % inputKey, 'objType':libType})
  insertMetadataDataList.append({'objIdent':inputKey,'objMetadata':metaString}) 
  insertContentDataList.append({'objIdent':inputKey, 'objContent':contentString})

dataDict['cmsLibInsert'] = insertLibDataList
dataDict['cmsLibMetadataInsert'] = insertMetadataDataList
dataDict['cmsLibContentInsert'] = insertContentDataList

sqlDict[0] = {'sqlString':"insert into cms_libraries (cms_library_ident, cms_library_name, cms_library_type_id, cms_library_status_id) \
              values (%(objIdent)s, %(objName)s, (select id from cms_library_types where cms_library_type_name = %(objType)s), \
              (select id from cms_library_status where cms_library_status_name = 'active'))", 'data':dataDict['cmsLibInsert']}

sqlDict[1] = {'sqlString':"insert into cms_library_metadata (cms_library_id, cms_library_metadata_data) values \
              ((select id from cms_libraries where cms_library_ident = %(objIdent)s), $$%(objMetadata)s$$)", \
              'data':dataDict['cmsLibMetadataInsert']}

sqlDict[2] = {'sqlString':"insert into cms_library_content (cms_library_id, cms_library_content_data) values \
              ((select id from cms_libraries where cms_library_ident = %(objIdent)s), $$%(objContent)s$$)", \
              'data':dataDict['cmsLibContentInsert']}

bulkLoadData(myConfig['pgConn'], myConfig['pgCursor'], sqlDict)

The problem I have is that when I run the first query (sqlDict[0]) and do the insert, everything works fine as long as I run it separately and commit before I run the next two. Ideally I would like all these queries in the same transaction, but it fails because it can't find the id from the cms_libraries table for the 2nd and 3rd queries. Here is my current insert code:

def bulkLoadData(dbConn, dbCursor, sqlDict):
    try:
        # Insert the library records first and commit so the serial ids exist
        # before the metadata and content inserts look them up.
        libInsertSql = sqlDict.pop(0)
        dbSql = libInsertSql['sqlString']
        data = libInsertSql['data']
        dbCursor.executemany(dbSql, data)
        dbConn.commit()

        for sqlKey in sqlDict:
            dbSql = sqlDict[sqlKey]['sqlString']
            data = sqlDict[sqlKey]['data']
            dbCursor.executemany(dbSql, data)

        dbConn.commit()
    except Exception:
        dbConn.rollback()
        raise

Previously I was appending the values into the query string and then running a separate query for each insert. When I do that I can put it all in the same transaction, and it finds the generated id and everything is fine. Why doesn't it find the id when I do the bulk insert with executemany()? Is there a way to do the bulk insert and the other two queries in the same transaction?

I have been reading the psycopg documentation and PostgreSQL's string documentation, and searching Stack Overflow and the internet, but have not found an answer to my problem.

Any help, suggestions, or comments would be appreciated. Thanks, Mitch

You have two choices here. Either generate the IDs externally (which allows you to do your bulk inserts, as sketched below) or generate them from the serial (which means you have to do single-entry inserts). I think it's pretty straightforward to figure out external ID generation and bulk loading (although I'd recommend you take a look at an ETL tool rather than hand-coding something in Python). If you need to pull IDs from the serial, then you should consider server-side prepared statements.
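
If you go the external-ID route, one possible shape (a rough sketch, assuming cms_libraries.id is a serial column backed by a sequence named cms_libraries_id_seq, which is the default naming but worth verifying) is to pre-allocate a block of ids from the sequence and carry them in the insert data, so all three executemany() calls can run in one transaction:

# Sketch only: the sequence name cms_libraries_id_seq is assumed, not confirmed.
def preallocateIds(dbCursor, count):
    dbCursor.execute(
        "select nextval('cms_libraries_id_seq') from generate_series(1, %s)",
        (count,))
    return [row[0] for row in dbCursor.fetchall()]

ids = preallocateIds(dbCursor, len(libDataDict))
libRows, metaRows, contentRows = [], [], []
for newId, inputKey in zip(ids, libDataDict.keys()):
    metaString = libDataDict[inputKey][0]
    contentString = libDataDict[inputKey][1]
    libRows.append({'id': newId, 'objIdent': inputKey, 'objName': inputKey, 'objType': libType})
    metaRows.append({'id': newId, 'objMetadata': metaString})
    contentRows.append({'id': newId, 'objContent': contentString})

# Every statement now references %(id)s directly, so nothing depends on a
# subselect finding rows inserted earlier in the same uncommitted batch.
dbCursor.executemany(
    "insert into cms_libraries (id, cms_library_ident, cms_library_name, cms_library_type_id, cms_library_status_id) "
    "values (%(id)s, %(objIdent)s, %(objName)s, "
    "(select id from cms_library_types where cms_library_type_name = %(objType)s), "
    "(select id from cms_library_status where cms_library_status_name = 'active'))",
    libRows)
dbCursor.executemany(
    "insert into cms_library_metadata (cms_library_id, cms_library_metadata_data) values (%(id)s, %(objMetadata)s)",
    metaRows)
dbCursor.executemany(
    "insert into cms_library_content (cms_library_id, cms_library_content_data) values (%(id)s, %(objContent)s)",
    contentRows)
dbConn.commit()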

Your first statement should look like the following:

dbCursor.execute("""
PREPARE cms_lib_insert (bigint, text, text) AS 
INSERT INTO cms_libraries (cms_library_ident, cms_library_name, cms_library_type_id, cms_library_status_id)
VALUES ($1, $2,
    (select id from cms_library_types where cms_library_type_name = $3), 
    (select id from cms_library_status where cms_library_status_name = 'active')
)
RETURNING cms_libraries.id
""")

You'll run this once, at startup time. Then you'll run the following EXECUTE statement once per entry:

dbCursor.execute("""
EXECUTE cms_lib_insert(%(objIdent)s, %(objName)s, %(objType)s)
""", {'objIdent': 345, 'objName': 'foo', 'objType': 'bar'})
my_new_id = dbCursor.fetchone()[0]

This will return the generated serial id. Going forward, I'd strongly recommend that you get away from the pattern you're currently following of attempting to abstract the database communications (your sqlDict approach) and go with a very direct coding pattern (clever is your enemy here; it makes performance tuning harder).

You'll want to batch your inserts into a block size that works for performance. That means tuning your BLOCK_SIZE based on your actual behavior. Your code should look something like the following:

BLOCK_SIZE = 500
while not_done:
    # psycopg2 starts a transaction implicitly on the first execute();
    # one commit per block keeps each block atomic.
    for junk in range(BLOCK_SIZE):
        dbCursor.execute("EXECUTE cms_lib_insert(...)")
        cms_lib_id = dbCursor.fetchone()[0]     # you're using this in the two inserts below
        dbCursor.execute("EXECUTE metadata_insert(...)")
        dbCursor.execute("EXECUTE content_insert(...)")
    dbConn.commit()
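
For concreteness, a minimal runnable version of that loop might look like the following. This is a sketch under a few assumptions: metadata_insert and content_insert are server-side prepared statements you've created the same way as cms_lib_insert above (two parameters each, the library id plus the text payload), and each entry dict carries the ident, name, type, metadata, and content together.

# Build one dict per entry from the libDataDict structure in the question.
allEntries = [
    {'objIdent': k, 'objName': k, 'objType': libType,
     'objMetadata': v[0], 'objContent': v[1]}
    for k, v in libDataDict.items()
]

BLOCK_SIZE = 500

def loadBlock(dbConn, dbCursor, entries):
    # psycopg2 opens a transaction implicitly on the first execute();
    # the single commit() at the end makes the whole block atomic.
    for entry in entries:
        dbCursor.execute(
            "EXECUTE cms_lib_insert(%(objIdent)s, %(objName)s, %(objType)s)", entry)
        cms_lib_id = dbCursor.fetchone()[0]   # serial id from RETURNING
        dbCursor.execute(
            "EXECUTE metadata_insert(%s, %s)", (cms_lib_id, entry['objMetadata']))
        dbCursor.execute(
            "EXECUTE content_insert(%s, %s)", (cms_lib_id, entry['objContent']))
    dbConn.commit()

for start in range(0, len(allEntries), BLOCK_SIZE):
    loadBlock(myConfig['pgConn'], myConfig['pgCursor'], allEntries[start:start + BLOCK_SIZE])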

If you need to achieve performance levels higher than this, the next step is building an insert handler function which takes arrays of rows for the dependent tables. I do not recommend doing this as it quickly becomes a maintenance nightmare.
