
Python - Parallel SQL Queries and return dataframes for each

So I've been looking into multiprocessing and parallel processing in Python to run about a dozen SQL queries. Right now the queries run serially and take about 4 minutes total, where one query takes as long as the other 11 combined. So in theory I could at least halve the total run time if I could run all the queries in parallel.

I'm trying to do something along the lines of the following, and I haven't been able to find documentation confirming whether it's actually possible the way I'm imagining it:

So, say I have:

SSMS_query1 = "SELECT * FROM TABLE1"

SSMS_query2 = "SELECT * FROM TABLE2"

HANADB_query3 = "SELECT * FROM TABLE3"

So to connect to SSMS I use:

import pyodbc
server = "server_name"
cnxn = pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes")

Then to connect to my HANAdb's I use:

from hdbcli import dbapi
conn = dbapi.connect(address="", port=, user="", password="")

Then essentially I want to do something where I can take advantage of pooling to save time, like:

import pandas as pd
with cnxn as ssms, conn as hana:
    df1 = pd.read_sql(SSMS_query1, ssms)
    df2 = pd.read_sql(SSMS_query2, ssms)
    df3 = pd.read_sql(HANADB_query3, hana)

I've tried using:

import multiprocessing
import threading

But I can't get the desired output, because eventually I want to write df1, df2, and df3 out to Excel. So how do I store the dataframes and use them as output later on when running the queries in parallel?

Not knowing precisely how large the resulting dataframes are, I would expect multithreading to be more efficient than multiprocessing: multiprocessing generally carries a lot of overhead in moving results from a child process back to the main process. And since the queries take 4 minutes, the amount of data is presumably fairly large. Besides, much of that time is spent on network activity, for which multithreading is well-suited.
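That last point can be demonstrated with simulated I/O: `time.sleep` releases the GIL much as a database driver does while it blocks on the network, so the 0.2-second "queries" below (stand-ins, not real database calls) overlap instead of adding up:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_query(seconds):
    # Stand-in for a blocking database call; sleep releases the GIL.
    time.sleep(seconds)
    return seconds

start = time.time()
with ThreadPoolExecutor(max_workers=3) as ex:
    results = list(ex.map(fake_query, [0.2, 0.2, 0.2]))
elapsed = time.time() - start

# Serially these would take ~0.6 s; threaded they finish in ~0.2 s.
print(results, round(elapsed, 2))
```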

Here I am assuming the worst case where a database connection cannot be shared among threads. If that is not the case, then create only one connection and use it for all submitted tasks:

from multiprocessing.pool import ThreadPool
import time

import pandas as pd
import pyodbc
from hdbcli import dbapi

def run_sql(conn, sql):
    # Each task runs one query on its own connection.
    return pd.read_sql(sql, conn)

def main():
    SSMS_query1 = "SELECT * FROM TABLE1"
    SSMS_query2 = "SELECT * FROM TABLE2"
    HANADB_query3 = "SELECT * FROM TABLE3"

    queries = (SSMS_query1, SSMS_query2, HANADB_query3)
    n_queries = len(queries)

    server = "server_name"
    # One connection per query, each pointing at the database its query
    # targets: two SQL Server connections and one HANA connection.
    connections = [
        pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes"),
        pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes"),
        dbapi.connect(address="", port=0, user="", password=""),  # fill in your HANA details
    ]

    t0 = time.time()
    # One thread per query; starmap pairs each connection with its query:
    with ThreadPool(n_queries) as pool:
        df1, df2, df3 = pool.starmap(run_sql, zip(connections, queries))
    t1 = time.time()
    print(t1 - t0)

    for c in connections:
        c.close()

if __name__ == '__main__':
    main()
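If your driver does allow one connection to be shared across threads (check its documentation; with pyodbc the connection is generally usable from multiple threads, though a cursor should not be), the pattern collapses to a single connection passed to every task. A minimal sketch of that case, using an in-memory SQLite database as a stand-in for the real server (the tables and data here are invented for illustration):

```python
from multiprocessing.pool import ThreadPool
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the real database; check_same_thread=False
# lets this one connection be used from the pool's worker threads.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE t1 (x INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,)])
conn.execute("CREATE TABLE t2 (y INTEGER)")
conn.execute("INSERT INTO t2 VALUES (3)")
conn.commit()

queries = ["SELECT * FROM t1", "SELECT * FROM t2"]

# All tasks share the single connection:
with ThreadPool(len(queries)) as pool:
    df_a, df_b = pool.map(lambda sql: pd.read_sql(sql, conn), queries)

print(len(df_a), len(df_b))
```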

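As for the final step of getting df1, df2, and df3 into Excel: collect the pool's results into a dict keyed by sheet name, then write them all into one workbook with pandas' `ExcelWriter` (which needs an engine such as openpyxl installed). The dataframes and sheet names below are placeholders standing in for the pool's actual output:

```python
import pandas as pd

# Placeholder results standing in for df1, df2, df3 from the pool:
dfs = {
    "Table1": pd.DataFrame({"x": [1, 2]}),
    "Table2": pd.DataFrame({"y": [3]}),
    "Table3": pd.DataFrame({"z": [4, 5, 6]}),
}

def write_workbook(dfs, path):
    # One workbook, one sheet per query result.
    with pd.ExcelWriter(path) as writer:
        for sheet, df in dfs.items():
            df.to_excel(writer, sheet_name=sheet, index=False)

# Usage (uncomment to actually produce the file):
# write_workbook(dfs, "query_results.xlsx")
```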