I've been looking into multiprocessing or multithreading in Python to run about a dozen SQL queries in parallel. Right now the queries run serially and take about 4 minutes total, with one query taking as long as the other eleven combined. So in theory I could cut my total run time roughly in half by running all the queries in parallel.
I'm trying to do something along the lines of the following, but I haven't been able to find documentation confirming whether it's really possible with my current approach:
So, say I have:
SSMS_query1 = "SELECT * FROM TABLE1"
SSMS_query2 = "SELECT * FROM TABLE2"
HANADB_query3 = "SELECT * FROM TABLE3"
So to connect to SSMS I use:
import pyodbc
server = "server_name"
cnxn = pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes")
Then to connect to my HANA DBs I use:
from hdbcli import dbapi
conn = dbapi.connect(address="", port=..., user="", password="")  # fill in your HANA credentials
Then essentially I want to do something where I can take advantage of pooling to save time, like:
import pandas as pd
with cnxn as ssms, conn as hana:
    df1 = pd.read_sql(SSMS_query1, ssms)
    df2 = pd.read_sql(SSMS_query2, ssms)
    df3 = pd.read_sql(HANADB_query3, hana)
I've tried using:
import multiprocessing
import threading
But I can't get the desired output, because eventually I want to write df1, df2, and df3 to Excel. How do I store the dataframes and use them as output later on while still using parallelism?
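One way to keep named results while the queries run concurrently is `concurrent.futures.ThreadPoolExecutor`: submit each query, keep the futures in a dict keyed by the name you want, then collect the dataframes afterwards. Here is a minimal sketch of that pattern; the `fetch` helper is a stand-in for the real `pd.read_sql` call so it can run without a database, and the query names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fetch(query):
    # Stand-in for pd.read_sql(query, conn); returns a tiny DataFrame
    # so the pattern can be demonstrated without a database.
    return pd.DataFrame({"query": [query], "rows": [0]})


queries = {
    "df1": "SELECT * FROM TABLE1",
    "df2": "SELECT * FROM TABLE2",
    "df3": "SELECT * FROM TABLE3",
}

# Submit all queries at once; they run concurrently in worker threads.
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    futures = {name: pool.submit(fetch, sql) for name, sql in queries.items()}
    # .result() blocks until that particular query has finished.
    frames = {name: fut.result() for name, fut in futures.items()}

# frames["df1"], frames["df2"], frames["df3"] are ordinary DataFrames
# that can be written to Excel later.
```

Because the futures are collected into a plain dict, the dataframes outlive the thread pool and can be used anywhere downstream.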
I would think that multithreading would be more efficient than multiprocessing here. I don't know precisely how large the resulting dataframes are, but multiprocessing generally adds significant overhead moving results from a child process back to the main process, and since the queries take 4 minutes the amount of data is presumably fairly large. Besides, much of the time is spent in network I/O, for which multithreading is well suited.
Here I am assuming the worst case, where a database connection cannot be shared among threads. If that is not the case, then create only one connection and use it for all submitted tasks:
from multiprocessing.pool import ThreadPool
import time

import pandas as pd
import pyodbc
from hdbcli import dbapi


def run_sql(conn, sql):
    # The GIL is released while waiting on the network, so these
    # calls can overlap across threads.
    return pd.read_sql(sql, conn)


def main():
    SSMS_query1 = "SELECT * FROM TABLE1"
    SSMS_query2 = "SELECT * FROM TABLE2"
    HANADB_query3 = "SELECT * FROM TABLE3"
    queries = (SSMS_query1, SSMS_query2, HANADB_query3)
    n_queries = len(queries)

    server = "server_name"
    # One connection per thread. The third query runs against HANA,
    # so it gets an hdbcli connection rather than a pyodbc one.
    connections = [
        pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes"),
        pyodbc.connect("DRIVER={SQL Server};SERVER=" + server + ";trusted_connection=Yes"),
        dbapi.connect(address="", port=..., user="", password=""),  # fill in your HANA credentials
    ]

    t0 = time.time()
    # One thread per query:
    with ThreadPool(n_queries) as pool:
        results = pool.starmap(run_sql, zip(connections, queries))
    t1 = time.time()

    print(results)
    print(t1 - t0)


if __name__ == '__main__':
    main()
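Since the end goal is Excel, the collected dataframes can then be written to a single workbook with `pandas.ExcelWriter`, one sheet per query. A sketch of that final step; the file name and sheet names are just examples, and writing .xlsx requires the openpyxl package to be installed:

```python
import pandas as pd

# Example frames standing in for the query results:
frames = {
    "table1": pd.DataFrame({"a": [1, 2]}),
    "table2": pd.DataFrame({"b": [3, 4]}),
    "table3": pd.DataFrame({"c": [5, 6]}),
}

# One sheet per dataframe in a single workbook.
with pd.ExcelWriter("query_results.xlsx") as writer:
    for sheet_name, df in frames.items():
        df.to_excel(writer, sheet_name=sheet_name, index=False)
```

The writing itself stays serial, which is fine: it is fast compared with the queries, and Excel files cannot be safely written from multiple threads at once.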