python asynchronous read_sql in pandas

Question

I want to speed up fetching data from a database by splitting the query into 4 parts. I wrote the following code using apply_async. However, calling get() raises a pickling error. What should I do? Thank you very much.

import datetime
from multiprocessing import Pool

import numpy as np
import pandas as pd
import pyodbc

pool = Pool(processes=4)
start_date = datetime.datetime(2017, 1, 1)
end_date = datetime.datetime(2017, 6, 30)
period = (end_date-start_date)/4
conn = pyodbc.connect(
    r'DRIVER={SQL Server};'
    r'SERVER=abc;'
    r'PORT=111;'
    r'DATABASE=db;'
    r'UID=abc;'
    r'PWD=xyz;'
    r'TDS_Version=7.1'
    )

res = []  # collects the AsyncResult objects returned by apply_async
for p in np.arange(start_date, end_date, period).astype(datetime.datetime):
    sql = "SELECT * FROM db where date between \'" +  str(p) +  "\' and \'" +  str(p + period) + "\'"
    res.append(pool.apply_async(lambda x: pd.read_sql(x[0], con = x[1]), ([sql, conn],)))      # runs in *only* one process
pool.close() 

res[0].get()  # <--- PicklingError: Can't pickle <function <lambda> at 0x00000045566BDAE8>: attribute lookup <lambda>

Answer 1

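You need to move the connection into each subprocess: replace your "lambda x..." with a module-level function that connects to the server and then runs the query. multiprocessing pickles the function it sends to the workers, and a lambda cannot be pickled, which is what the PicklingError reports. You also cannot open a single connection and share it between the subprocesses, because a pyodbc connection cannot be pickled either.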
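A minimal sketch of that approach, assuming the same placeholder connection string and table as in the question. The names split_range, fetch_chunk and fetch_all are hypothetical helpers, not part of pandas or pyodbc. The query uses half-open intervals (>= lower bound, < upper bound) rather than BETWEEN so that rows on a chunk boundary are not fetched twice; this means rows stamped exactly at the end date are left out.

```python
import datetime
from multiprocessing import Pool

# Assumption: placeholder connection string copied from the question.
CONN_STR = (
    r'DRIVER={SQL Server};'
    r'SERVER=abc;'
    r'PORT=111;'
    r'DATABASE=db;'
    r'UID=abc;'
    r'PWD=xyz;'
    r'TDS_Version=7.1'
)


def split_range(start, end, parts):
    """Split [start, end] into `parts` consecutive (lo, hi) intervals."""
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def fetch_chunk(bounds):
    """Runs inside a worker process: open a private connection, run one query."""
    # Imported here so the module can be loaded even where the driver is absent.
    import pandas as pd
    import pyodbc

    lo, hi = bounds
    conn = pyodbc.connect(CONN_STR)  # one connection per task, never shared
    try:
        sql = "SELECT * FROM db WHERE date >= ? AND date < ?"
        return pd.read_sql(sql, con=conn, params=[lo, hi])
    finally:
        conn.close()


def fetch_all(start, end, parts=4):
    """Fetch every chunk in its own process and concatenate the results."""
    import pandas as pd

    with Pool(processes=parts) as pool:
        # fetch_chunk is a module-level function, so it can be pickled.
        frames = pool.map(fetch_chunk, split_range(start, end, parts))
    return pd.concat(frames, ignore_index=True)
```

Call fetch_all(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 6, 30)) from under an if __name__ == "__main__": guard; on Windows the guard is required, because each worker re-imports the main module.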

Alternatively, you can replace pyodbc with aioodbc (https://github.com/aio-libs/aioodbc), which lets you implement this with asyncio instead of multiple processes.
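A rough sketch of the asyncio variant, again assuming the question's placeholder connection string and table. fetch_chunk and fetch_all are hypothetical names. pd.read_sql is not used here because it expects a synchronous connection, so the rows are fetched through an aioodbc cursor and turned into a DataFrame by hand.

```python
import asyncio
import datetime

# Assumption: placeholder connection string copied from the question.
DSN = (
    r'DRIVER={SQL Server};'
    r'SERVER=abc;'
    r'PORT=111;'
    r'DATABASE=db;'
    r'UID=abc;'
    r'PWD=xyz;'
    r'TDS_Version=7.1'
)


async def fetch_chunk(lo, hi):
    """Open a dedicated connection and fetch the rows with lo <= date < hi."""
    import aioodbc  # imported lazily; third-party package

    conn = await aioodbc.connect(dsn=DSN)
    try:
        cur = await conn.cursor()
        await cur.execute("SELECT * FROM db WHERE date >= ? AND date < ?", lo, hi)
        columns = [col[0] for col in cur.description]
        rows = await cur.fetchall()
        await cur.close()
    finally:
        await conn.close()
    # Convert driver row objects to plain tuples before handing them to pandas.
    return columns, [tuple(row) for row in rows]


async def fetch_all(start, end, parts=4):
    """Run all chunk queries concurrently and concatenate the results."""
    import pandas as pd

    step = (end - start) / parts
    tasks = [
        fetch_chunk(start + i * step, start + (i + 1) * step)
        for i in range(parts)
    ]
    results = await asyncio.gather(*tasks)
    frames = [pd.DataFrame.from_records(rows, columns=cols) for cols, rows in results]
    return pd.concat(frames, ignore_index=True)
```

It would be run with asyncio.run(fetch_all(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 6, 30))). Because everything stays in one process, nothing has to be pickled; the queries overlap while each one waits on the server.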
