Python: asynchronous read_sql in pandas

I want to speed up fetching data from a database by splitting the query into four parts. I wrote the following code using apply_async, but calling get() raises a pickling error. What should I do? Thank you very much.

import datetime
from multiprocessing import Pool

import numpy as np
import pandas as pd
import pyodbc

pool = Pool(processes=4)
res = []
start_date = datetime.datetime(2017, 1, 1)
end_date = datetime.datetime(2017, 6, 30)
period = (end_date-start_date)/4
conn = pyodbc.connect(
    r'DRIVER={SQL Server};'
    r'SERVER=abc;'
    r'PORT=111;'
    r'DATABASE=db;'
    r'UID=abc;'
    r'PWD=xyz;'
    r'TDS_Version=7.1'
    )

for p in np.arange(start_date, end_date, period).astype(datetime.datetime):
    sql = "SELECT * FROM db where date between \'" +  str(p) +  "\' and \'" +  str(p + period) + "\'"
    res.append(pool.apply_async(lambda x: pd.read_sql(x[0], con = x[1]), ([sql, conn],)))      # runs in *only* one process
pool.close() 

res[0].get()#<-------PicklingError: Can't pickle <function <lambda> at 0x00000045566BDAE8>: attribute lookup <lambda>

You need to open the connection inside each subprocess: replace the "lambda x..." with a regular module-level function that connects to the server and then sends the query. A lambda cannot be pickled, which is what get() is complaining about, and a single connection cannot be opened once and shared between the subprocesses.
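
A minimal sketch of that suggestion, assuming the same placeholder connection string as in the question. The names `split_range`, `fetch_chunk` and `fetch_all` are made up for illustration, and the query is changed to a parameterized, half-open date range (instead of the string-built BETWEEN) so that rows falling exactly on a boundary are not fetched twice:

```python
import datetime
from multiprocessing import Pool

# Placeholder connection string copied from the question; adjust for your server.
CONN_STR = (
    r"DRIVER={SQL Server};"
    r"SERVER=abc;"
    r"PORT=111;"
    r"DATABASE=db;"
    r"UID=abc;"
    r"PWD=xyz;"
    r"TDS_Version=7.1"
)


def split_range(start, end, parts):
    """Split [start, end) into `parts` consecutive (lo, hi) intervals."""
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def fetch_chunk(bounds):
    """Runs in a worker process: open a private connection and run one query."""
    # Imported inside the worker so this module can be imported even where
    # the driver is not installed (a layout choice for this sketch only).
    import pandas as pd
    import pyodbc

    lo, hi = bounds
    conn = pyodbc.connect(CONN_STR)
    try:
        sql = "SELECT * FROM db WHERE date >= ? AND date < ?"
        return pd.read_sql(sql, conn, params=[lo, hi])
    finally:
        conn.close()


def fetch_all(start, end, parts=4):
    """Fetch each sub-range in its own process and concatenate the results."""
    import pandas as pd

    with Pool(processes=parts) as pool:
        frames = pool.map(fetch_chunk, split_range(start, end, parts))
    return pd.concat(frames, ignore_index=True)
```

Because `fetch_chunk` is a named top-level function and only receives plain datetime values, it can be pickled and sent to the workers. On Windows, call `fetch_all(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 6, 30))` from under an `if __name__ == "__main__":` guard, since each worker process re-imports the module.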

Alternatively, you can replace pyodbc with aioodbc (https://github.com/aio-libs/aioodbc), which lets you implement this with asyncio.
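
The asyncio route might look roughly like the sketch below. It assumes aioodbc's `connect(dsn=...)`, `cursor()`, `execute()` and `fetchall()` coroutines as shown in that project's README. `fetch_rows` and `fetch_ranges` are hypothetical helper names, and the table and column names are the ones from the question. Whether this is actually faster than separate processes depends on the driver and server, so treat it as a starting point only:

```python
import asyncio


async def fetch_rows(dsn, lo, hi):
    """Run one date-bounded query on its own aioodbc connection."""
    import aioodbc  # third-party; imported lazily so fetch_ranges works without it

    conn = await aioodbc.connect(dsn=dsn)
    try:
        cur = await conn.cursor()
        await cur.execute("SELECT * FROM db WHERE date >= ? AND date < ?", lo, hi)
        rows = await cur.fetchall()
        columns = [col[0] for col in cur.description]
        await cur.close()
        # Plain tuples plus column names are easy to hand to pd.DataFrame later.
        return columns, [tuple(r) for r in rows]
    finally:
        await conn.close()


async def fetch_ranges(fetch, ranges):
    """Start fetch(lo, hi) for every range at once; results keep the input order."""
    return await asyncio.gather(*(fetch(lo, hi) for lo, hi in ranges))
```

To use it, bind the connection string first, for example `asyncio.run(fetch_ranges(functools.partial(fetch_rows, dsn), ranges))`, and build one DataFrame per `(columns, rows)` pair before concatenating them.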
