I want to write code that reads several pandas DataFrames asynchronously, for example from CSV files (or from a database).
I wrote the following code, assuming it would import the two DataFrames faster; however, it actually seems to be slower:
import timeit
import pandas as pd
import asyncio

train_to_save = pd.DataFrame(data={'feature1': [1, 2, 3], 'period': [1, 1, 1]})
test_to_save = pd.DataFrame(data={'feature1': [1, 4, 12], 'period': [2, 2, 2]})
train_to_save.to_csv('train.csv')
test_to_save.to_csv('test.csv')

async def run_async_train():
    return pd.read_csv('train.csv')

async def run_async_test():
    return pd.read_csv('test.csv')

async def run_train_test_async():
    df = await asyncio.gather(run_async_train(), run_async_test())
    return df

start_async = timeit.default_timer()
async_train, async_test = asyncio.run(run_train_test_async())
finish_async = timeit.default_timer()
time_to_run_async = finish_async - start_async

start = timeit.default_timer()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
finish = timeit.default_timer()
time_to_run_without_async = finish - start

print(time_to_run_async < time_to_run_without_async)
Why does the non-async version read the two data frames faster?
Just to be clear: I'm actually going to read the data from BigQuery, so I'm really interested in speeding up both requests (train & test) using the code above.
Thanks in advance!
pd.read_csv isn't an async method, so I don't believe you're actually getting any parallelism out of this. You'd need to use an async file library like aiofiles to read the files into buffers asynchronously, then pass those to pd.read_csv(...).
Note that most filesystems aren't really async, so aiofiles is functionally a thread pool. However, it will still likely be faster than reading serially.
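Since a thread pool is what you end up with anyway, an alternative sketch (my addition, not from the original answer, and requiring Python 3.9+) is to push the blocking pd.read_csv calls onto threads directly with asyncio.to_thread. Again, the file-writing lines just recreate the question's CSVs:

```python
import asyncio

import pandas as pd

# Recreate the question's CSV files so this sketch runs on its own.
pd.DataFrame({'feature1': [1, 2, 3], 'period': [1, 1, 1]}).to_csv('train.csv', index=False)
pd.DataFrame({'feature1': [1, 4, 12], 'period': [2, 2, 2]}).to_csv('test.csv', index=False)

async def read_all(paths):
    # asyncio.to_thread runs each blocking read_csv call in the default
    # thread pool, so the two file reads can overlap.
    return await asyncio.gather(
        *(asyncio.to_thread(pd.read_csv, p) for p in paths)
    )

train, test = asyncio.run(read_all(['train.csv', 'test.csv']))
```

The same pattern should work for BigQuery: wrap each blocking "query and fetch DataFrame" call in asyncio.to_thread and gather them.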
Here's an example I had with aiohttp, fetching CSVs from URLs:
import io
import asyncio
import aiohttp
import pandas as pd

async def get_csv_async(client, url):
    # Send a request.
    async with client.get(url) as response:
        # Read the entire response text and make it file-like using StringIO().
        with io.StringIO(await response.text()) as text_io:
            return pd.read_csv(text_io)

async def get_all_csvs_async(urls):
    async with aiohttp.ClientSession() as client:
        # First create all the futures at once...
        futures = [get_csv_async(client, url) for url in urls]
        # ...then wait for all of them to complete.
        return await asyncio.gather(*futures)

urls = [
    # Some random CSV URLs from the internet
    'https://people.sc.fsu.edu/~jburkardt/data/csv/hw_25000.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv',
]

if __name__ == '__main__':
    # Run the event loop.
    # In Python 3.7+ you can just do `csvs = asyncio.run(get_all_csvs_async(urls))`.
    csvs = asyncio.get_event_loop().run_until_complete(get_all_csvs_async(urls))
    for csv in csvs:
        print(csv)