I'm working with a 400,000-row dataframe (it's actually bigger, but for testing purposes I'm using this size).
I need to export to txt/csv multiple files based on two conditions: #RIC and Date.
Looping over these conditions is a really slow process, so I'm looking for a faster way to do this.
That's my original idea:
import time
import pandas as pd

def SaveTxt(df, output_folder=None):
    # Start time
    start_time = time.time()
    # Add a plain date column derived from the timestamp
    df['Date'] = pd.to_datetime(df['Date-Time']).dt.date
    dates = df['Date'].unique()
    ticks = df['#RIC'].unique()
    for tick in ticks:
        for date in dates:
            # Filtering by instrument and date
            temp_df = df[(df['#RIC'] == tick) & (df['Date'] == date)]
            if not temp_df.empty:
                # Saving files
                if output_folder in [None, ""]:
                    temp_df.to_csv("%s_%s.txt" % (date, tick))
                else:
                    temp_df.to_csv("%s\\%s_%s.txt" % (output_folder, date, tick))
    # Elapsed time
    elapsed_time = time.time() - start_time
    elapsed_time = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
    # Printing elapsed time
    print('Elapsed time: %s' % elapsed_time)
For 400,000 rows (equivalent to 5 days of data), this script takes 3 minutes to run. For one year it takes 6 hours, and I haven't tried 10 years, but I suppose that's not a good idea.
Solution Idea
I've tried to remove the data used in each loop iteration from df, but this condition isn't working (the idea being that shrinking the data frame on each pass should make the code faster):
df = df[(df['#RIC'] != tick) & (df['Date'] != date)]
I believe this should remove every row matching the tick AND date from the data frame, but it's applying the conditions separately.
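For reference, negating a conjunction requires De Morgan's law: to drop rows where BOTH columns match, you negate the whole `(A) & (B)` expression rather than negating each condition. A minimal sketch with a toy frame (values assumed for illustration):

```python
import pandas as pd

# Toy frame with the question's two key columns
df = pd.DataFrame({
    "#RIC": ["DIJF21", "DIJF21", "DIJF20", "DIJF20"],
    "Date": ["16/10/2019", "15/10/2019", "16/10/2019", "15/10/2019"],
})
tick, date = "DIJF21", "16/10/2019"

# Original attempt: keeps only rows where BOTH columns differ,
# so rows matching just one of the two conditions are removed too
wrong = df[(df["#RIC"] != tick) & (df["Date"] != date)]

# De Morgan: drop only rows matching tick AND date
right = df[~((df["#RIC"] == tick) & (df["Date"] == date))]

print(len(wrong), len(right))  # 1 3
```

Here only one row matches both conditions, so the correct filter keeps three rows, while the original attempt keeps just one.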
I'd appreciate it if you have a solution for this problem.
Thanks
Edit
I don't know if this is the best way to share a sample of the data (I can't upload from behind a proxy):
DIJF21  16/10/2019  4.64   15
DIJF21  16/10/2019  4.64   40
DIJF21  16/10/2019  4.64   100
DIJF21  16/10/2019  4.64   5
DIJF21  16/10/2019  4.64   1765
DIJF21  16/10/2019  4.64   10
DIJF21  16/10/2019  4.64   100
DIJF21  16/10/2019  4.64   1000
DIJF21  16/10/2019  4.64   5
DIJF21  16/10/2019  4.64   20
DIJF21  16/10/2019  4.64   80
DIJF21  16/10/2019  4.64   25
DIJF21  16/10/2019  4.64   25
DIJF21  16/10/2019  4.64   150
DIJF20  15/10/2019  4.905  2000
DIJF20  15/10/2019  4.905  2000
DIJF20  15/10/2019  4.903  10
I suggest you consider coroutines: https://docs.python.org/3/library/asyncio-task.html
Something like this:
import asyncio
import pandas as pd

df['Date'] = pd.to_datetime(df['Date-Time']).dt.date
dates = df['Date'].unique()
ticks = df['#RIC'].unique()

async def tick_func(tick):
    for date in dates:
        temp_df = df[(df['#RIC'] == tick) & (df['Date'] == date)]
        if not temp_df.empty:
            if output_folder in [None, ""]:
                temp_df.to_csv("%s_%s.txt" % (date, tick))
            else:
                temp_df.to_csv("%s\\%s_%s.txt" % (output_folder, date, tick))

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
tasks = [tick_func(tick) for tick in ticks]
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
I did a quick pass through the question, and it seems the bottleneck is the doubly nested for loop you are using to group the data by tick and date.
Maybe you could consider performing the grouping in a single call using the groupby function. The code would look something like this:
grouped_df = df.groupby(['#RIC', 'Date'])
Print grouped_df to make sure it looks the way you expect. Then you can iterate over this grouped dataframe once and save the different groups to the filesystem (as desired).
Please let me know if this works or if you face any other problems.
Edit: To follow up on @Thales' comment, there are some online resources that discuss how to save large dataframes to a csv file. From these resources, I like the suggestion of using numpy.
Following is an example (taken from one of the links shared above):
aa.to_csv('pandas_to_csv', index=False)
# 6.47 s

# df2csv is a custom helper defined in the linked answer
df2csv(aa, 'code_from_question', myformats=['%d', '%.1f', '%.1f', '%.1f'])
# 4.59 s

from numpy import savetxt

savetxt(
    'numpy_savetxt', aa.values, fmt='%d,%.1f,%.1f,%.1f',
    header=','.join(aa.columns), comments=''
)
# 3.5 s
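The snippet above references `aa` and a `df2csv` helper from the linked answer. A minimal self-contained sketch of the `numpy.savetxt` idea, with toy data and an assumed column layout:

```python
import numpy as np
import pandas as pd

# Toy numeric frame; real data would have many more rows
df = pd.DataFrame({
    "Price": [4.64, 4.905, 4.903],
    "Volume": [15, 2000, 10],
})

# savetxt writes the raw ndarray directly; the header row and the
# per-column formats must be supplied by hand
np.savetxt(
    "numpy_out.txt", df.values, fmt="%.3f,%d",
    header=",".join(df.columns), comments=""
)
```

The speedup comes from skipping pandas' generic per-cell formatting, at the cost of maintaining the format string yourself.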
It would be helpful if you gave a sample of your data so the answer could be tested beforehand. As it is, I just hope it works without errors ;)
You should be able to use groupby with a custom function that gets applied to each group like this:
def custom_to_csv(temp_df, output_folder):
    date, tick = temp_df.name
    # Saving files
    if output_folder in [None, ""]:
        temp_df.to_csv("%s_%s.txt" % (date, tick))
    else:
        temp_df.to_csv("%s\\%s_%s.txt" % (output_folder, date, tick))

df.groupby(['Date', '#RIC']).apply(custom_to_csv, output_folder)
EDIT: Changed df to temp_df and (output_folder,) to (output_folder).