
Optimizing time in a large loop Pandas to_csv

I'm working with a 400,000-row dataframe (actually it's bigger, but for testing purposes I'm using this size).

I need to export the data to multiple txt/csv files, split by two columns: #RIC and Date.

Looping over these conditions is a really slow process, so I'm looking for a faster way to do it.

That's my original idea:

import time

import pandas as pd


def SaveTxt(df, output_folder=None):
    # Start time
    start_time = time.time()
    # Data frame with date
    df['Date'] = pd.to_datetime(df['Date-Time']).dt.date
    dates = df['Date'].unique()
    ticks = df['#RIC'].unique()

    for tick in ticks:
        for date in dates:
            # print(date, tick)
            # Filtering by instrument and date
            temp_df = df[(df['#RIC'] == tick) & (df['Date'] == date)]
            if not temp_df.empty:
                # Saving files
                if output_folder in [None, ""]:
                    temp_df.to_csv("%s_%s.txt" % (date, tick))
                else:
                    temp_df.to_csv("%s\\%s_%s.txt" % (output_folder, date, tick))

    # Elapsed time
    elapsed_time = time.time() - start_time
    elapsed_time = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
    # Printing elapsed time
    print('Elapsed time: %s' % elapsed_time)

For 400,000 rows (equivalent to 5 days of data) this script takes 3 minutes to run. For one year it takes about 6 hours, and I haven't tried 10 years, but I suppose that is not a good idea.

Solution Idea

I've tried to remove the data already used in each loop from df , but this condition is not working (my thinking was that shrinking the data frame on each iteration would make the code faster):

df = df[(df['#RIC'] != tick) & (df['Date'] != date)]

I believe this should remove every tick AND date combination from the data frame, but it's applying the conditions separately.
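
As an aside: by De Morgan's law, the complement of (this tick AND this date) is (NOT this tick) OR (NOT this date), which is why the line above drops far too many rows. A minimal fix is to negate the combined condition as a whole:

# Keep every row EXCEPT the (tick, date) slice that was just written
df = df[~((df['#RIC'] == tick) & (df['Date'] == date))]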

I'd appreciate it if you guys have a solution for this problem.

Thanks

Edit

I don't know if this is the best way to share a sample of the data (I can't upload files from behind a proxy):


DIJF21  16/10/2019  4.64    15
DIJF21  16/10/2019  4.64    40
DIJF21  16/10/2019  4.64    100
DIJF21  16/10/2019  4.64    5
DIJF21  16/10/2019  4.64    1765
DIJF21  16/10/2019  4.64    10
DIJF21  16/10/2019  4.64    100
DIJF21  16/10/2019  4.64    1000
DIJF21  16/10/2019  4.64    5
DIJF21  16/10/2019  4.64    20
DIJF21  16/10/2019  4.64    80
DIJF21  16/10/2019  4.64    25
DIJF21  16/10/2019  4.64    25
DIJF21  16/10/2019  4.64    150
DIJF20  15/10/2019  4.905   2000
DIJF20  15/10/2019  4.905   2000
DIJF20  15/10/2019  4.903   10
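
If it helps for testing, here is a sketch for loading the pasted sample into a dataframe (the column names are an assumption based on the code in the question):

import io

import pandas as pd

sample = """DIJF21  16/10/2019  4.64    15
DIJF20  15/10/2019  4.905   2000"""

# sep=r'\s+' splits on the runs of whitespace in the pasted sample
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None,
                 names=['#RIC', 'Date-Time', 'Price', 'Volume'])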

I suggest you consider coroutines: https://docs.python.org/3/library/asyncio-task.html

Something like this:

import asyncio

import pandas as pd

df['Date'] = pd.to_datetime(df['Date-Time']).dt.date
dates = df['Date'].unique()
ticks = df['#RIC'].unique()
output_folder = None  # was undefined in the original snippet; define it here


async def tick_func(tick):
    # One coroutine per tick; each writes all of that tick's dates
    for date in dates:
        temp_df = df[(df['#RIC'] == tick) & (df['Date'] == date)]
        if not temp_df.empty:
            if output_folder in [None, ""]:
                temp_df.to_csv("%s_%s.txt" % (date, tick))
            else:
                temp_df.to_csv("%s\\%s_%s.txt" % (output_folder, date, tick))


asyncio.set_event_loop(asyncio.new_event_loop())
loop = asyncio.get_event_loop()
tasks = [tick_func(tick) for tick in ticks]
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
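
Note that to_csv is a blocking call, so the coroutines above still execute the writes one after another. One workaround (an addition here, not part of the original suggestion) is to push each write onto the event loop's default thread pool via run_in_executor:

import asyncio

# Assumes the same df and dates globals as above
async def tick_func(tick):
    loop = asyncio.get_running_loop()
    for date in dates:
        temp_df = df[(df['#RIC'] == tick) & (df['Date'] == date)]
        if not temp_df.empty:
            # Run the blocking to_csv in the default ThreadPoolExecutor
            await loop.run_in_executor(
                None, temp_df.to_csv, "%s_%s.txt" % (date, tick))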

I did a quick pass through the question, and it seems the bottleneck is the doubly nested for loop you are using to group the data by tick and date.

Maybe you could consider performing the grouping in a single call to the groupby function. The code would look something like this:

grouped_df = df.groupby(['#RIC', 'Date'])

Print grouped_df to make sure it looks the way you expect. Then you can iterate over the grouped dataframe once and save each group to the filesystem (as desired).
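
A minimal sketch of that single pass, assuming the question's column names and an optional output_folder:

import os

def save_groups(df, output_folder=None):
    # groupby yields each (tick, date) key together with its slice of rows
    for (tick, date), group in df.groupby(['#RIC', 'Date']):
        filename = "%s_%s.txt" % (date, tick)
        if output_folder:
            filename = os.path.join(output_folder, filename)
        group.to_csv(filename)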

Please let me know if this works or if you face any other problems.

Edit: To follow up on @Thales' comment, there are some online resources that discuss how to save large dataframes to csv files. From these resources, I like the suggestion of using numpy.

Following is an example (taken from one of the links shared above):

# Baseline: pandas' built-in writer
aa.to_csv('pandas_to_csv', index=False)
# 6.47 s

# df2csv is a custom writer function defined in the linked resource
df2csv(aa, 'code_from_question', myformats=['%d', '%.1f', '%.1f', '%.1f'])
# 4.59 s

# numpy.savetxt with an explicit per-row format
from numpy import savetxt

savetxt(
    'numpy_savetxt', aa.values, fmt='%d,%.1f,%.1f,%.1f',
    header=','.join(aa.columns), comments=''
)
# 3.5 s
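
If you want to combine this with the groupby split above, here is a sketch under the assumption of mixed string/numeric columns (hence the generic %s format):

import numpy as np

def save_group_numpy(group, filename):
    # With mixed dtypes, group.values is an object array, so every cell
    # is written through the generic %s format
    np.savetxt(filename, group.values, fmt='%s', delimiter=',',
               header=','.join(group.columns), comments='')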

It would be helpful to have a sample of your data to test the answer against beforehand; as it is, I just hope it works without errors ;)

You should be able to use groupby with a custom function that gets applied to each group like this:

def custom_to_csv(temp_df, output_folder):
    # The group key (Date, #RIC) is available as the .name of each group
    date, tick = temp_df.name
    # Saving files
    if output_folder in [None, ""]:
        temp_df.to_csv("%s_%s.txt" % (date, tick))
    else:
        temp_df.to_csv("%s\\%s_%s.txt" % (output_folder, date, tick))

df.groupby(['Date', '#RIC']).apply(custom_to_csv, (output_folder))

EDIT: Changed df to temp_df and (output_folder,) to (output_folder)
