How to speed up del

We have a massive pandas dataframe in our code - shape is (102730344, 50). In order to free up memory, we put in a del of this dataframe once it's no longer needed. That del statement is taking 4 hours to run currently on powerful hardware. Is there a way to speed this up?

Here's the code flow:

big_data_df, small_df, medium_data, smaller_df = get_data(params)
#commented out code
del big_data_df # this takes 4 hours

So we call a function that returns 4 dataframes, one of which is the big dataframe we want to later delete. We've commented out the code between getting the dataframe and deleting it when no longer needed for testing. The del then runs, and a logging statement following that execution shows a runtime of 4 hours.

You could create the large dataframe in a subprocess, send only what you need back to the parent, and then call os._exit() in the subprocess to skip per-object cleanup. Whether this works for you depends on how large the returned data is relative to the big dataframe. In your case, the SQL query and the dataframe creation/processing could potentially be done in the subprocess. In this example I send the result over stdout, but it would also be reasonable to save it to a temporary file instead. I'm using pickle, but other serializers such as pyarrow may be faster.

...and it may not work in your case at all.

dfuser.py

import sys
import subprocess as subp
import pandas as pd

try:
    proc = subp.Popen([sys.executable, 'dfprocessor.py'], stdin=subp.PIPE, stdout=subp.PIPE, stderr=None)
    df = pd.read_pickle(proc.stdout, compression=None)
    print("got df")
    proc.stdin.write(b"thanks\n")
    proc.stdin.close()
    proc.wait()
    print(df)
finally:
    print('parent done')

dfprocessor.py

import pandas as pd
import sys
import os

try:
    # add your df creation and processing here
    # (placeholder data; pd.util.testing.makeDataFrame() is gone from recent pandas)
    df = pd.DataFrame({"a": range(30), "b": range(30)})
    small_df = df # your processing makes it smaller
    # send
    small_df.to_pickle(sys.stdout.buffer, compression=None)
    sys.stdout.close()
    # make sure received
    sys.stdin.read(1)
finally:
    # exit without deleting df to save time
    sys.stderr.write("out of here\n")
    os._exit(0)
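
The temporary-file variant mentioned above could look roughly like the sketch below. It is only an illustration under some assumptions: it uses plain Python objects and the standard library instead of pandas, the names CHILD_CODE and run_child are made up for this example, and the list `big` merely stands in for the large dataframe. The child writes the small result to a file whose path it receives as an argument, then leaves via os._exit() so the large object is never torn down piece by piece. Note that os._exit() does not flush buffered output, so the file has to be closed before calling it.

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Hypothetical child program: builds a large object and keeps only a summary.
CHILD_CODE = r"""
import os, pickle, sys

big = list(range(1_000_000))                    # stand-in for the large dataframe
small = {"rows": len(big), "total": sum(big)}   # stand-in for the reduced result

# Leaving the with-block closes the file, so the data is written out
# before os._exit(), which does not flush buffers itself.
with open(sys.argv[1], "wb") as fh:
    pickle.dump(small, fh)

os._exit(0)  # exit immediately, skipping normal cleanup of `big`
"""


def run_child():
    """Run the child, then load the small result it saved to a temp file."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)  # the child reopens the file by path
    try:
        subprocess.run([sys.executable, "-c", CHILD_CODE, path], check=True)
        with open(path, "rb") as fh:
            return pickle.load(fh)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(run_child())
```

Because the parent only reads the file after the child has exited, this version does not need the stdin handshake used in the stdout example.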
