
Memory leak when reading value from a Pandas Dataframe

It seems that pandas triggers a memory leak when iteratively copying a value from a dataframe.

At the beginning of each iteration, a dataframe is created by making a copy from the initial dataframe. A second variable is created by copying a single value from the current dataframe.

At the end of each iteration, these two variables are deleted, and the memory used by the current process is printed every 1000 iterations. The memory usage keeps increasing!

I think there might be some implicit copy at some point (probably when reading the dataframe value).

A quick fix is to run the garbage collector at each iteration, but this is quite an expensive solution: the process is at least 10 times slower.

Is there a clear explanation of why this problem occurs?

import os, gc
import psutil, pandas as pd

N_ITER = 100000
DF_SIZE = 10000

# Define the DataFrame
df = pd.DataFrame(index=range(DF_SIZE), columns=['my_col'])
df['my_col'] = range(DF_SIZE)


def memory_usage():
    """Return the memory usage of the current python process."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2


if __name__ == '__main__':

    for i in range(N_ITER):
        df_ind = pd.DataFrame(df.copy())
        val = df_ind.at[4242, 'my_col']  # The line that provokes the leak!

        del df_ind, val  # Useless
        # gc.collect()  # Garbage Collector prevents the leak but is slow

        if (i % 1000) == 0:
            print('Iter {}\t {} MB'.format(i, int(memory_usage())))

Ok, it seems that the actual pain comes from the way df_ind is created.

Using references to the original dataframe df seems to work, but might be a little bit risky if we intend to modify df_ind.

Using copies of the original dataframe df triggers a memory leak. There might be some implicit copies of unneeded elements from df. These copied elements are not released by del, but they are released by gc.collect(), which comes at a time cost.
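To illustrate why del alone might not be enough, here is a minimal sketch that assumes the leftover objects sit in reference cycles (the post does not confirm this is what pandas does internally). The Node class and make_cycle function are hypothetical names made up for the demonstration; only the gc calls come from the standard library.

```python
import gc


class Node:
    """Toy object that can point at another Node (hypothetical, for illustration)."""

    def __init__(self):
        self.other = None


def make_cycle():
    """Create two objects that reference each other, then drop their names."""
    a, b = Node(), Node()
    a.other, b.other = b, a  # a -> b -> a: a reference cycle
    del a, b  # the names are gone, but each object still keeps the other alive


gc.disable()   # keep the automatic collector out of the way for the demo
gc.collect()   # start from a clean state
make_cycle()
unreachable = gc.collect()  # number of unreachable objects the collector found
gc.enable()

print(unreachable)
```

Reference counting cannot free objects that point at each other, so they stay in memory until the cyclic collector runs, either automatically or through an explicit gc.collect().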

Here are listed different attempts to solve this memory leak and their results:

import copy

df_ind = df                    # Works! Dangerous since df could be modified

df_ind = copy.copy(df)         # Works! Equivalent to df_ind = df
df_ind = copy.deepcopy(df)     # Fails.

df_ind = df.copy(deep=False)   # Works! Equivalent to df_ind = df
df_ind = df.copy(deep=True)    # Fails.
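The difference between the shallow and deep variants can be checked directly. The sketch below is only an illustration on a small hypothetical dataframe. It assumes a single integer column, so that to_numpy() returns a view rather than a copy, and uses np.shares_memory to test whether two copies point at the same buffer.

```python
import numpy as np
import pandas as pd

# Small stand-in for the dataframe from the question
df = pd.DataFrame({'my_col': range(5)})

shallow = df.copy(deep=False)  # new DataFrame object, data shared with df
deep = df.copy(deep=True)      # new DataFrame object, data duplicated

base = df['my_col'].to_numpy()
shares_shallow = np.shares_memory(base, shallow['my_col'].to_numpy())
shares_deep = np.shares_memory(base, deep['my_col'].to_numpy())

print(shallow is df, shares_shallow, shares_deep)
```

The shallow copy is a distinct object from df but is expected to reuse the same underlying buffer, while the deep copy allocates a new one at every iteration.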

To sum up:

  • If you want to modify the temp dataframe, then don't use pandas. You can use dictionaries or zipped lists to get what you want.

  • If you don't want to modify the temp dataframe, then use pandas with the explicit option df_ind = df.copy(deep=False).
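As a rough sketch of the first option, the single-column dataframe can be replaced by a plain dictionary mapping index to value. The helper name read_from_temp_copy is hypothetical. dict(source) makes a shallow copy, which is enough here because the values are immutable integers.

```python
DF_SIZE = 10000

# Plain-dict stand-in for the single-column dataframe (index -> value)
data = {i: i for i in range(DF_SIZE)}


def read_from_temp_copy(source, key):
    """Copy the mapping, modify the copy, and return the modified value."""
    temp = dict(source)  # independent copy of the mapping; safe to modify
    temp[key] += 1       # changing the temp copy leaves `source` untouched
    return temp[key]


result = read_from_temp_copy(data, 4242)
print(result, data[4242])  # prints "4243 4242"
```

The temporary dictionary holds no reference cycles, so it is freed as soon as the function returns, without any call to gc.collect().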
