
Memory leak in pandas when dropping dataframe column?

I have some code like the following

df = .....  # load a very large dataframe
good_columns = set(['a','b',........])  # set of "good" columns we want to keep
columns = list(df.columns.values)
for col in columns:
    if col not in good_columns:
        df = df.drop(col, axis=1)

The odd thing is that it successfully drops the first column that is not good - so it isn't an issue where I am holding the old and new dataframe in memory at the same time and running out of space. It breaks on the second column being dropped (MemoryError). This makes me suspect there is some kind of memory leak. How would I prevent this error from happening?

It may be that you're constantly returning a new and very large dataframe. Try setting drop's inplace parameter to True.
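A minimal sketch of that suggestion, collecting all unwanted columns and dropping them in a single inplace call so no intermediate copy is returned on every loop iteration (the file name and column names are placeholders, not from the original question):

import pandas as pd

df = pd.read_csv("very_large_file.csv")  # hypothetical source of the large dataframe
good_columns = {'a', 'b'}                # columns to keep, as in the question

# Drop every unwanted column in one call; inplace=True mutates df
# instead of building and returning a new dataframe each time.
to_drop = [col for col in df.columns if col not in good_columns]
df.drop(columns=to_drop, inplace=True)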

Make use of the usecols argument while reading the large dataframe, so that only the columns you want are loaded in the first place instead of being dropped later. Check here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_csv.html
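If the data comes from a CSV file, a sketch of this approach might look like the following (the file name is hypothetical; usecols tells the parser to materialize only the listed columns, so the unwanted ones never take up memory):

import pandas as pd

good_columns = ['a', 'b']  # columns to keep, as in the question

# Only the listed columns are parsed and loaded into memory,
# so nothing has to be dropped afterwards.
df = pd.read_csv("very_large_file.csv", usecols=good_columns)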

I tried the inplace=True argument but still had the same issue. There is another solution that deals with the memory limit imposed by your Python architecture; that is what helped me when I ran into this same problem.
