
Outer merge on large pandas DataFrames causes MemoryError---how to do “big data” merges with pandas?

I have two pandas DataFrames df1 and df2 with a fairly standard format:

   one  two  three   feature
A    1    2      3   feature1
B    4    5      6   feature2  
C    7    8      9   feature3   
D    10   11     12  feature4
E    13   14     15  feature5 
F    16   17     18  feature6 
...

And df2 has the same format. The two DataFrames are around 175 MB and 140 MB in size. When I try to merge them with:

merged_df = pd.merge(df1, df2, on='feature', how='outer', suffixes=('','_features'))

I get the following MemoryError:

File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
    sort=self.sort, how=self.how) 
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 187, in pandas.algos.full_outer_join (pandas/algos.c:61680)
File "pandas/src/join.pyx", line 196, in pandas.algos._get_result_indexer (pandas/algos.c:61978)
MemoryError

Is it possible there is a "size limit" for pandas dataframes when merging? I am surprised that this wouldn't work. Maybe this is a bug in a certain version of pandas?

EDIT: As mentioned in the comments, many duplicates in the merge column can easily cause RAM issues. See: Python Pandas Merge Causing Memory Overflow

The question now is: how can we do this merge? It seems the best approach would be to partition the DataFrames somehow.
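As a rough sanity check before attempting the full merge, you can estimate how many rows the result will have from the per-key counts (a sketch using only pandas and the df1/df2 from above; it counts matching keys only, and in an outer merge each unmatched key adds one more row):

# each matching key contributes (rows in df1 with that key) * (rows in df2 with that key)
left_counts = df1['feature'].value_counts()
right_counts = df2['feature'].value_counts()
estimated_rows = (left_counts * right_counts).sum()
print(f"estimated rows from matching keys: {estimated_rows:,.0f}")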

You can try to first filter df1 by its unique feature values, merge each chunk with df2, and finally concat the outputs.

If you need only the outer join, I think there may still be a memory problem. But if you add some code to filter the output of each loop, it can work.

dfs = []
# merge one chunk of df1 (one feature value) at a time,
# so each intermediate result stays small
for val in df1.feature.unique():
    merged = pd.merge(df1[df1.feature == val], df2, on='feature',
                      how='outer', suffixes=('', '_key'))
    # optionally filter each chunk before keeping it, e.g.
    # http://stackoverflow.com/a/39786538/2901002
    # merged = merged[(merged.start <= merged.start_key) & (merged.end <= merged.end_key)]
    dfs.append(merged)

df = pd.concat(dfs, ignore_index=True)
print(df)

Another solution is to use dask.dataframe, which performs the merge out of core (dask.dataframe.DataFrame.merge).
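For reference, a minimal sketch of the dask route (assuming dask is installed; npartitions=8 is an arbitrary choice to tune to your memory budget):

import dask.dataframe as dd

# wrap the existing pandas DataFrames in partitioned dask DataFrames
ddf1 = dd.from_pandas(df1, npartitions=8)
ddf2 = dd.from_pandas(df2, npartitions=8)

# same merge arguments as the pandas call in the question
merged = dd.merge(ddf1, ddf2, on='feature', how='outer',
                  suffixes=('', '_features'))

# compute() brings the result back as a single pandas DataFrame;
# if that is still too large, write it out instead, e.g. merged.to_parquet('out')
merged_df = merged.compute()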

Try specifying a data type for the numeric columns to reduce the size of the existing data frames, such as:

import numpy as np

df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(np.int32)

This should reduce the memory usage significantly and will hopefully let you perform the merge.
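If you want to verify the savings, a quick sketch using pandas' memory_usage (column names taken from the question's example; deep=True counts the actual bytes, including object columns):

import numpy as np

# total footprint in MB before and after downcasting
print(df1.memory_usage(deep=True).sum() / 1024 ** 2)
df1[['one', 'two', 'three']] = df1[['one', 'two', 'three']].astype(np.int32)
print(df1.memory_usage(deep=True).sum() / 1024 ** 2)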
