I have two pandas DataFrames, df1 and df2, with a fairly standard format:
one two three feature
A 1 2 3 feature1
B 4 5 6 feature2
C 7 8 9 feature3
D 10 11 12 feature4
E 13 14 15 feature5
F 16 17 18 feature6
...
df2 has the same format. The sizes of these DataFrames are around 175 MB and 140 MB. When I run:
merged_df = pd.merge(df1, df2, on='feature', how='outer', suffixes=('','_features'))
I get the following MemoryError:
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
sort=self.sort, how=self.how)
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 187, in pandas.algos.full_outer_join (pandas/algos.c:61680)
File "pandas/src/join.pyx", line 196, in pandas.algos._get_result_indexer (pandas/algos.c:61978)
MemoryError
Is it possible there is a "size limit" for pandas dataframes when merging? I am surprised that this wouldn't work. Maybe this is a bug in a certain version of pandas?
EDIT: As mentioned in the comments, many duplicates in the merge column can easily cause RAM issues. See: Python Pandas Merge Causing Memory Overflow
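To see why duplicates matter: an outer merge pairs every left row with every matching right row, so duplicate keys multiply. A minimal sketch with toy frames (not the real data):

```python
import pandas as pd

# Toy frames where the key 'feature1' appears 3 times on each side.
left = pd.DataFrame({'feature': ['feature1'] * 3, 'one': [1, 2, 3]})
right = pd.DataFrame({'feature': ['feature1'] * 3, 'four': [4, 5, 6]})

# Each of the 3 left rows matches all 3 right rows: 3 x 3 = 9 output rows.
merged = pd.merge(left, right, on='feature', how='outer')
print(len(merged))  # 9
```

With millions of rows and heavily duplicated keys, this quadratic growth alone can exhaust RAM even though the inputs are only a few hundred MB.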
The question now is, how can we do this merge? It seems the best way would be to partition the dataframe, somehow.
You can try first filtering df1 by its unique feature values, merging each subset, and finally concatenating the outputs.
If you need only the outer join, I think there will still be a memory problem. But if you add some code to filter the output of each loop iteration, it can work:
import pandas as pd

dfs = []
for val in df.feature.unique():
    # merge only the rows of df that share this feature value
    df1 = pd.merge(df[df.feature == val], df2, on='feature', how='outer', suffixes=('', '_key'))
    # http://stackoverflow.com/a/39786538/2901002
    # df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
    print(df1)
    dfs.append(df1)

df = pd.concat(dfs, ignore_index=True)
print(df)
Another solution is to use dask.dataframe.DataFrame.merge.
Try specifying a data type for the numeric columns to reduce the size of the existing data frames, such as:
df[['one','two', 'three']] = df[['one','two', 'three']].astype(np.int32)
This should reduce memory usage significantly and will hopefully let you perform the merge.
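You can verify the saving with DataFrame.memory_usage; a quick illustration on a made-up frame (column names borrowed from the question):

```python
import numpy as np
import pandas as pd

# Toy frame: three int64 columns of 1000 rows each.
df = pd.DataFrame({'one': range(1000), 'two': range(1000), 'three': range(1000)})

before = df.memory_usage(deep=True).sum()
# int64 -> int32 halves the storage of these columns.
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(np.int32)
after = df.memory_usage(deep=True).sum()
print(before, after)
```

The same check on the real 175 MB / 140 MB frames tells you whether downcasting buys enough headroom before retrying the merge.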