Merging two large dataframes in Pandas
How should I merge the label column in dataframe df (8 million rows) into another dataframe df2 (143 million rows) when the data size is that large?
Basically I just want to map the label column onto df2; all the data in df is included in df2 except the label column. Is there any way I can solve this issue instead of using merge()?
Tried to run the code below, but it keeps running for 5 hours with no response.
result = pd.merge(df,df2,on=["X", "Y", "Z"], how='left')
result
[screenshot of df]
[screenshot of df2]
There are a few obvious things I can see here that you can do:

Assuming you only want to add the label column based on the X / Y / Z columns and the R / G / B columns are superfluous, drop the R / G / B columns of df, as you don't need them in the final data frame, and you certainly don't need them being duplicated 143 million times.

Depending on the number of unique values of X / Y / Z and their data type, you may be able to reduce the memory footprint by using categorical data types like so:
# Convert to categorical data types (if every value is unique, don't bother!)
for df_temp in [df, df2]:
for col in ['X', 'Y', 'Z']:
df_temp.loc[:, col] = df_temp[col].astype('category')
# Merge using less memory
result = pd.merge(df, df2, on=["X", "Y", "Z"], how='left')
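To make the categorical idea concrete, here is a minimal runnable sketch on tiny stand-in frames (the column values below are made up for illustration; note that df2 is placed on the left here so every df2 row keeps its position and picks up its label):

```python
import pandas as pd

# Tiny stand-in frames; the real df/df2 have 8M and 143M rows
df = pd.DataFrame({'X': [1, 1, 2], 'Y': [10, 20, 10],
                   'Z': [5, 5, 6], 'label': ['a', 'b', 'c']})
df2 = pd.DataFrame({'X': [1, 2, 1], 'Y': [10, 10, 20],
                    'Z': [5, 6, 5]})

# Convert the join keys to categoricals; with few unique values this
# stores small integer codes instead of the full values
for df_temp in [df, df2]:
    for col in ['X', 'Y', 'Z']:
        df_temp[col] = df_temp[col].astype('category')

result = pd.merge(df2, df, on=['X', 'Y', 'Z'], how='left')
print(result['label'].tolist())  # ['a', 'c', 'b']
```

The saving only materializes when the number of unique key values is much smaller than the number of rows; for nearly-unique keys, categoricals can actually cost more memory.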
Finally, you could try partitioning the data and transforming it destructively: create multiple data frames, each containing X in non-overlapping ranges, process them individually, then concatenate the individual results to give you the final result, eg:
result_dfs = []
ranges = [0, 1000, 2000, 3000, 4000, ...]
for start, end in zip(ranges[:-1], ranges[1:]):
    df_idx = (df['X'] >= start) & (df['X'] < end)
    df2_idx = (df2['X'] >= start) & (df2['X'] < end)
    result_dfs.append(
        pd.merge(
            df[df_idx],
            df2[df2_idx],
            on=["X", "Y", "Z"],
            how='left'
        )
    )
    # Remove the original data to reduce memory consumption
    df2 = df2[~df2_idx]
result = pd.concat(result_dfs)
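As a sanity check that partitioning gives the same answer as a single merge, here is a small runnable sketch with toy data and a fixed ranges list (all values below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the real 8M/143M-row frames
df = pd.DataFrame({'X': [100, 1500, 2500], 'Y': [1, 2, 3],
                   'Z': [7, 8, 9], 'label': ['a', 'b', 'c']})
df2 = pd.DataFrame({'X': [100, 100, 1500, 2500],
                    'Y': [1, 1, 2, 3], 'Z': [7, 7, 8, 9]})

# Reference answer: one big merge
expected = pd.merge(df, df2, on=['X', 'Y', 'Z'], how='left')

result_dfs = []
ranges = [0, 1000, 2000, 3000]  # non-overlapping X partitions
for start, end in zip(ranges[:-1], ranges[1:]):
    df_idx = (df['X'] >= start) & (df['X'] < end)
    df2_idx = (df2['X'] >= start) & (df2['X'] < end)
    result_dfs.append(
        pd.merge(df[df_idx], df2[df2_idx], on=['X', 'Y', 'Z'], how='left')
    )
    df2 = df2[~df2_idx]  # shrink df2 as each partition is consumed

result = pd.concat(result_dfs, ignore_index=True)
print(result['label'].tolist())  # ['a', 'a', 'b', 'c']
```

Note the ranges must cover every value of X, otherwise rows falling outside all partitions are silently dropped from the result.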
This may still not work though, as you still need the full data set in memory twice for a short while when you do the final concatenation!
If none of these work, I'm afraid you need more memory, or you need to use something other than Pandas to solve your problem.