
Merging two large dataframes in Pandas

How should I merge the label column in dataframe df (8 million rows) into another dataframe df2 (143 million rows) when the data is that large?

Basically I just want to map the label column onto df2; all the data in df is included in df2 except the label column. Is there any way I can solve this without using merge()?

I tried to run the code below, but it has been running for 5 hours with no result.

import pandas as pd

result = pd.merge(df, df2, on=["X", "Y", "Z"], how='left')
result

df

[screenshot of df]

df2

[screenshot of df2]

There are a few obvious things I can see here that you can do:

  1. Assuming you just want to add the label based on the X / Y / Z columns and R / G / B are superfluous, drop the R / G / B columns of df: you don't need them in the final data frame, and you certainly don't need them duplicated 143 million times. See the sketch right after this item.
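A minimal sketch of that step (assuming df carries the R / G / B columns shown in the screenshot):

# Keep only the join keys and the label; R / G / B are not needed downstream
df = df.drop(columns=['R', 'G', 'B'])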
  2. Depending on how many unique values X / Y / Z have and their data type, you may be able to reduce the memory footprint by using categorical data types, like so:
# Convert the join keys to categorical data types
# (if every value is unique, don't bother!)
for df_temp in [df, df2]:
    for col in ['X', 'Y', 'Z']:
        df_temp[col] = df_temp[col].astype('category')

# Merge using less memory
result = pd.merge(df, df2, on=["X", "Y", "Z"], how='left')
  3. Finally, you can try partitioning the data and doing a destructive conversion: create several data frames, each covering a non-overlapping range of X, process them individually, then concatenate the individual results to give you the final result, e.g.:
result_dfs = []
ranges = [0, 1000, 2000, 3000, 4000, ...]  # boundaries must cover the full range of X
for start, end in zip(ranges[:-1], ranges[1:]):
    df_idx = (df['X'] >= start) & (df['X'] < end)
    df2_idx = (df2['X'] >= start) & (df2['X'] < end)
    result_dfs.append(
        pd.merge(
            df[df_idx], 
            df2[df2_idx], 
            on=["X", "Y", "Z"], 
            how='left'
        )
    )
    # Drop the already-processed rows of df2 to reduce memory consumption
    df2 = df2[~df2_idx]
result = pd.concat(result_dfs)

This may still not work, though, as you still need the full data set in memory twice for a short while when you do the final concatenation!
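If that final concatenation is the part that breaks, a variation is to stream each partial result to disk inside the loop instead of accumulating result_dfs in memory. A sketch only, reusing df_idx / df2_idx from the loop above and writing to a hypothetical result.csv:

import os
import pandas as pd

# Inside the partition loop, replace result_dfs.append(...) with:
partial = pd.merge(df[df_idx], df2[df2_idx], on=["X", "Y", "Z"], how='left')
# Append to a CSV on disk, writing the header only on the first pass
partial.to_csv('result.csv', mode='a', index=False,
               header=not os.path.exists('result.csv'))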

If none of these work, I'm afraid you need more memory, or you need to use something other than Pandas to solve your problem.
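For instance, Dask's dataframe module offers a pandas-like merge that is evaluated lazily, partition by partition, rather than on the whole frame at once. A minimal sketch, assuming Dask is installed and the partition counts below are just placeholders to tune:

import dask.dataframe as dd

# Wrap the existing pandas frames in partitioned Dask dataframes
ddf = dd.from_pandas(df, npartitions=16)
ddf2 = dd.from_pandas(df2, npartitions=256)

# Same signature as pd.merge; compute() materialises the result
result = dd.merge(ddf, ddf2, on=["X", "Y", "Z"], how='left').compute()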
