PySpark: join dataframes on multiple columns

Question

I'm using the code below to join two dataframes and drop the duplicated columns. However, I get the error AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans...Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;

My df1 has 15 columns and my df2 has 50+ columns. How can I join on multiple columns without hardcoding the columns to join on?

def join(dataset_standardFalse, dataset,  how='left'):
    final_df = dataset_standardFalse.join(dataset,  how=how)
    repeated_columns = [c for c in dataset_standardFalse.columns if c in dataset.columns]
    for col in repeated_columns:
        final_df = final_df.drop(dataset[col])
    return final_df

As a specific example, when I compare the columns of the dataframes, they have multiple columns in common. Can I join on the list of cols? I need to avoid hard-coding names since the cols would vary by case.

cols = set(dataset_standardFalse.columns) & (set(dataset.columns))
print(cols)

Answer 1

IIUC, you can join on multiple columns directly if they are present in both dataframes.

# This gives you the list of columns common to both dataframes
cols = list(set(dataset_standardFalse.columns) & set(dataset.columns))

# Modify your function to pass that list of columns as the join keys
def join(dataset_standardFalse, dataset, how='left'):
    cols = list(set(dataset_standardFalse.columns) & set(dataset.columns))
    # Joining on a list of column names keeps a single copy of each key column,
    # so the repeated columns no longer need to be dropped afterwards
    final_df = dataset_standardFalse.join(dataset, cols, how=how)
    return final_df
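One caveat with the set intersection used above: a Python set has no defined order, so the key columns may come out in a different order from one run to the next. A minimal sketch that keeps the left dataframe's column order and fails early when nothing is shared follows. The helper names common_columns and join_on_common are hypothetical, not part of PySpark; only DataFrame.columns and DataFrame.join are real API calls.

```python
def common_columns(left_cols, right_cols):
    """Names present in both lists, in the order they appear in left_cols."""
    right_set = set(right_cols)
    return [c for c in left_cols if c in right_set]


def join_on_common(left, right, how="left"):
    """Join two Spark DataFrames on every column name they share."""
    cols = common_columns(left.columns, right.columns)
    if not cols:
        # Without any shared names there are no keys to join on; fail early
        raise ValueError("the dataframes have no column names in common")
    # A list of names performs an equi-join on those columns
    return left.join(right, on=cols, how=how)


# Plain lists standing in for df1.columns and df2.columns
print(common_columns(["id", "date", "amount"], ["date", "id", "region"]))
# ['id', 'date']
```

With real dataframes the call would look like join_on_common(dataset_standardFalse, dataset, how='left').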

When you pass a list of column names as the join condition, every column in the list must be present in both dataframes. If a column is not present under the same name in both, either rename it in a preprocessing step or create the join condition dynamically.
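The renaming route could be sketched like this, assuming you can supply a mapping from the right dataframe's key names to the left dataframe's. The function name rename_then_join and the key_map parameter are hypothetical; withColumnRenamed and join are the only DataFrame methods used.

```python
def rename_then_join(left, right, key_map, how="left"):
    """Rename the right-hand key columns to match the left, then join on names.

    key_map is a hypothetical dict of {name_in_right: name_in_left}.
    """
    for right_name, left_name in key_map.items():
        # withColumnRenamed returns a new DataFrame; the input is not modified
        right = right.withColumnRenamed(right_name, left_name)
    # After renaming, the keys exist under the same names on both sides
    return left.join(right, on=list(key_map.values()), how=how)
```

For example, rename_then_join(df1, df2, {"cust_id": "customer_id"}) would join on customer_id, where both column names are made up for illustration.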

For dynamic column names use this:

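# columnDf1 and columnDf2 are the lists of column names to pair up, one list per dataframe.
# Referencing the columns through df1[...] and df2[...] keeps each side unambiguous.
df = df1.join(df2, [df1[c1] == df2[c2] for c1, c2 in zip(columnDf1, columnDf2)], how='left')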
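Unlike a join on a list of names, a join on expressions keeps both key columns from every pair in the result, so the right-hand copies still have to be dropped if you want a single key. A minimal sketch, assuming the pairs are given as (left_name, right_name) tuples; the function name join_on_pairs and the pairs parameter are hypothetical.

```python
def join_on_pairs(left, right, pairs, how="left"):
    """Join on explicitly paired columns, then drop the right-hand keys.

    pairs is a hypothetical list of (left_name, right_name) tuples.
    """
    # One equality per pair; a list of conditions is combined with AND by join()
    conditions = [left[l_name] == right[r_name] for l_name, r_name in pairs]
    joined = left.join(right, on=conditions, how=how)
    for _, r_name in pairs:
        # Drop by column reference so a same-named left column is kept
        joined = joined.drop(right[r_name])
    return joined
```

Dropping the right-hand keys suits left and inner joins. For a right or full outer join the left-hand keys can be null on unmatched rows, so you may prefer to keep both sides there.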

Answer 1: accepted, 2020-06-08 06:07:00
