[英]Ambiguous columns error in pyspark while iteratively joining dataframes
I am currently writing code that iteratively left-joins two dataframes several times, using a different pair of corresponding columns from the two dataframes on each iteration. It works fine for the first iteration, but on the second iteration I get an ambiguous-columns error.
Here is the sample dataframe I am working with:
sample_data = [("Amit","","Gupta","36678","M",4000),
("Anita","Mathews","","40299","F",5000),
("Ram","","Aggarwal","42124","M",5000),
("Pooja","Anne","Goel","39298","F",5000),
("Geeta","Banuwala","Brown","12345","F",-2)
]
sample_schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("id", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True)
])
df1 = spark.createDataFrame(data = sample_data, schema = sample_schema)
sample_data = [("Amit", "ABC","MTS","36678",10),
("Ani", "DEF","CS","40299",200),
("Ram", "ABC","MTS","421",40),
("Pooja", "DEF","CS","39298",50),
("Geeta", "ABC","MTS","12345",-20)
]
sample_schema = StructType([
StructField("firstname",StringType(),True),
StructField("Company",StringType(),True),
StructField("position",StringType(),True),
StructField("id", StringType(), True),
StructField("points", IntegerType(), True)
])
df2 = spark.createDataFrame(data = sample_data, schema = sample_schema)
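These snippets assume an active SparkSession named spark; if you are running this standalone, a minimal setup might be:

from pyspark.sql import SparkSession

# Any existing SparkSession works; this just creates one for a local run
spark = SparkSession.builder.appName("join-example").getOrCreate()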
The code I am using for this is:
from pyspark.sql.functions import col, lit

def joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep):
    resultant_df = None
    df1_cols = df1.columns
    # Flag df2 rows so matched and unmatched rows can be told apart after the left join
    df2 = df2.withColumn("flag", lit(True))

    for i in range(len(cols_to_join)):
        # Join on the column pair for this iteration
        joined_df = df1.join(df2, [(df1[col_1] == df2[col_2]) for col_1, col_2 in cols_to_join[i].items()], 'left')
        joined_df = joined_df.select(*[df1[column] if column in cols_df1_to_keep else df2[column] for column in cols_df1_to_keep + cols_df2_to_keep])
        # Rows with a NULL flag did not match; retry them in the next iteration
        df1 = (joined_df
               .filter("flag is NULL")
               .select(df1_cols)
               )
        resultant_df = (joined_df.filter(col("flag") == True) if i == 0
                        else resultant_df.filter(col("flag") == True).union(resultant_df)
                        )
    return resultant_df
cols_to_join = [{"id": "id"}, {"firstname":"firstname"}]
cols_df1_to_keep = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
cols_df2_to_keep = ["company", "position", "points"]
x = joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep)
This code works fine if I execute it for a single pass, but on the second iteration, when the remaining rows that were not joined on column "id" in the first iteration are joined again on column "firstname", it throws the following error:
Column position#29518, Company#29517, points#29520 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
Here is an example of how you can do an OR conditional join. Because everything happens in a single join, df1 and df2 each appear only once in the query plan, so the column references stay unambiguous.
df1.join(df2, on=(df1.id == df2.id) | (df1.firstname == df2.firstname), how='left')
To make the condition dynamic, you can chain the conditions with reduce.
from functools import reduce
from pyspark.sql import functions as F

def chain_join_cond(prev, value):
    (lcol, rcol) = list(value.items())[0]
    return prev | (df1[lcol] == df2[rcol])

# If your condition is OR, use False for the initial condition.
# If your condition is AND, use True for the initial condition
# (and use & to chain the conditions; see the sketch below).
cond = reduce(chain_join_cond, cols_to_join, F.lit(False))

# Use cond for the `on` option of join:
# df1.join(df2, on=cond, how='left')
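For reference, the AND variant mentioned in the comment above could look like this (a minimal sketch, assuming the same df1, df2 and cols_to_join as in the question):

from functools import reduce
from pyspark.sql import functions as F

def chain_join_cond_and(prev, value):
    (lcol, rcol) = list(value.items())[0]
    # Combine with & and start from lit(True), the identity for AND
    return prev & (df1[lcol] == df2[rcol])

cond_and = reduce(chain_join_cond_and, cols_to_join, F.lit(True))
# df1.join(df2, on=cond_and, how='left')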
Then, to take a specific set of columns from df1 or df2, generate the select list with a list comprehension.
df = (df1.join(df2, on=cond, how='left')
.select(*[df1[c] for c in cols_df1_to_keep], *[df2[c] for c in cols_df2_to_keep]))
If you have cols_to_join as a list of tuples instead of dicts, you can simplify the code a bit.
cols_to_join = [("id", "id"), ("firstname", "firstname")]
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))
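Putting it all together, a minimal end-to-end sketch (assuming an active SparkSession and the df1 / df2 defined in the question):

from functools import reduce
from pyspark.sql import functions as F

cols_to_join = [("id", "id"), ("firstname", "firstname")]
cols_df1_to_keep = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
cols_df2_to_keep = ["company", "position", "points"]

# Build one OR condition over all column pairs, then join a single time
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))
result = (df1.join(df2, on=cond, how='left')
          .select(*[df1[c] for c in cols_df1_to_keep], *[df2[c] for c in cols_df2_to_keep]))
result.show()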