[英]Ambiguous columns error in pyspark while iteratively joining dataframes
I am currently writing code that iteratively left-joins two dataframes several times, using a different pair of corresponding columns from the two dataframes on each iteration. It works fine for the first iteration, but on the second iteration I get an ambiguous-columns error.
Here is the sample dataframe I am working with:
sample_data = [("Amit","","Gupta","36678","M",4000),
("Anita","Mathews","","40299","F",5000),
("Ram","","Aggarwal","42124","M",5000),
("Pooja","Anne","Goel","39298","F",5000),
("Geeta","Banuwala","Brown","12345","F",-2)
]
sample_schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("id", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True)
])
df1 = spark.createDataFrame(data = sample_data, schema = sample_schema)
sample_data = [("Amit", "ABC","MTS","36678",10),
("Ani", "DEF","CS","40299",200),
("Ram", "ABC","MTS","421",40),
("Pooja", "DEF","CS","39298",50),
("Geeta", "ABC","MTS","12345",-20)
]
sample_schema = StructType([
StructField("firstname",StringType(),True),
StructField("Company",StringType(),True),
StructField("position",StringType(),True),
StructField("id", StringType(), True),
StructField("points", IntegerType(), True)
])
df2 = spark.createDataFrame(data = sample_data, schema = sample_schema)
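These snippets assume an active SparkSession named spark; if you are running this standalone, a minimal setup might be:

from pyspark.sql import SparkSession

# Any existing SparkSession works; this just creates one for a local run
spark = SparkSession.builder.appName("join-example").getOrCreate()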
The code I am using for this is:
from pyspark.sql.functions import col, lit

def joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep):
    resultant_df = None
    df1_cols = df1.columns
    # Flag df2 rows so matched and unmatched rows can be told apart after the left join
    df2 = df2.withColumn("flag", lit(True))

    for i in range(len(cols_to_join)):
        # Join on the column pair for this iteration
        joined_df = df1.join(df2, [(df1[col_1] == df2[col_2]) for col_1, col_2 in cols_to_join[i].items()], 'left')
        joined_df = joined_df.select(*[df1[column] if column in cols_df1_to_keep else df2[column] for column in cols_df1_to_keep + cols_df2_to_keep])
        # Rows with a NULL flag did not match; retry them in the next iteration
        df1 = (joined_df
               .filter("flag is NULL")
               .select(df1_cols)
               )
        resultant_df = (joined_df.filter(col("flag") == True) if i == 0
                        else resultant_df.filter(col("flag") == True).union(resultant_df)
                        )
    return resultant_df
cols_to_join = [{"id": "id"}, {"firstname":"firstname"}]
cols_df1_to_keep = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
cols_df2_to_keep = ["company", "position", "points"]
x = joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep)
This code works fine if I execute it for a single pass, but on the second iteration, when the remaining rows that were not joined on column "id" in the first iteration are joined again on column "firstname", it throws the following error:
Column position#29518, Company#29517, points#29520 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
Here is an example of how you can do an OR conditional join. Because everything happens in a single join, df1 and df2 each appear only once in the query plan, so the column references stay unambiguous.
df1.join(df2, on=(df1.id == df2.id) | (df1.firstname == df2.firstname), how='left')
To make the condition dynamic, you can chain the conditions with reduce.
from functools import reduce
from pyspark.sql import functions as F

def chain_join_cond(prev, value):
    (lcol, rcol) = list(value.items())[0]
    return prev | (df1[lcol] == df2[rcol])

# If your condition is OR, use False for the initial condition.
# If your condition is AND, use True for the initial condition
# (and use & to chain the conditions; see the sketch below).
cond = reduce(chain_join_cond, cols_to_join, F.lit(False))

# Use cond for the `on` option of join:
# df1.join(df2, on=cond, how='left')
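For reference, the AND variant mentioned in the comment above could look like this (a minimal sketch, assuming the same df1, df2 and cols_to_join as in the question):

from functools import reduce
from pyspark.sql import functions as F

def chain_join_cond_and(prev, value):
    (lcol, rcol) = list(value.items())[0]
    # Combine with & and start from lit(True), the identity for AND
    return prev & (df1[lcol] == df2[rcol])

cond_and = reduce(chain_join_cond_and, cols_to_join, F.lit(True))
# df1.join(df2, on=cond_and, how='left')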
Then, to take a specific set of columns from df1 or df2, generate the select list with a list comprehension.
df = (df1.join(df2, on=cond, how='left')
.select(*[df1[c] for c in cols_df1_to_keep], *[df2[c] for c in cols_df2_to_keep]))
If you have cols_to_join as a list of tuples instead of dicts, you can simplify the code a bit.
cols_to_join = [("id", "id"), ("firstname", "firstname")]
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))
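Putting it all together, a minimal end-to-end sketch (assuming an active SparkSession and the df1 / df2 defined in the question):

from functools import reduce
from pyspark.sql import functions as F

cols_to_join = [("id", "id"), ("firstname", "firstname")]
cols_df1_to_keep = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
cols_df2_to_keep = ["company", "position", "points"]

# Build one OR condition over all column pairs, then join a single time
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))
result = (df1.join(df2, on=cond, how='left')
          .select(*[df1[c] for c in cols_df1_to_keep], *[df2[c] for c in cols_df2_to_keep]))
result.show()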