Generic coalesce of multiple columns in join pyspark
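I have to merge many Spark DataFrames. After merging, I would like to coalesce the columns that share the same name.
I was able to create a minimal example following this question.
However, I need a more generic piece of code that supports a set of variables to coalesce (in the example, set_vars = set(('var1','var2'))) and multiple join keys (in the example, join_keys = set(('id',))).
Is there a cleaner (more generic) way to obtain this result in pyspark?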
from pyspark.sql.functions import coalesce

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
df1 = spark.createDataFrame([
    (1, None, "aa"),
    (2, "a", None),
    (3, "b", None),
    (4, "h", None),],
    "id int, var1 string, var2 string",
)
df2 = spark.createDataFrame([
    (1, "f", "Ba"),
    (2, "a", "bb"),
    (3, "b", None),],
    "id int, var1 string, var2 string",
)
df1 = df1.alias("df1")
df2 = df2.alias("df2")
# For each shared column: coalesce into a temporary column, drop both originals, rename back.
df3 = (
    df1.join(df2, df1.id == df2.id, how='left')
    .withColumn("var1_", coalesce("df1.var1", "df2.var1"))
    .drop("var1")
    .withColumnRenamed("var1_", "var1")
    .withColumn("var2_", coalesce("df1.var2", "df2.var2"))
    .drop("var2")
    .withColumnRenamed("var2_", "var2")
)
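We can avoid duplicate columns by passing the join columns as a list to the join method instead of writing a join condition (see this link). Here, however, there are also common columns that are not part of the join condition. We can generalize your code with a for loop.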
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce

spark = SparkSession.builder.master("local[*]").getOrCreate()
df1 = spark.createDataFrame([
    (1, None, "aa"),
    (2, "a", None),
    (3, "b", None),
    (4, "h", None),],
    "id int, var1 string, var2 string",
)
df2 = spark.createDataFrame([
    (1, "f", "Ba"),
    (2, "a", "bb"),
    (3, "b", None),],
    "id int, var1 string, var2 string",
)
df1 = df1.alias("df1")
df2 = df2.alias("df2")
key_columns = ["id"]
# Get the columns common to both DataFrames, excluding the columns
# that are used as join keys.
# Note: set iteration order is arbitrary, so the output column order may vary.
other_common_columns = set(df1.columns).intersection(set(df2.columns))\
    .difference(set(key_columns))
outputDF = df1.join(df2, key_columns, how='left')
for i in other_common_columns:
    # Coalesce into a temporary column, drop both originals, then rename back.
    outputDF = outputDF.withColumn(f"{i}_", coalesce(f"df1.{i}", f"df2.{i}"))\
        .drop(i).withColumnRenamed(f"{i}_", i)
outputDF.show()
+---+----+----+
| id|var2|var1|
+---+----+----+
|  1|  aa|   f|
|  3|null|   b|
|  4|null|   h|
|  2|  bb|   a|
+---+----+----+