简体   繁体   中英

How to join datasets with same columns and select one?

I have two Spark dataframes which I am joining and selecting afterwards. I want to select a specific column of one of the Dataframes. But the same column name exists in the other one. Therefore I am getting an Exception for ambiguous column.

I have tried this:

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left").select($"d1.columnName")

and this:

d1.join(d2, d1("id") === d2("id"), "left").select($"d1.columnName")

but it does not work.

This happens because when spark combines the columns from the two DataFrames it doesn't do any automatic renaming for you. You just need to rename one of the columns before joining. Spark provides a method for this. After the join you can drop the renamed column.

val df2join = df2.withColumnRenamed("id", "join_id")
val joined = df1.join(df2, $"id" === $"join_id", "left").drop("join_id")

I have two dataframes

val d1 = spark.range(3).withColumn("columnName", lit("d1"))
scala> d1.printSchema
root
 |-- id: long (nullable = false)
 |-- columnName: string (nullable = false)

val d2 = spark.range(3).withColumn("columnName", lit("d2"))
scala> d2.printSchema
root
 |-- id: long (nullable = false)
 |-- columnName: string (nullable = false)

which I am joining and selecting afterwards. I want to select a specific column of one of the Dataframes. But the same column name exists in the other one.

val q1 = d1.as("d1")
  .join(d2.as("d2"), Seq("id"), "left")
  .select("d1.columnName")
scala> q1.show
+----------+
|columnName|
+----------+
|        d1|
|        d1|
|        d1|
+----------+

As you can see it just works.


So, why did it not work for you? Let's analyze each.

// you started very well
d1.as("d1")
  // but here you used $ to reference a column to join on
  // with column references by their aliases
  // that won't work
  .join(d2.as("d2"), $"d1.id" === $"d2.id", "left")
  // same here
  // $ + aliased columns won't work
  .select($"d1.columnName")

PROTIP : Use d1("columnName") to reference a specific column in a dataframe.

The other query was very close to be fine, but...

d1.join(d2, d1("id") === d2("id"), "left") // <-- so far so good!
  .select($"d1.columnName") // <-- that's the issue, i.e. $ + aliased column

which spark version you're using ? can you put a sample of your dataframes ? try this:

d2prim = d2.withColumnRenamed("columnName", d2_columnName)   
d1.join(d2prim , Seq("id"), "left_outer").select("columnName")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM