Concatenating two DataFrames with Scala Spark

Question

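I have two DataFrames:

The first DataFrame, DFNum, has 48 columns and 58500 rows.
The second DataFrame, DFString, has 7 columns and 58500 rows.

The two DataFrames have no columns in common. My goal is simply to concatenate them side by side into a single DataFrame with 55 columns (48 + 7) and the same 58500 rows, keeping the rows in the order they had before the concatenation.

I have made several attempts and read other questions, without success. In particular, I tried: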

val df = DFNum.join(DFString)

This gave me the following error:

Detected implicit cartesian product for INNER join between logical plans. Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;

Obviously, with a cross join I get far more rows than I want: 58500 * 58500 rows.

Then I tried adding an identical id column to both DataFrames to join on:

val tmpNum = DFNum.withColumn("id", monotonically_increasing_id())
val tmpString = DFString.withColumn("id", monotonically_increasing_id())

and then using:

val df = tmpNum.join(tmpString)

This gave me the following error:

USING column `id` cannot be resolved on the left side of the join. The left-side columns:[...]

I also tried several join types (with both tmpNum/tmpString and DFNum/DFString), for example:

val df = tmpNum.join(tmpString, Seq("id"), "outer")
val df = tmpNum.join(tmpString, Seq("id"), "full_outer")

and so on, but I always get the same error:

USING column `id` cannot be resolved on the left side of the join. The left-side columns:[...]

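(Obviously, with tmpNum and tmpString the resulting DataFrame has one extra column. I will drop the id column afterwards.)

If anyone has any ideas or suggestions, I would appreciate it.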

Answer 1

If you have no key column on which to join the two DataFrames, you can rely on monotonically_increasing_id:

val a = Seq(("First",1), ("Secound",2), ("Third",3), ("Fourth",4)).toDF("col1", "col2")
val b = Seq(("india",980), ("japan",990), ("korea",1000), ("chaina",900)).toDF("col3", "col4")

a.show

+-------+----+
|   col1|col2|
+-------+----+
|  First|   1|
|Secound|   2|
|  Third|   3|
| Fourth|   4|
+-------+----+

b.show
+------+----+
|  col3|col4|
+------+----+
| india| 980|
| japan| 990|
| korea|1000|
|chaina| 900|
+------+----+

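Then add a new column to both DataFrames. Make sure your DataFrames are ordered correctly, otherwise the rows will be mismatched after the join.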

import org.apache.spark.sql.functions.monotonically_increasing_id

val a1 = a.withColumn("id", monotonically_increasing_id())
val b1 = b.withColumn("id", monotonically_increasing_id())

Now join the two DataFrames on the id column, then drop the intermediate id column:

a1.join(b1, Seq("id")).drop("id").show

+-------+----+------+----+
|   col1|col2|  col3|col4|
+-------+----+------+----+
|  First|   1| india| 980|
|Secound|   2| japan| 990|
|  Third|   3| korea|1000|
| Fourth|   4|chaina| 900|
+-------+----+------+----+
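
One caveat with the approach above: monotonically_increasing_id only guarantees unique, increasing values, and the generated value depends on the partition each row sits in, so the ids of two DataFrames line up only when both are partitioned identically. A minimal sketch of an alternative, assuming the RDD API is acceptable, assigns consecutive indices with zipWithIndex instead. The names withRowIndex, concatByRowIndex and row_idx are hypothetical and introduced only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ZipByRowIndex {
  // Hypothetical helper: appends a consecutive, 0-based index column.
  def withRowIndex(df: DataFrame, name: String = "row_idx"): DataFrame = {
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val schema = StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
    df.sparkSession.createDataFrame(indexed, schema)
  }

  // Joins two DataFrames row by row on the generated index, restores the
  // original order, and drops the helper column again.
  def concatByRowIndex(left: DataFrame, right: DataFrame): DataFrame =
    withRowIndex(left)
      .join(withRowIndex(right), Seq("row_idx"))
      .orderBy("row_idx")
      .drop("row_idx")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("zip-by-row-index").getOrCreate()
    import spark.implicits._

    val a = Seq(("First", 1), ("Second", 2), ("Third", 3)).toDF("col1", "col2")
    val b = Seq(("india", 980), ("japan", 990), ("korea", 1000)).toDF("col3", "col4")

    concatByRowIndex(a, b).show()
    spark.stop()
  }
}
```

A join does not preserve input order on its own, which is why the sketch sorts by row_idx before dropping it.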

Answer 2

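Unless you use a Cartesian product, you cannot join two datasets that have no matching data. The column names do not have to be the same, but the values in the columns must match. If needed, you can join the two DataFrames using all of their columns.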

val df1 = // a DataFrame with columns col1-1, col1-2, col1-3
val df2 = // a DataFrame with columns col2-1, col2-2, col2-3

// Hyphenated column names cannot be accessed as fields, so look them up by string.
val dfJoined = df1.join(df2, df1("col1-1") === df2("col2-1") or df1("col1-2") === df2("col2-2") or df1("col1-3") === df2("col2-3"))

// Then drop one set of columns if they hold the same data.
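
To make the idea of joining on matching values concrete, here is a small sketch with made-up data. The column names left_key, left_val, right_key and right_val, and the object JoinOnCondition, are hypothetical and not taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinOnCondition {
  // Builds two tiny DataFrames (hypothetical data) and joins them on an
  // explicit equality condition between differently named columns.
  def example(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val left  = Seq((1, "a"), (2, "b")).toDF("left_key", "left_val")
    val right = Seq((1, "x"), (3, "y")).toDF("right_key", "right_val")
    // Inner join: only rows whose key values match survive.
    left.join(right, left("left_key") === right("right_key"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-on-condition").getOrCreate()
    example(spark).show()
    spark.stop()
  }
}
```

Because the condition is an explicit equality, Spark no longer treats the join as an implicit Cartesian product. It still requires that the two sides actually share values, which is not the case in the original question.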

Answer 3

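I tried to do this recently and had no success at all. You could try converting both objects to Pandas DataFrames (from PySpark) and then concatenating them.

Step 1:

df1 = df1.select("*").toPandas()
df2 = df2.select("*").toPandas()

Step 2:

import pandas as pd

result = pd.concat([df1, df2], axis=1)

Good luck!!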
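
The snippet above is PySpark, and toPandas pulls all rows onto the driver. A rough Scala sketch of the same driver-side idea follows, assuming both DataFrames fit comfortably in driver memory. The names DriverSideConcat and concatOnDriver are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

object DriverSideConcat {
  // Collects both DataFrames to the driver, pairs rows by position, and
  // rebuilds a single DataFrame with the columns of both inputs.
  // Note: zip silently truncates to the shorter input if row counts differ.
  def concatOnDriver(left: DataFrame, right: DataFrame): DataFrame = {
    val spark = left.sparkSession
    val rows = left.collect().zip(right.collect()).map { case (l, r) =>
      Row.fromSeq(l.toSeq ++ r.toSeq)
    }
    val schema = StructType(left.schema.fields ++ right.schema.fields)
    spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("driver-side-concat").getOrCreate()
    import spark.implicits._

    val a = Seq(("First", 1), ("Second", 2)).toDF("col1", "col2")
    val b = Seq(("india", 980), ("japan", 990)).toDF("col3", "col4")

    concatOnDriver(a, b).show()
    spark.stop()
  }
}
```

This gives up distributed execution for the concatenation step, so it is only reasonable for small inputs.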

Answer 1: score 2, accepted, 2020-02-12 09:12:11

Answer 2: score 0, 2020-02-11 22:34:18

Answer 3: score 0, 2020-02-13 03:53:45
