Merge two columns from different DataFrames in Spark using Scala

Question

I want to merge two columns from different DataFrames into a single DataFrame.

I have two DataFrames like this:

val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")
ds1.show()

+-----+
| Col1|
+-----+
|    0|
|    1|
|    0|
|    1|
+-----+

ds2.show()
+-----+
| Col2|
+-----+
|  234|
|   43|
|  341|
|   42|
+-----+

I want a third DataFrame that contains both columns, Col1 and Col2:

+-----+-----+
| Col1| Col2|
+-----+-----+
|    0|  234|
|    1|   43|
|    0|  341|
|    1|   42|
+-----+-----+

I tried a union:

val ds3 = ds1.union(ds2)

However, that appends all the rows of ds2 to ds1 instead of placing the columns side by side.

Answer 1

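monotonically_increasing_id is not deterministic: the IDs it generates depend on how each DataFrame is partitioned.

Therefore, there is no guarantee that joining on those IDs gives you the correct result.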
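To make the reason concrete: the Spark API documentation says the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below imitates that layout in plain Scala (no Spark needed). `simulatedIds` is a hypothetical helper written for this illustration, not a Spark function, and the partition sizes are made-up inputs.

```scala
// Mimics the documented ID layout: partition index in the upper bits,
// row position within the partition in the lower 33 bits.
def simulatedIds(rowsPerPartition: Seq[Int]): Seq[Long] =
  rowsPerPartition.zipWithIndex.flatMap { case (rowCount, partitionIndex) =>
    (0 until rowCount).map(row => (partitionIndex.toLong << 33) + row)
  }

// The same four rows, split across partitions in two different ways.
val idsA = simulatedIds(Seq(2, 2)) // two partitions with two rows each
val idsB = simulatedIds(Seq(1, 3)) // one row, then three rows

println(idsA) // List(0, 1, 8589934592, 8589934593)
println(idsB) // List(0, 8589934592, 8589934593, 8589934594)

// Only IDs present on both sides would survive an inner join on "id".
println(idsA.intersect(idsB)) // List(0, 8589934592, 8589934593)
```

Under these assumptions one row is lost in the join, and ID 8589934592 belongs to the third row on one side but to the second row on the other, so even the surviving rows can be paired incorrectly.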

It is easier to work with RDDs and create the key with zipWithIndex:

val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")

// Convert each DataFrame to an RDD and number its rows with zipWithIndex;
// the index becomes the join key. Avoid repartitioning before zipWithIndex,
// because a shuffle can change the row order.

val ds1Rdd = ds1.as[Int].rdd.zipWithIndex().map { case (v, k) => (k, v) }

val ds2Rdd = ds2.as[Int].rdd.zipWithIndex().map { case (v, k) => (k, v) }

// Check how the key-value pairs look

ds1Rdd.collect()

res50: Array[(Long, Int)] = Array((0,1), (1,0), (2,1), (3,0))

ds2Rdd.collect()

res51: Array[(Long, Int)] = Array((0,234), (1,43), (2,341), (3,42))

So the first element of each tuple is our join key.

We simply join, sort by the key to restore the original row order, and keep only the values:

val joinedRdd = ds1Rdd.join(ds2Rdd)

val resultrdd = joinedRdd.sortByKey().values

// resultrdd: org.apache.spark.rdd.RDD[(Int, Int)]

Then we convert the result back to a DataFrame:

resultrdd.toDF("Col1","Col2").show()
+----+----+
|Col1|Col2|
+----+----+
|   1| 234|
|   0|  43|
|   1| 341|
|   0|  42|
+----+----+
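The same idea can be wrapped in a helper that works for DataFrames with any number of columns. This is only a minimal sketch: `zipByPosition` is a hypothetical name, the local SparkSession settings are assumptions for a standalone run, and it assumes both inputs have the same number of rows (an inner join silently drops unmatched rows otherwise).

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

// Assumed local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("zip-columns").getOrCreate()
import spark.implicits._

// Hypothetical helper: pairs the rows of two DataFrames by their position.
def zipByPosition(left: DataFrame, right: DataFrame): DataFrame = {
  // zipWithIndex numbers rows 0, 1, 2, ... in the RDD's current order.
  val l = left.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
  val r = right.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
  // Join on the index, sort by it to restore row order, then glue the rows together.
  val rows = l.join(r).sortByKey().values.map { case (a, b) => Row.fromSeq(a.toSeq ++ b.toSeq) }
  spark.createDataFrame(rows, StructType(left.schema.fields ++ right.schema.fields))
}

val ds1 = Seq(1, 0, 1, 0).toDF("Col1")
val ds2 = Seq(234, 43, 341, 42).toDF("Col2")

val zipped = zipByPosition(ds1, ds2)
zipped.show() // rows pair up as (1,234), (0,43), (1,341), (0,42)
```

Working on Row objects and concatenating the two schemas avoids hard-coding the column types, at the cost of a shuffle for the join and another for the sort.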

Answer 2

I think concat is what you want in this case:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

Answer 3

You can use monotonically_increasing_id to create an extra id column, and then join the two DataFrames on that column.

scala> ds1.show
+----+
|Col1|
+----+
|   1|
|   0|
|   1|
|   0|
+----+


scala> ds2.show
+----+
|Col2|
+----+
| 234|
|  43|
| 341|
|  42|
+----+ 

scala> ds1.withColumn("id", monotonically_increasing_id).join(ds2.withColumn("id", monotonically_increasing_id), "id").drop("id").show
+----+----+
|Col1|Col2|
+----+----+
|   1| 234|
|   0|  42|
|   1| 341|
|   0|  43|
+----+----+
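Note that the joined rows above are not paired by position (0 ends up next to 42 and 43). One commonly used variation, which is a different technique from the one in this answer, turns the generated IDs into consecutive row numbers with a window function before joining. The sketch below assumes a local SparkSession and a hypothetical helper column named `rn`; it also moves every row into a single partition, so it only suits small data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Assumed local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("row-number-join").getOrCreate()
import spark.implicits._

val ds1 = Seq(1, 0, 1, 0).toDF("Col1")
val ds2 = Seq(234, 43, 341, 42).toDF("Col2")

// Ordering by the generated ID keeps each DataFrame's current row order;
// row_number then maps it to 1, 2, 3, ... regardless of partitioning.
// A window without partitionBy pulls all rows into one partition.
val byRowOrder = Window.orderBy(monotonically_increasing_id())

val joined = ds1.withColumn("rn", row_number().over(byRowOrder))
  .join(ds2.withColumn("rn", row_number().over(byRowOrder)), "rn")
  .orderBy("rn")
  .drop("rn")

joined.show() // rows pair up as (1,234), (0,43), (1,341), (0,42)
```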

Answer 4

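If you perform a union, intersection, or similar set operation on two queries or DataFrames, they must be "union compatible", meaning they have the same column definitions with compatible data types.

If both DataFrames have the same columns, the simplest solution is to use the newer unionByName API. If they have different schemas, it is advisable to create compatible views before the union.

You can create the function below to build a compatible select list: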

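import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// For every column in allCols, select it if this DataFrame has it;
// otherwise emit a null placeholder with the same name.
def merge(myCols: Set[String], allCols: Set[String]): List[Column] =
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)
    case x => lit(null).as(x)
  }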

Then use the merge method to build compatible select queries, as shown below.

import org.apache.spark.sql.functions._

val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("col2")
val cols1 = ds1.columns.toSet
val cols2 = ds2.columns.toSet
val unionCol = cols1 ++ cols2
// union is used here instead of unionAll, which was deprecated in Spark 2.0
val ds3 = ds1.select(merge(cols1, unionCol): _*).union(ds2.select(merge(cols2, unionCol): _*))

scala> ds3.show
+----+----+
|col1|col2|
+----+----+
|   1|null|
|   0|null|
|   1|null|
|   0|null|
|null| 234|
|null|  43|
|null| 341|
|null|  42|
+----+----+

You can also use unionByName, which eliminates column-ordering problems:

val ds3 = ds1.select(merge(cols1, unionCol): _*).unionByName(ds2.select(merge(cols2, unionCol): _*))
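On Spark 3.1 or later, unionByName also accepts an allowMissingColumns flag that fills in missing columns with nulls, so the merge helper may not be needed. This is a minimal sketch under that version assumption; the local SparkSession settings are assumed for a standalone run.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("union-by-name").getOrCreate()
import spark.implicits._

val ds1 = Seq(1, 0, 1, 0).toDF("col1")
val ds2 = Seq(234, 43, 341, 42).toDF("col2")

// Requires Spark 3.1+: columns missing on either side are filled with null.
val ds3 = ds1.unionByName(ds2, allowMissingColumns = true)
ds3.show()
```

As with the merge helper, the result stacks the rows (eight rows here) rather than placing the two columns side by side.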


Answer 1
1, accepted, 2019-11-13 22:45:39

Answer 2
0, 2019-11-13 20:13:49

Answer 3
0, 2019-11-13 20:41:22

Answer 4
0, 2019-11-13 22:31:35
