
Concatenating datasets of different RDDs in Apache Spark using Scala

Is there a way to concatenate the datasets of two different RDDs in Spark?

The requirement is: I create two intermediate RDDs in Scala that have the same column names, and I need to combine the results of both RDDs and cache the combined result so it can be accessed from a UI. How do I combine the datasets here?

The RDDs are of type spark.sql.SchemaRDD.
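For reference, a minimal sketch of how two such SchemaRDDs with identical columns might be created (Spark 1.0-1.2 API; the Metric case class and the sample data are illustrative assumptions, not from the original post):

import org.apache.spark.sql.{SQLContext, SchemaRDD}

// hypothetical row type; the real column names are not given in the question
case class Metric(id: Int, month: String, value: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicitly converts RDD[Metric] to SchemaRDD

val part1: SchemaRDD = sc.parallelize(Seq(Metric(1, "Aug", 30), Metric(2, "Sep", 10)))
val part2: SchemaRDD = sc.parallelize(Seq(Metric(1, "Oct", 10), Metric(2, "Nov", 15)))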

I think you are looking for RDD.union:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in the Spark shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30),(1, "Sep", 31),(2, "Aug", 15),(2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10),(1, "Nov", 12),(2, "Oct", 5),(2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
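Since the question also mentions caching the combined result for a UI, a minimal follow-up sketch (assuming the rdd1 and rdd2 from the example above):

val rddAll = rdd1.union(rdd2)
rddAll.cache()   // keep the combined RDD in memory for repeated UI reads
rddAll.count()   // an action forces evaluation and populates the cache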

I had the same problem. To combine the datasets row-wise (appending one below the other) rather than column-wise, use unionAll:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.unionAll(rddPart2)

I found it after reading the method summary for DataFrame. More information at: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html
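For illustration, a minimal Spark-shell sketch of unionAll on DataFrames (Spark 1.3+ API; the column names "id", "month", and "value" are assumptions, not from the original post):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df1 = sc.parallelize(Seq((1, "Aug", 30), (2, "Sep", 10))).toDF("id", "month", "value")
val df2 = sc.parallelize(Seq((1, "Oct", 10), (2, "Nov", 15))).toDF("id", "month", "value")

// unionAll appends df2's rows below df1's; both inputs must share the same schema
val all = df1.unionAll(df2)
all.show()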
