Convert Array[(String,String)] type to RDD[(String,String)] type in Spark

I am new to Spark.

Here is my code:

val Data = sc.parallelize(List(
      ("I", "India"), 
      ("U", "USA"), 
      ("W", "West"))) 

val DataArray = sc.broadcast(Data.collect)

val FinalData = DataArray.value

Here FinalData is of type Array[(String, String)], but I want the data to be of type RDD[(String, String)].

Can I convert FinalData to the RDD[(String, String)] type?

More detail:

I want to join two RDDs. To optimize the join (from a performance point of view), I am broadcasting the small RDD to the whole cluster so that less data is shuffled (which indirectly improves performance). So I am writing something like this:

//Big Data
val FirstRDD = sc.parallelize(List(****Data of first table****))

//Small Data
val SecondRDD = sc.parallelize(List(****Data of Second table****)) 

So I will definitely broadcast the small data set (i.e., SecondRDD):

val DataArray = sc.broadcast(SecondRDD.collect)

val FinalData = DataArray.value

// The following line gives an error:

val Join = FirstRDD.leftOuterJoin(FinalData)

found: Array, required: RDD

That's why I am looking for an Array-to-RDD conversion.

The real problem is that you are creating a Broadcast variable by collecting the RDD (notice that this action turns the RDD into an Array). So, what I'm saying is that you already have an RDD, namely Data, and this variable has exactly the same values as FinalData, but in the form you want: RDD[(String, String)].

You can check this in the following output:

Data: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[2] at parallelize at <console>:32
DataArray: org.apache.spark.broadcast.Broadcast[Array[(String, String)]] = Broadcast(1)
FinalData: Array[(String, String)] = Array((I,India), (U,USA), (W,West))

Although I don't quite understand your approach: you just need to parallelize the Broadcast's value.

// You already have this data stored in `Data`, so it's pointless to repeat this process.
val DataCopy = sc.parallelize(DataArray.value)
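
With that, the original join from the question compiles, because both operands are now RDDs (a sketch; FirstRDD here stands for the placeholder RDD from the question):

// Both operands are pair RDDs, so leftOuterJoin type-checks.
// Result type: RDD[(String, (String, Option[String]))]
val Join = FirstRDD.leftOuterJoin(DataCopy)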

EDIT

After reading your question again, I believe the problem is almost the same. You are trying to join an RDD with a Broadcast, and that's not allowed. However, if you read the documentation you may notice that it's possible to join two RDDs (see the code below).

val joinRDD = FirstRDD.keyBy(_._1).join(SecondRDD.keyBy(_._1))
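
Note that both RDDs in the question are already pair RDDs keyed by their first element, so the keyBy step can be dropped (a sketch, assuming both sides are RDD[(String, String)]):

// Both sides are already keyed, so they can be joined directly.
// join keeps only matching keys; leftOuterJoin keeps every row of FirstRDD.
val joined = FirstRDD.join(SecondRDD)              // RDD[(String, (String, String))]
val leftJoined = FirstRDD.leftOuterJoin(SecondRDD) // RDD[(String, (String, Option[String]))]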

Broadcasts are indeed useful for improving the performance of a join between a large RDD and a smaller one. When you do that, the broadcast (along with map or mapPartitions) replaces the join; it isn't used in a join, so there is no point at which you would need to "transform a broadcast into an RDD".

Here's how it would look:

val largeRDD = sc.parallelize(List(
  ("I", "India"),
  ("U", "USA"),
  ("W", "West")))

val smallRDD = sc.parallelize(List(
  ("I", 34),
  ("U", 45)))

val smaller = sc.broadcast(smallRDD.collectAsMap())

// "smaller.value" is used inside the function passed to RDD.map, i.e., on the
// executor side; the broadcast ensures it is copied to each executor (once!).
val joinResult = largeRDD.map { case (k, v) => (k, v, smaller.value.get(k)) }

joinResult.foreach(println)
// prints:
// (I,India,Some(34))
// (W,West,None)
// (U,USA,Some(45))

A similar solution using mapPartitions might be more efficient.
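
Since the original link isn't preserved here, a minimal sketch of that mapPartitions variant (reusing largeRDD and smaller from above): the broadcast value is dereferenced once per partition instead of once per record.

// Look up the broadcast map once per partition, then stream the records through it.
val joinResult2 = largeRDD.mapPartitions { iter =>
  val lookup = smaller.value
  iter.map { case (k, v) => (k, v, lookup.get(k)) }
}

joinResult2.foreach(println)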
