Convert Array[(String, String)] type to RDD[(String, String)] type in Spark
I am new to Spark. Here is my code:
val Data = sc.parallelize(List(
  ("I", "India"),
  ("U", "USA"),
  ("W", "West")))
val DataArray = sc.broadcast(Data.collect)
val FinalData = DataArray.value
Here FinalData is of type Array[(String, String)], but I want the data to be of type RDD[(String, String)].
Can I convert FinalData to type RDD[(String, String)]?
More Detail:
I want to join two RDDs. To optimize the join (from a performance point of view), I am broadcasting the small RDD to the whole cluster so that less data is shuffled (which indirectly improves performance). So I am writing something like this:
//Big Data
val FirstRDD = sc.parallelize(List(****Data of first table****))
//Small Data
val SecondRDD = sc.parallelize(List(****Data of Second table****))
So I will definitely broadcast the small data set (that is, SecondRDD):
val DataArray = sc.broadcast(SecondRDD.collect)
val FinalData = DataArray.value
//Here it gives an error:
val Join = FirstRDD.leftOuterJoin(FinalData)
Found Array required RDD
That's why I am looking for an Array-to-RDD conversion.
The real problem is that you are creating a Broadcast variable by collecting the RDD (notice that this action converts the RDD into an Array). So what I'm saying is that you already have an RDD, which is Data, and this variable holds exactly the same values as FinalData, but in the form you want: RDD[(String, String)].
You can check this in the following output.
Data: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[2] at parallelize at <console>:32
DataArray: org.apache.spark.broadcast.Broadcast[Array[(String, String)]] = Broadcast(1)
FinalData: Array[(String, String)] = Array((I,India), (U,USA), (W,West))
Although I don't understand your approach, you just need to parallelize the Broadcast's value:
// You already have this data stored in `Data`, so it's useless to repeat this process.
val DataCopy = sc.parallelize(DataArray.value)
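For illustration, a minimal sketch of how the reparallelized value could then be used in the leftOuterJoin from the question (the FirstRDD contents here are made up, since the question only shows placeholders):
// Hypothetical contents for FirstRDD, just to make the example runnable in the shell.
val FirstRDD = sc.parallelize(List(("I", "Delhi"), ("U", "Washington")))
// DataCopy is an RDD[(String, String)], so the pair-RDD leftOuterJoin now compiles.
val Join = FirstRDD.leftOuterJoin(DataCopy)
Join.collect.foreach(println)
// e.g. (I,(Delhi,Some(India))), (U,(Washington,Some(USA)))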
After reading your question again, I believe the problem is almost the same. You are trying to join an RDD with a Broadcast, and that's not allowed. However, if you read the documentation you may notice that it's possible to join two RDDs (see the code below).
val joinRDD = FirstRDD.keyBy(_._1).join(SecondRDD.keyBy(_._1))
Broadcasts are indeed useful for improving the performance of a join between a large RDD and a smaller one. When you do that, the broadcast (along with map or mapPartitions) replaces the join; it is not used in a join, so there is no way you'll need to "transform a broadcast into an RDD".
Here's how it would look:
val largeRDD = sc.parallelize(List(
  ("I", "India"),
  ("U", "USA"),
  ("W", "West")))
val smallRDD = sc.parallelize(List(
  ("I", 34),
  ("U", 45)))
val smaller = sc.broadcast(smallRDD.collectAsMap())
// using "smaller.value" inside the function passed to RDD.map ->
// on executor side. Broadcast made sure it's copied to each executor (once!)
val joinResult = largeRDD.map { case (k, v) => (k, v, smaller.value.get(k)) }
joinResult.foreach(println)
// prints:
// (I,India,Some(34))
// (W,West,None)
// (U,USA,Some(45))
See a similar solution (using mapPartitions), which might be more efficient, here.
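For reference, a minimal sketch of the mapPartitions variant (an assumption about what that approach looks like, not the linked solution itself: it reads the broadcast value once per partition instead of once per record):
val joinResult2 = largeRDD.mapPartitions { iter =>
  // Look up the broadcast value once per partition, then reuse it for every element.
  val lookup = smaller.value
  iter.map { case (k, v) => (k, v, lookup.get(k)) }
}
joinResult2.foreach(println)
// prints the same three rows as joinResult above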