Convert Array[(String,String)] type to RDD[(String,String)] type in Spark

I am new to Spark.

Here is my code:

val Data = sc.parallelize(List(
      ("I", "India"), 
      ("U", "USA"), 
      ("W", "West"))) 

val DataArray = sc.broadcast(Data.collect)

val FinalData = DataArray.value

Here FinalData is of type Array[(String, String)], but I want the data to be of type RDD[(String, String)].

Can I convert FinalData to the RDD[(String, String)] type?

More detail:

I want to join two RDDs. To optimize the join (from a performance point of view), I am broadcasting the small RDD to the whole cluster so that less data is shuffled (which indirectly improves performance). So I am writing something like this:

//Big Data
val FirstRDD = sc.parallelize(List(****Data of first table****))

//Small Data
val SecondRDD = sc.parallelize(List(****Data of Second table****)) 

So I will definitely broadcast the small data set (i.e., SecondRDD):

val DataArray = sc.broadcast(SecondRDD.collect)

val FinalData = DataArray.value

// The following line gives an error:

val Join = FirstRDD.leftOuterJoin(FinalData)

found: Array, required: RDD

That's why I am looking for an Array-to-RDD conversion.

The real problem is that you are creating a Broadcast variable by collecting the RDD (notice that this action turns the RDD into an Array). So, what I'm saying is that you already have an RDD, namely Data, and this variable has exactly the same values as FinalData, but in the form you want: RDD[(String, String)].

You can check this in the following output:

Data: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[2] at parallelize at <console>:32
DataArray: org.apache.spark.broadcast.Broadcast[Array[(String, String)]] = Broadcast(1)
FinalData: Array[(String, String)] = Array((I,India), (U,USA), (W,West))

Although I don't quite understand your approach: you just need to parallelize the Broadcast's value.

// You already have this data stored in `Data`, so it's pointless to repeat this process.
val DataCopy = sc.parallelize(DataArray.value)
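
With that, the original join from the question compiles, because both operands are now RDDs (a sketch; FirstRDD here stands for the placeholder RDD from the question):

// Both operands are pair RDDs, so leftOuterJoin type-checks.
// Result type: RDD[(String, (String, Option[String]))]
val Join = FirstRDD.leftOuterJoin(DataCopy)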

EDIT

After reading your question again, I believe the problem is almost the same. You are trying to join an RDD with a Broadcast, and that's not allowed. However, if you read the documentation you may notice that it's possible to join two RDDs (see the code below).

val joinRDD = FirstRDD.keyBy(_._1).join(SecondRDD.keyBy(_._1))
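
Note that both RDDs in the question are already pair RDDs keyed by their first element, so the keyBy step can be dropped (a sketch, assuming both sides are RDD[(String, String)]):

// Both sides are already keyed, so they can be joined directly.
// join keeps only matching keys; leftOuterJoin keeps every row of FirstRDD.
val joined = FirstRDD.join(SecondRDD)              // RDD[(String, (String, String))]
val leftJoined = FirstRDD.leftOuterJoin(SecondRDD) // RDD[(String, (String, Option[String]))]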

Broadcasts are indeed useful for improving the performance of a join between a large RDD and a smaller one. When you do that, the broadcast (along with map or mapPartitions) replaces the join; it isn't used in a join, so there is no point at which you would need to "transform a broadcast into an RDD".

Here's how it would look:

val largeRDD = sc.parallelize(List(
  ("I", "India"),
  ("U", "USA"),
  ("W", "West")))

val smallRDD = sc.parallelize(List(
  ("I", 34),
  ("U", 45)))

val smaller = sc.broadcast(smallRDD.collectAsMap())

// "smaller.value" is used inside the function passed to RDD.map, i.e., on the
// executor side; the broadcast ensures it is copied to each executor (once!).
val joinResult = largeRDD.map { case (k, v) => (k, v, smaller.value.get(k)) }

joinResult.foreach(println)
// prints:
// (I,India,Some(34))
// (W,West,None)
// (U,USA,Some(45))

A similar solution using mapPartitions might be more efficient.
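
Since the original link isn't preserved here, a minimal sketch of that mapPartitions variant (reusing largeRDD and smaller from above): the broadcast value is dereferenced once per partition instead of once per record.

// Look up the broadcast map once per partition, then stream the records through it.
val joinResult2 = largeRDD.mapPartitions { iter =>
  val lookup = smaller.value
  iter.map { case (k, v) => (k, v, lookup.get(k)) }
}

joinResult2.foreach(println)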
