
How to copy a range of elements in RDD into a smaller RDD

I have the following RDD[String]:

val rdd = sc.makeRDD(Seq("paul", "jim", "joe", "mary", "sean", "peter", "lucy"))

What I would like to do is generate a smaller RDD by copying a range of lines from the master RDD above.

Use case: When working through RDDs in Spark, unusual situations can arise, and more often than not it is certain lines/records in the RDD that cause the problems.

Being able to programmatically copy a range from one RDD into another is a useful feature, and I could not find a built-in RDD method to do this. See my solution below.

val rdd = sc.makeRDD(Seq("paul", "jim", "joe", "mary", "sean", "peter", "lucy")) 

// Zero-based positions to keep; both ends are inclusive
val startIndex = 1
val endIndex = 5
val shortRdd = rdd.zipWithIndex().filter { case (_, idx) => idx >= startIndex && idx <= endIndex }.map(p => p._1)
shortRdd.count
shortRdd.foreach(println)
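The same pipeline can be wrapped in a small reusable function. The sketch below is only an illustration: the names `RddSlice` and `sliceByIndex` are hypothetical (they are not part of Spark), and it assumes a local `SparkSession` rather than the `sc` provided by spark-shell.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

object RddSlice {
  // Hypothetical helper: keep the elements whose zero-based position
  // lies in [startIndex, endIndex], both ends inclusive.
  def sliceByIndex[T: ClassTag](rdd: RDD[T], startIndex: Long, endIndex: Long): RDD[T] =
    rdd.zipWithIndex()
      .filter { case (_, idx) => idx >= startIndex && idx <= endIndex }
      .map { case (value, _) => value }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this demonstration.
    val spark = SparkSession.builder().master("local[2]").appName("rdd-slice").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.makeRDD(Seq("paul", "jim", "joe", "mary", "sean", "peter", "lucy"))
    val shortRdd = sliceByIndex(rdd, 1L, 5L)

    // collect() returns the elements to the driver in partition order.
    println(shortRdd.collect().mkString(", "))

    spark.stop()
  }
}
```

Because the helper takes `Long` bounds, it matches the `Long` index that `zipWithIndex` produces and should work for RDDs of any element type.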

Step 1: Let's see what's inside the RDD:

rdd.foreach(println)
peter
lucy
jim
joe
paul
mary
sean

Step 2: Apply a transformation to append an index. Notice that an index value is now attached to each line:

rdd.zipWithIndex().foreach(println)
(peter,5)
(jim,1)
(joe,2)
(paul,0)
(mary,3)
(sean,4)
(lucy,6)
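The lines above print in a scrambled order because `foreach(println)` runs on each partition independently, but the index values themselves still follow the original order of the sequence. To inspect a small RDD in index order, one option is to `collect()` the pairs to the driver first. This is a sketch under assumptions: `IndexedView` and `indexedInOrder` are hypothetical names, a local `SparkSession` is used, and `collect()` is only sensible when the RDD fits in driver memory.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object IndexedView {
  // Hypothetical helper: bring the (element, index) pairs to the driver.
  // collect() concatenates partitions in order, which matches index order.
  def indexedInOrder(rdd: RDD[String]): Array[(String, Long)] =
    rdd.zipWithIndex().collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("indexed-view").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.makeRDD(Seq("paul", "jim", "joe", "mary", "sean", "peter", "lucy"))

    // Printing happens on the driver here, so the order is stable.
    indexedInOrder(rdd).foreach(println)

    spark.stop()
  }
}
```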

Step 3: Filter on the index position, keeping only the elements whose index lies between the start and end positions (inclusive):

rdd.zipWithIndex().filter { case (_, idx) => idx >= startIndex && idx <= endIndex }.foreach(println)
(mary,3)
(sean,4)
(jim,1)
(peter,5)
(joe,2)

Step 4: Map each (element, index) pair back to the element alone:

rdd.zipWithIndex().filter { case (_, idx) => idx >= startIndex && idx <= endIndex }.map(p=>p._1).foreach(println)
mary
jim
joe
peter
sean
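As a variation on the filter and map steps, the two can be merged by passing a partial function to `RDD.collect`, which keeps only the pairs the function is defined for and transforms them in one pass. This is an alternative sketch, not the approach used above; `RangeCollect` and `rangeOf` are hypothetical names, and a local `SparkSession` is assumed.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RangeCollect {
  // Hypothetical helper: the guard plays the role of filter,
  // and the right-hand side of the case plays the role of map.
  def rangeOf(rdd: RDD[String], startIndex: Long, endIndex: Long): RDD[String] =
    rdd.zipWithIndex().collect {
      case (value, idx) if idx >= startIndex && idx <= endIndex => value
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("range-collect").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.makeRDD(Seq("paul", "jim", "joe", "mary", "sean", "peter", "lucy"))

    // The partial-function collect is a transformation and returns an RDD;
    // the no-argument collect() then brings the result to the driver.
    println(rangeOf(rdd, 1L, 5L).collect().mkString(", "))

    spark.stop()
  }
}
```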

I ran this process on RDDs with 100k or more lines without any issues. Let me know how it performs with larger RDDs.

That's it! Paul.
