Subsetting RDDs in Spark - Python
I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?
For example, with idx=[0,4,5,6,8], I'd like to be able to get a new RDD with elements 0, 4, 5, 6 and 8.
Note that I am not interested in random sampling, which is already available.
Yes, you can either:

1. key both RDDs by the lookup value and join them, or
2. keep the list of values locally and filter the RDD against it.

Choose 1 if the list of values is large, else 2.
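For the small-list case (option 2), here is a minimal sketch. The commented lines show the assumed PySpark form (`rdd`, `zipWithIndex`, `filter` are the usual RDD API); the runnable part below simulates the same logic on a plain Python list so it works without a Spark cluster:

```python
# Assumed PySpark form of option 2, filtering by position with a small index list:
#   idx_set = set(idx)
#   subset = rdd.zipWithIndex() \
#               .filter(lambda pair: pair[1] in idx_set) \
#               .map(lambda pair: pair[0])
# The same logic on a plain list, so this sketch runs anywhere:
data = ["p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
idx = {0, 4, 5, 6, 8}  # a set makes each membership test O(1)

subset = [x for i, x in enumerate(data) if i in idx]
print(subset)  # ['p0', 'p4', 'p5', 'p6', 'p8']
```

This only works for selecting by position; if the "index" is a field inside each point, key the RDD by that field instead, as in the join-based sample.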
Edit: a code sample for case 1:
val filteringValues = // read the list of values, same as you read your points, just easier
  .keyBy(identity)

val filtered = parsedData
  .keyBy(_.something)              // key each point by the relevant field of its inner structure
  .rightOuterJoin(filteringValues) // keeps only the keys present in filteringValues
  .flatMap(x => x._2._1)           // drop the misses (None) and map back to the original type
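To see what the join-based version does, here is an illustrative sketch of its semantics in plain Python: the dict stands in for the keyed `parsedData`, `None` plays the role of the empty `Option` that `rightOuterJoin` produces for unmatched keys, and all names and values are made up for the example:

```python
# Illustrative sketch of rightOuterJoin + flatMap, using plain Python
# structures instead of RDDs (keys and point values are hypothetical).
points_by_key = {0: "point0", 1: "point1", 4: "point4", 5: "point5"}  # parsedData.keyBy(...)
filtering_keys = [0, 4, 5, 6, 8]                                      # filteringValues

# rightOuterJoin keeps every key from the right side; the left value is
# the point when present, None otherwise.
joined = [(k, (points_by_key.get(k), k)) for k in filtering_keys]

# flatMap(x => x._2._1) flattens the Options: None entries disappear,
# leaving only the points whose key appeared in the filtering list.
filtered = [opt for _, (opt, _) in joined if opt is not None]
print(filtered)  # ['point0', 'point4', 'point5']
```

The flatMap at the end is what silently drops indices that match nothing, so the result never contains empty placeholders.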