
Subsetting RDDs in Spark - Python

I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?

For example, with idx = [0, 4, 5, 6, 8], I'd like to be able to get a new RDD with elements 0, 4, 5, 6 and 8.

Note that I am not interested in random sampling, which is already available.

Yes, you can either:

  1. Key your RDD by your set of values, put those values in another RDD, then do a leftOuterJoin to merge them, keeping only those in the set.
  2. Put all your values into a broadcast variable (as a simple set) so that it gets shared across executors, then run a filter operation that validates that the points exist in your set (a sketch follows this list).

Choose 1 if the list of values is large, else 2.
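
For case 2, here is a minimal PySpark sketch. Assumptions (not from the original post): a SparkContext named sc, the RDD of LabeledPoints named parsedData with toy data, and indices attached via zipWithIndex, since RDD elements carry no built-in index.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="subset-rdd")
# Hypothetical toy data standing in for the asker's RDD of LabeledPoints.
parsedData = sc.parallelize([LabeledPoint(i % 2, [float(i)]) for i in range(10)])

idx = [0, 4, 5, 6, 8]
idx_bc = sc.broadcast(set(idx))  # small index set, shared with every executor

subset = (parsedData
          .zipWithIndex()                            # (point, index) pairs
          .filter(lambda pi: pi[1] in idx_bc.value)  # keep only the listed indices
          .map(lambda pi: pi[0]))                    # drop the index again

print(subset.collect())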


Edit to show a code sample for case 1.

val filteringValues = // read the list of values, same as you do your points, just easier
            .keyBy(identity)

val filtered = parsedData
            .keyBy(_.something) // Key by the relevant field of your inner structure
            .rightOuterJoin(filteringValues) // Keeps only keys present in filteringValues
            .flatMap(x => x._2._1) // The left side is an Option; flatMap unwraps it back to the original type
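
Since the question is about PySpark, here is a hedged translation of the same join-based approach into Python, reusing the sc, parsedData, and idx names assumed in the earlier sketch:

filteringValues = sc.parallelize(idx).keyBy(lambda v: v)  # (index, index) pairs

filtered = (parsedData
            .zipWithIndex()
            .map(lambda pi: (pi[1], pi[0]))    # key each point by its index
            .rightOuterJoin(filteringValues)   # keeps only keys present in filteringValues
            .flatMap(lambda kv: [] if kv[1][0] is None else [kv[1][0]]))  # back to LabeledPoint

After a rightOuterJoin, each value is a pair whose left side may be None, which is why the final flatMap filters the Nones out while unwrapping the original elements.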
