
Subsetting RDDs in Spark - Python

I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?

For example, with idx = [0, 4, 5, 6, 8], I'd like to be able to get a new RDD with elements 0, 4, 5, 6 and 8.

Note that I am not interested in random sampling, which is already available.

Yes, you can either:

  1. Key your RDD by your set of values, put those values in another RDD, then do a leftOuterJoin to merge them, keeping only those in the set.
  2. Put all your values into a broadcast variable (as a simple set) so that it gets shared across executors, then run a filter operation that validates that the points exist in your set (a sketch follows this list).

Choose 1 if the list of values is large, else 2.
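
For case 2, here is a minimal PySpark sketch. Assumptions (not from the original post): a SparkContext named sc, the RDD of LabeledPoints named parsedData with toy data, and indices attached via zipWithIndex, since RDD elements carry no built-in index.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="subset-rdd")
# Hypothetical toy data standing in for the asker's RDD of LabeledPoints.
parsedData = sc.parallelize([LabeledPoint(i % 2, [float(i)]) for i in range(10)])

idx = [0, 4, 5, 6, 8]
idx_bc = sc.broadcast(set(idx))  # small index set, shared with every executor

subset = (parsedData
          .zipWithIndex()                            # (point, index) pairs
          .filter(lambda pi: pi[1] in idx_bc.value)  # keep only the listed indices
          .map(lambda pi: pi[0]))                    # drop the index again

print(subset.collect())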


Edit to show a code sample for case 1.

val filteringValues = // read the list of values, same as you do your points, just easier
            .keyBy(identity)

val filtered = parsedData
            .keyBy(_.something) // Key by the relevant field of your inner structure
            .rightOuterJoin(filteringValues) // Keeps only keys present in filteringValues
            .flatMap(x => x._2._1) // The left side is an Option; flatMap unwraps it back to the original type
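
Since the question is about PySpark, here is a hedged translation of the same join-based approach into Python, reusing the sc, parsedData, and idx names assumed in the earlier sketch:

filteringValues = sc.parallelize(idx).keyBy(lambda v: v)  # (index, index) pairs

filtered = (parsedData
            .zipWithIndex()
            .map(lambda pi: (pi[1], pi[0]))    # key each point by its index
            .rightOuterJoin(filteringValues)   # keeps only keys present in filteringValues
            .flatMap(lambda kv: [] if kv[1][0] is None else [kv[1][0]]))  # back to LabeledPoint

After a rightOuterJoin, each value is a pair whose left side may be None, which is why the final flatMap filters the Nones out while unwrapping the original elements.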
