Subsetting RDDs in Spark - Python

I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?

For example, with idx=[0,4,5,6,8], I'd like to be able to get a new RDD with elements 0, 4, 5, 6 and 8.

Note that I am not interested in random sampling, which is already available.

Yes, you can either:

  1. Key your RDD by the values you want to select, put those values in another RDD keyed the same way, then do a leftOuterJoin (or rightOuterJoin, depending on which side holds the values) to merge them, keeping only the matches (see the code sample below).
  2. Put all your values into a broadcast variable (as a simple set) so that it gets shared across executors, then run a filter operation that checks whether each point exists in your set (see the sketch after this list).

Choose option 1 if the list of values is large, otherwise option 2.
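
Since the question asks for positional indices in PySpark, here is a minimal sketch of option 2. It assumes the indices refer to element positions (turned into keys via zipWithIndex), and parsed_data is a hypothetical stand-in for your RDD of LabeledPoints:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
parsed_data = sc.parallelize(range(100))  # hypothetical stand-in for your RDD of LabeledPoints

idx = [0, 4, 5, 6, 8]
idx_set = sc.broadcast(set(idx))          # share the index set with every executor

filtered = (parsed_data
            .zipWithIndex()                           # pair each element with its position
            .filter(lambda x: x[1] in idx_set.value)  # keep only the requested positions
            .map(lambda x: x[0]))                     # drop the position again

print(filtered.collect())  # [0, 4, 5, 6, 8] for this toy data

Because the broadcast set is checked locally on each executor, this variant needs no shuffle, which is why it suits small index lists.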


Edit: a code sample for case 1 (in Scala), followed by a PySpark version.

val filteringValues = valuesData        // placeholder: read the list of values, same as you read your points
            .keyBy(identity)            // key each value by itself

val filtered = parsedData
            .keyBy(_.something)         // placeholder: get the key from your inner structure
            .rightOuterJoin(filteringValues) // keeps only the keys present in your subset
            .flatMap(x => x._2._1)      // unwrap the Option to map back to the original type
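
And a hedged PySpark translation of the same case-1 idea, reusing the placeholder names from the earlier sketch and again treating the indices as positions:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
parsed_data = sc.parallelize(range(100))  # hypothetical stand-in, as above
idx = [0, 4, 5, 6, 8]

filtering_values = sc.parallelize(idx).keyBy(lambda v: v)  # (index, index) pairs

filtered = (parsed_data
            .zipWithIndex()                    # (element, position)
            .map(lambda x: (x[1], x[0]))       # re-key as (position, element)
            .rightOuterJoin(filtering_values)  # keep every requested position
            .flatMap(lambda x: [x[1][0]] if x[1][0] is not None else []))  # unwrap, drop misses

A plain join() would work just as well here and needs no None handling; rightOuterJoin simply mirrors the Scala version, where unmatched keys surface as None.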
