Subsetting RDDs in Spark - Python

I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?

For example, with idx=[0,4,5,6,8], I'd like to be able to get a new RDD with elements 0, 4, 5, 6 and 8.

Note that I am not interested in random sampling, which is already available.

Yes, you can either:

  1. Key your RDD by the values you want to select, put those values in another RDD keyed the same way, then do a leftOuterJoin (or rightOuterJoin, depending on which side holds the values) to merge them, keeping only the matches (see the code sample below).
  2. Put all your values into a broadcast variable (as a simple set) so that it gets shared across executors, then run a filter operation that checks whether each point exists in your set (see the sketch after this list).

Choose option 1 if the list of values is large, otherwise option 2.
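
Since the question asks for positional indices in PySpark, here is a minimal sketch of option 2. It assumes the indices refer to element positions (turned into keys via zipWithIndex), and parsed_data is a hypothetical stand-in for your RDD of LabeledPoints:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
parsed_data = sc.parallelize(range(100))  # hypothetical stand-in for your RDD of LabeledPoints

idx = [0, 4, 5, 6, 8]
idx_set = sc.broadcast(set(idx))          # share the index set with every executor

filtered = (parsed_data
            .zipWithIndex()                           # pair each element with its position
            .filter(lambda x: x[1] in idx_set.value)  # keep only the requested positions
            .map(lambda x: x[0]))                     # drop the position again

print(filtered.collect())  # [0, 4, 5, 6, 8] for this toy data

Because the broadcast set is checked locally on each executor, this variant needs no shuffle, which is why it suits small index lists.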


Edit: a code sample for case 1 (in Scala), followed by a PySpark version.

val filteringValues = valuesData        // placeholder: read the list of values, same as you read your points
            .keyBy(identity)            // key each value by itself

val filtered = parsedData
            .keyBy(_.something)         // placeholder: get the key from your inner structure
            .rightOuterJoin(filteringValues) // keeps only the keys present in your subset
            .flatMap(x => x._2._1)      // unwrap the Option to map back to the original type
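
And a hedged PySpark translation of the same case-1 idea, reusing the placeholder names from the earlier sketch and again treating the indices as positions:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
parsed_data = sc.parallelize(range(100))  # hypothetical stand-in, as above
idx = [0, 4, 5, 6, 8]

filtering_values = sc.parallelize(idx).keyBy(lambda v: v)  # (index, index) pairs

filtered = (parsed_data
            .zipWithIndex()                    # (element, position)
            .map(lambda x: (x[1], x[0]))       # re-key as (position, element)
            .rightOuterJoin(filtering_values)  # keep every requested position
            .flatMap(lambda x: [x[1][0]] if x[1][0] is not None else []))  # unwrap, drop misses

A plain join() would work just as well here and needs no None handling; rightOuterJoin simply mirrors the Scala version, where unmatched keys surface as None.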
