Subsetting RDDs in Spark (Python)

Original post: 2015-04-24 16:43:16. Tags: python, apache-spark

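I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?

For example, with idx=[0,4,5,6,8], I would like to get a new RDD containing elements 0, 4, 5, 6 and 8.

Note that I am not interested in the random sampling methods that are available.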

1 Answer

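Yes, you can either:

1. Key your RDD by the set of values, put those values in another RDD, and do a leftOuterJoin (with the values RDD on the left) to merge them, keeping only the elements whose key is in the set.
2. Put all the values in a broadcast variable (as a simple set) so that it is shared across the executors, then run a filter operation that checks whether each point's key is present in the set.

If the list of values is large, choose option 1; otherwise choose option 2.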
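Since the question is about PySpark, option 2 might look roughly like the sketch below. It assumes that "index" means an element's position as assigned by `zipWithIndex`, which the answer itself does not specify. The names `subset_by_index` and `demo` are hypothetical, and the string elements are only a stand-in for LabeledPoints.

```python
def subset_by_index(sc, rdd, indices):
    """Return the elements of `rdd` whose zipWithIndex position is in `indices`."""
    # Broadcast the wanted positions once, as a set, for cheap membership tests.
    wanted = sc.broadcast(set(indices))
    return (
        rdd.zipWithIndex()                             # (element, position) pairs
        .filter(lambda pair: pair[1] in wanted.value)  # keep wanted positions only
        .map(lambda pair: pair[0])                     # drop the position again
    )


def demo():
    # Needs a local PySpark installation; imported here so the helper above
    # stays readable and reusable on its own.
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "subset-by-index")
    try:
        # Illustrative data: ten placeholder elements instead of LabeledPoints.
        points = sc.parallelize(["p%d" % i for i in range(10)])
        return subset_by_index(sc, points, [0, 4, 5, 6, 8]).collect()
    finally:
        sc.stop()
```

Calling `demo()` should return the elements at positions 0, 4, 5, 6 and 8. Only the small index set is shipped to the executors, which is why this variant suits a short list of values.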

Edited to show a code example for case 1.

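// Key the wanted values by themselves so they can be joined against.
// The values below come from the question; read your own list the same way
// you read your points. `sc` is assumed to be your SparkContext.
val filteringValues = sc.parallelize(Seq(0, 4, 5, 6, 8))
            .keyBy(identity)

val filtered = parsedData
            .keyBy(_.something) // Placeholder: get the number from your inner structure
            .rightOuterJoin(filteringValues) // This selects only the keys in your subset
            .flatMap(x => x._2._1) // Unwrap the Option and map back to the original type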