
Subsetting RDDs in Spark - Python

Original post: 2015-04-24 16:43:16 6 1 python / apache-spark

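I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices?

For example, with idx=[0,4,5,6,8], I would like to get a new RDD containing elements 0, 4, 5, 6 and 8.

Note that I am not interested in the random sampling methods that are available.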

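1 Answer

Yes, you can either:

1. Key your RDD by a set of values, put those values in another RDD, and do a leftOuterJoin to merge the two, keeping only the elements whose key is in the set.
2. Put all the values in a broadcast variable (as a simple Set) so that it is shared between the executors, then run a filter operation that checks whether each point is in the set.

If the list of values is large, choose 1; otherwise choose 2.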
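Option 2 could look roughly like the following in PySpark. This is a minimal sketch, not the answerer's code: it assumes "index" means an element's position in the RDD (obtained with `zipWithIndex`), and the function name `subset_by_broadcast` and its parameters are made up for illustration. `sc` is assumed to be an existing SparkContext.

```python
def subset_by_broadcast(sc, rdd, indices):
    """Return the elements of `rdd` whose position is listed in `indices`.

    Assumption: positions are the ones assigned by zipWithIndex().
    """
    # Read-only set of wanted positions, shipped once to each executor.
    wanted = sc.broadcast(set(indices))
    return (
        rdd.zipWithIndex()                             # (element, position) pairs
        .filter(lambda pair: pair[1] in wanted.value)  # keep the listed positions
        .map(lambda pair: pair[0])                     # drop the position again
    )
```

With the question's list, a call such as `subset_by_broadcast(sc, points, [0, 4, 5, 6, 8])` (where `points` is a hypothetical name for the LabeledPoint RDD) would give a new RDD holding only those five elements. This suits a short index list, since the whole set is copied to every executor.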

Edited to show a code example for case 1.

// `valuesRdd` is a placeholder: read the list of values the same way you read your points, just simpler
val filteringValues = valuesRdd
            .keyBy(identity) // Key each value by itself

val filtered = parsedData
            .keyBy(_.something) // Get the number from your inner structure
            .rightOuterJoin(filteringValues) // This keeps only the keys present in your subset
            .flatMap(x => x._2._1) // Map the Option back to the original type; unmatched keys are dropped
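Since the question asks about Python, here is a rough PySpark counterpart for case 1. It is a sketch under assumptions: elements are keyed by their position via `zipWithIndex` (the answer leaves the key as `_.something`), and a plain inner `join` replaces the outer join, since an inner join already drops unmatched keys. The name `subset_by_join` is hypothetical and `sc` is assumed to be an existing SparkContext.

```python
def subset_by_join(sc, rdd, indices):
    """Return the elements of `rdd` whose position is listed in `indices`,
    using a join so the index list never has to fit in a broadcast variable.
    """
    # (position, element) pairs; the position is the join key.
    keyed = rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    # (position, None) pairs; duplicates are removed so no element is repeated.
    wanted = sc.parallelize(sorted(set(indices))).map(lambda i: (i, None))
    # The inner join keeps only positions present on both sides.
    return keyed.join(wanted).map(lambda kv: kv[1][0])
```

A join shuffles data, so the order of the resulting elements is not guaranteed; sort by key before the final `map` if the order matters.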


