How to split an RDD into two RDDs and save the result as RDDs with PySpark?
How to split an RDD by columns into a list of RDDs in Python
Suppose we have this RDD:
RDDs = sc.parallelize([["panda", 0], ["pink", 3]])
Since the RDD has two columns, I want to split it into two RDDs, like this:
RDDList[0] = (["panda"], ["pink"])
RDDList[1] = ([0], [3])
I couldn't find any prior discussion of this topic. Is this possible?
You can do the following:
RDDs = sc.parallelize([["panda", 0], ["pink", 3]])
cols = [0, 1]
RDDList = [(RDDs.map(lambda x: [x[col]]).collect()) for col in cols]
This should give you:
print(RDDList[0])
# [['panda'], ['pink']]
print(RDDList[1])
# [[0], [3]]
I hope this answer helps.
This builds on @Ramesh Maharjan's answer so that it works for an RDD with any number of columns (Python 3.x):
RDDList = []
for i in range(len(RDDs.first())):
    # collect() inside the loop forces each lambda to run while i
    # still holds the current column index, so the closure is safe here.
    RDDList.append(RDDs.map(lambda x: [x[i]]).collect())
print(RDDList[0])
print(RDDList[1])
Expected output:
[['panda'], ['pink']]
[[0], [3]]
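Note that `.collect()` returns plain Python lists on the driver rather than RDDs; dropping `.collect()` from the `map` calls above would keep each column as an RDD, which is what the question title asks for. The per-column split logic itself can be sketched in plain Python (no Spark required); `rows` and `col_lists` are illustrative names, not part of the original answers:

```python
# Plain-Python sketch of the per-column split, assuming rows of equal length.
rows = [["panda", 0], ["pink", 3]]

# One list per column; each element is wrapped in a single-item list,
# mirroring the [x[i]] wrapping in the RDD map above.
col_lists = [[[row[i]] for row in rows] for i in range(len(rows[0]))]

print(col_lists[0])  # [['panda'], ['pink']]
print(col_lists[1])  # [[0], [3]]
```

The same shape falls out whether the mapping runs on a Spark executor or locally; only the distribution differs.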