How to split an RDD by columns into a list of RDDs in Python
Let's say we have this RDD:
RDDs = sc.parallelize([["panda", 0], ["pink", 3]])
This RDD has two columns, and I want to get one RDD per column, like this:
RDDList[0] = (["panda"], ["pink"])
RDDList[1] = ([0], [3])
I couldn't find an earlier discussion of this topic. Is this even feasible?
You can do the following:
RDDs = sc.parallelize([["panda", 0], ["pink", 3]])
cols = [0, 1]
RDDList = [(RDDs.map(lambda x: [x[col]]).collect()) for col in cols]
which should give you:
print(RDDList[0])
# [['panda'], ['pink']]
print(RDDList[1])
# [[0], [3]]
I hope the answer is helpful.
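A side note on the approach above: each collect() is a Spark action, so the comprehension runs one job per column and brings every column back to the driver anyway. If the data fits on the driver, an alternative is to collect once and split the rows locally. The following is only a minimal sketch under that assumption; split_columns is a hypothetical helper name, not part of PySpark, and the rows list stands in for the result of RDDs.collect().

```python
def split_columns(rows):
    """Split a list of equal-length rows into one list per column.

    Each value is wrapped in a single-element list so the result has
    the same shape as the [['panda'], ['pink']] output shown above.
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    return [[[row[col]] for row in rows] for col in range(n_cols)]


# Stand-in for the collected data; in a Spark session this would be
# rows = RDDs.collect() (assumption: the data fits in driver memory).
rows = [["panda", 0], ["pink", 3]]
RDDList = split_columns(rows)
print(RDDList[0])  # [['panda'], ['pink']]
print(RDDList[1])  # [[0], [3]]
```

Note that this, like the answer above, yields plain Python lists rather than RDDs.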
This builds on @Ramesh Maharjan's answer so that it works for any RDD (Python 3.x):
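RDDList = []
# One pass per column; the column count is taken from the first row
for i in range(len(RDDs.first())):
    # collect() runs inside the loop, so the lambda sees the current i
    RDDList.append(RDDs.map(lambda x: [x[i]]).collect())
print(RDDList[0])
print(RDDList[1])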
Expected output:
[['panda'], ['pink']]
[[0], [3]]
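One caveat if you drop the .collect() to keep real RDDs, as the question asked: the loop then only builds lazy pipelines, and a lambda that refers to the loop variable looks it up when it finally runs, not when it was created. The sketch below shows the effect with Python's built-in lazy map as a stand-in for RDD.map (an assumption made so it runs without Spark). As far as I know PySpark serializes the function only when an action runs, so the same pitfall should apply there, but please verify on your Spark version. Binding the index as a default argument (i=i) avoids it.

```python
rows = [["panda", 0], ["pink", 3]]
n_cols = len(rows[0])

# Late binding: every lambda reads i when it is finally evaluated,
# by which time the loop has finished and i holds the last index.
late_bound = []
for i in range(n_cols):
    late_bound.append(map(lambda x: [x[i]], rows))  # lazy, not run yet

# Early binding: i=i freezes the current index in each lambda.
# Assumed Spark analogue: RDDs.map(lambda x, i=i: [x[i]])
bound = []
for i in range(n_cols):
    bound.append(map(lambda x, i=i: [x[i]], rows))

late_result = [list(m) for m in late_bound]
bound_result = [list(m) for m in bound]

print(late_result)   # [[[0], [3]], [[0], [3]]]  (both are the last column)
print(bound_result)  # [[['panda'], ['pink']], [[0], [3]]]
```

The answers above avoid this problem only because collect() forces evaluation inside the loop, while i still has the intended value.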