I have two RDDs of the same length and I want to zip them together randomly. For example, if the first RDD is (A,B,C,D) and the second is (W,X,Y,Z), I want a random zip such as (AX, BZ, CW, DY). What is a quick way to do this with pySpark?
Is this what you need? Note that takeSample collects the sampled elements to the driver, so this only works when the RDDs are small enough to fit there.
x = sc.parallelize(['A', 'B', 'C', 'D'])
y = sc.parallelize(['W', 'X', 'Y', 'Z'])
# takeSample(withReplacement=False, num=4) returns a list in random order
x = x.takeSample(False, 4)
y = y.takeSample(False, 4)
combine = list(zip(x, y))  # list(...) needed on Python 3, where zip is lazy
combine
>> [('D', 'Z'), ('B', 'X'), ('A', 'W'), ('C', 'Y')]
You can shuffle one RDD distributedly and then join the two by position:
from pyspark.sql.functions import rand

# swap the (value, index) pairs produced by zipWithIndex to (index, value)
s = lambda x: (x[1], x[0])

def shuffle(rdd):
    # wrap each element in a one-field Row, attach a random column,
    # sort by it, and unwrap back to an RDD of the original values
    return rdd.map(lambda x: (x, )) \
        .toDF(["data"]).withColumn("rand", rand()) \
        .orderBy("rand") \
        .rdd.map(lambda x: x.data)

shuffle(rdd1).zipWithIndex().map(s).join(rdd2.zipWithIndex().map(s)).values()
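The pipeline above needs an active SparkSession for toDF. As a plain-Python sketch of what it computes (no Spark involved; random_zip is a name introduced here for illustration): shuffle one sequence, then pair elements positionally.

```python
import random

def random_zip(xs, ys, seed=None):
    # Shuffle a copy of xs, then pair it element-by-element with ys.
    # This mirrors shuffle(rdd1) followed by the index join on rdd2.
    rng = random.Random(seed)
    shuffled = list(xs)
    rng.shuffle(shuffled)
    return list(zip(shuffled, ys))

pairs = random_zip(['A', 'B', 'C', 'D'], ['W', 'X', 'Y', 'Z'], seed=0)
# Every element of each input appears exactly once across the pairs.
```

In the Spark version the join on zipWithIndex keys plays the role of the positional zip here; the result is an unordered pair RDD rather than an ordered list.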