
spark: scramble RDDs and zip them

I have two RDDs of the same length, and I want to zip them together randomly (for example, if the first RDD is (A, B, C, D) and the second is (W, X, Y, Z), I want a random zip such as (AX, BZ, CW, DY)). What is a fast way to do this with pySpark?

Is this what you need?

x = sc.parallelize(['A', 'B', 'C', 'D'])
y = sc.parallelize(['W', 'X', 'Y', 'Z'])
# takeSample(False, n) collects a random permutation of each RDD
# to the driver as a plain Python list
x = x.takeSample(False, 4)
y = y.takeSample(False, 4)
combine = list(zip(x, y))  # list() needed in Python 3, where zip returns an iterator
combine
>> [('D', 'Z'), ('B', 'X'), ('A', 'W'), ('C', 'Y')]
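Note that takeSample pulls both samples onto the driver, so this only works when the RDDs are small enough to collect; the answer below keeps the data distributed.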

You can shuffle one of the RDDs with a random sort and then align the two by position:

from pyspark.sql.functions import rand

# Swap (element, index) into (index, element) so the index becomes the join key
s = lambda x: (x[1], x[0])

def shuffle(rdd):
    # Attach a random value to every element, sort by it, then drop it;
    # toDF requires an active SparkSession
    return rdd.map(lambda x: (x, )) \
              .toDF(["data"]).withColumn("rand", rand()) \
              .orderBy("rand") \
              .rdd.map(lambda x: x.data)

shuffle(rdd1).zipWithIndex().map(s).join(rdd2.zipWithIndex().map(s)).values()
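
For reference, a minimal end-to-end sketch of how this might be invoked, assuming an active SparkSession (needed for toDF) and the same sample data as in the question; the pairing differs on every run:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize(['A', 'B', 'C', 'D'])
rdd2 = sc.parallelize(['W', 'X', 'Y', 'Z'])

# Shuffle rdd1, index both RDDs, join on the index, keep the paired values
pairs = shuffle(rdd1).zipWithIndex().map(s) \
            .join(rdd2.zipWithIndex().map(s)) \
            .values()
pairs.collect()
>> [('C', 'W'), ('A', 'X'), ('D', 'Y'), ('B', 'Z')]   # pairing varies per run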
