
spark: scramble RDDs and zip them

I have two RDDs of the same length, and I want to zip them together randomly (for example, if the first RDD is (A, B, C, D) and the second is (W, X, Y, Z), I want a random zip such as (AX, BZ, CW, DY)). What is a fast way to do this with pySpark?

Is this what you need?

x = sc.parallelize(['A', 'B', 'C', 'D'])
y = sc.parallelize(['W', 'X', 'Y', 'Z'])
# takeSample(False, n) collects a random permutation of each RDD
# to the driver as a plain Python list
x = x.takeSample(False, 4)
y = y.takeSample(False, 4)
combine = list(zip(x, y))  # list() needed in Python 3, where zip returns an iterator
combine
>> [('D', 'Z'), ('B', 'X'), ('A', 'W'), ('C', 'Y')]
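Note that takeSample pulls both samples onto the driver, so this only works when the RDDs are small enough to collect; the answer below keeps the data distributed.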

You can shuffle one of the RDDs with a random sort and then align the two by position:

from pyspark.sql.functions import rand

# Swap (element, index) into (index, element) so the index becomes the join key
s = lambda x: (x[1], x[0])

def shuffle(rdd):
    # Attach a random value to every element, sort by it, then drop it;
    # toDF requires an active SparkSession
    return rdd.map(lambda x: (x, )) \
              .toDF(["data"]).withColumn("rand", rand()) \
              .orderBy("rand") \
              .rdd.map(lambda x: x.data)

shuffle(rdd1).zipWithIndex().map(s).join(rdd2.zipWithIndex().map(s)).values()
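
For reference, a minimal end-to-end sketch of how this might be invoked, assuming an active SparkSession (needed for toDF) and the same sample data as in the question; the pairing differs on every run:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize(['A', 'B', 'C', 'D'])
rdd2 = sc.parallelize(['W', 'X', 'Y', 'Z'])

# Shuffle rdd1, index both RDDs, join on the index, keep the paired values
pairs = shuffle(rdd1).zipWithIndex().map(s) \
            .join(rdd2.zipWithIndex().map(s)) \
            .values()
pairs.collect()
>> [('C', 'W'), ('A', 'X'), ('D', 'Y'), ('B', 'Z')]   # pairing varies per run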
