I have an RDD in this form:
rdd = sc.parallelize([('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])])
but I want to transform it into the following:
newrdd = [('A', [1, 2, 4, 5]), ('B', [2, 3, 1, 5, 10]), ('C', [3, 2, 5, 10])]
That is, I need the distinct elements of each value list. reduceByKey() doesn't help here.
How can I achieve this?
Since Spark 2.4 you can use the PySpark SQL function array_distinct:
from pyspark.sql.functions import array_distinct, col
df = rdd.toDF(("category", "values"))
df.withColumn("foo", array_distinct(col("values"))).show()
+--------+-------------------+----------------+
|category|             values|             foo|
+--------+-------------------+----------------+
|       A| [1, 2, 4, 1, 2, 5]|    [1, 2, 4, 5]|
|       B|[2, 3, 2, 1, 5, 10]|[2, 3, 1, 5, 10]|
|       C|[3, 2, 5, 10, 5, 2]|   [3, 2, 5, 10]|
+--------+-------------------+----------------+
It has the advantage of not converting the JVM objects to Python objects and is therefore more efficient than any Python UDF. However, it's a DataFrame function, so you must first convert the RDD to a DataFrame. Working with DataFrames is also recommended in most cases.
Here is a direct way to get the result in Python. Note that RDDs are immutable, so map returns a new RDD rather than modifying the original.
Set up the Spark session/context
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("SO Solution") \
    .getOrCreate()
sc = spark.sparkContext
Solution Code
rdd = sc.parallelize([('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])])
newrdd = rdd.map(lambda x : (x[0], list(set(x[1]))))
newrdd.collect()
Output
[('A', [1, 2, 4, 5]), ('B', [1, 2, 3, 5, 10]), ('C', [10, 2, 3, 5])]
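Note that the output above does not keep the elements in their original order, because a Python set is unordered, whereas the result asked for in the question keeps first-seen order. As a minimal sketch of one way around this, assuming Python 3.7+ (where dict preserves insertion order), dict.fromkeys can be used instead of set. The helper name distinct_in_order is hypothetical, and the example runs on a plain list of pairs so it does not need Spark:

```python
def distinct_in_order(values):
    """Return the distinct elements of values, keeping first-seen order."""
    # dict keys are unique and, in Python 3.7+, keep insertion order
    return list(dict.fromkeys(values))

# Same data as the RDD in the question, as a plain Python list
pairs = [('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])]

result = [(key, distinct_in_order(values)) for key, values in pairs]
print(result)
# [('A', [1, 2, 4, 5]), ('B', [2, 3, 1, 5, 10]), ('C', [3, 2, 5, 10])]
```

On the RDD itself, the same helper could presumably be applied with rdd.mapValues(distinct_in_order) in place of the map with list(set(...)).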
You can convert the array to a set to get the distinct values. Here is how; I have switched to Scala for this example.
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .appName("Test")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._
val df = spark.createDataset(List(("A", Array(1, 2, 4, 1, 2, 5)), ("B", Array(2, 3, 2, 1, 5, 10)), ("C", Array(3, 2, 5, 10, 5, 2))))
df.show()
// toSet drops duplicates but does not guarantee the original order;
// r._2.distinct would keep an Array in first-seen order instead
val dfDistinct = df.map(r => (r._1, r._2.toSet))
dfDistinct.show()
old_rdd = [('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])]
new_rdd = [(letter, set(numbers)) for letter, numbers in old_rdd]
Like this? Or use list(set(numbers)) if you really need the values to be lists. Note that old_rdd here is a plain Python list, not a Spark RDD.