pyspark: get the distinct elements of list values

Question

I have an RDD in this form:

rdd = sc.parallelize([('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])])

but I want to transform the RDD into the following:

newrdd = [('A', [1, 2, 4, 5]), ('B', [2, 3, 1, 5, 10]), ('C', [3, 2, 5, 10])]

That is, I need the distinct elements of each value list. reduceByKey() doesn't help here.

How can I achieve this?

Answer 1

Since Spark 2.4 you can use the PySpark SQL function array_distinct:

from pyspark.sql.functions import array_distinct, col

df = rdd.toDF(("category", "values"))
df.withColumn("foo", array_distinct(col("values"))).show()
+--------+-------------------+----------------+
|category|             values|             foo|
+--------+-------------------+----------------+
|       A| [1, 2, 4, 1, 2, 5]|    [1, 2, 4, 5]|
|       B|[2, 3, 2, 1, 5, 10]|[2, 3, 1, 5, 10]|
|       C|[3, 2, 5, 10, 5, 2]|   [3, 2, 5, 10]|
+--------+-------------------+----------------+

It has the advantage of not converting the JVM objects to Python objects and is therefore more efficient than any Python UDF. However, it's a DataFrame function, so you must first convert the RDD to a DataFrame. Working with DataFrames is also recommended for most cases.

Answer 2

Here is a direct way to get the result in Python. Note that RDDs are immutable, so map returns a new RDD instead of modifying the original.

Setup Spark Session/Context

from pyspark.sql import SparkSession

spark = SparkSession.builder \
            .master("local") \
            .appName("SO Solution") \
            .getOrCreate()

sc = spark.sparkContext

Solution Code

rdd = sc.parallelize([('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])])

newrdd = rdd.map(lambda x : (x[0], list(set(x[1]))))

newrdd.collect()

Output

[('A', [1, 2, 4, 5]), ('B', [1, 2, 3, 5, 10]), ('C', [10, 2, 3, 5])]
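
Note that set() does not keep the original order of the elements, which is why the output above differs from the order requested in the question. If the order matters, something like the following might work. It is only a minimal sketch, assuming Python 3.7+ (where dict keeps insertion order); the helper name distinct_in_order is hypothetical, and a plain list stands in for the RDD so the snippet runs without Spark.

```python
def distinct_in_order(values):
    """Return the distinct elements of values, keeping first-occurrence order."""
    # dict keys are unique and, from Python 3.7 on, keep insertion order
    return list(dict.fromkeys(values))


# Plain-Python stand-in for the RDD contents from the question
pairs = [('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])]

result = [(key, distinct_in_order(values)) for key, values in pairs]
print(result)
# [('A', [1, 2, 4, 5]), ('B', [2, 3, 1, 5, 10]), ('C', [3, 2, 5, 10])]
```

With the actual RDD, the same helper could presumably be applied as rdd.mapValues(distinct_in_order).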

Answer 3

You can convert the array to a set to get the distinct values. Here is how; I have changed the syntax slightly to use Scala.

    import org.apache.spark.sql.SparkSession

    val spark : SparkSession = SparkSession.builder
      .appName("Test")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._
    val df = spark.createDataset(List(("A", Array(1, 2, 4, 1, 2, 5)), ("B", Array(2, 3, 2, 1, 5, 10)), ("C", Array(3, 2, 5, 10, 5, 2))))
    df.show()

    val dfDistinct = df.map(r=> (r._1, r._2.toSet) )
    dfDistinct.show()

Answer 4

old_rdd = [('A', [1, 2, 4, 1, 2, 5]), ('B', [2, 3, 2, 1, 5, 10]), ('C', [3, 2, 5, 10, 5, 2])]
new_rdd = [(letter, set(numbers)) for letter, numbers in old_rdd]

Like this? Note that this operates on a plain Python list rather than on an RDD.

Or use list(set(numbers)) if you really need the values to be lists.
