I have a lot of key-value pairs (key, value).
For each key, I don't want the average or any other aggregation of its values; I just need any single value per key (i.e., one entry per distinct key).
Here is an example:
("1","apple")
("1","apple")
("2","orange")
("2","orange")
("1","apple")
("1","pear")
The result can be
("2","orange")
("1","apple")
or
("2","orange")
("1","pear")
I can use
reduceByKey((a, b) => a)
to get this, but since there are a lot of keys, it takes a very long time.
Does anyone have a better suggestion?
Thanks!
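For reference, the semantics being asked for can be simulated with plain Scala collections (a local sketch only, not Spark code; the object and method names here are made up for illustration):

```scala
// Local sketch of "keep one arbitrary value per key" -- the same result
// reduceByKey((a, b) => a) would produce, simulated without Spark.
object OnePerKey {
  val pairs = List(
    ("1", "apple"), ("1", "apple"), ("2", "orange"),
    ("2", "orange"), ("1", "apple"), ("1", "pear")
  )

  // groupBy gathers all values per key; head keeps the first value seen
  def onePerKey(kvs: List[(String, String)]): Map[String, String] =
    kvs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).head) }
}
```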
Yiling, you can use the distinct transformation to keep only the distinct elements of your RDD: https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.rdd.RDD
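Note that distinct deduplicates whole (key, value) pairs rather than keys, so a key can still appear with several different values. A quick local sketch of that behavior (plain Scala, not the RDD API; the object name is made up):

```scala
// Local sketch of what RDD.distinct() does to the example pairs:
// exact duplicate (key, value) pairs are removed, but key "1" still
// appears twice because it has two different values (apple and pear).
object DistinctSketch {
  val pairs = List(
    ("1", "apple"), ("1", "apple"), ("2", "orange"),
    ("2", "orange"), ("1", "apple"), ("1", "pear")
  )

  // List.distinct keeps the first occurrence of each element, in order
  val deduped: List[(String, String)] = pairs.distinct
}
```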
Actually this is a typical map-reduce problem. But since you want only one value per key, you could simply handle it in the reduce phase, although that is not the best way. You already know that using reduceByKey alone costs a lot of time in useless shuffle, which means you should pre-reduce your data on the mapper side. So the answer is obvious: use a combiner.
In Spark you can use combineByKey to remove duplicate values before they are shuffled.
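The combiner idea can be simulated locally (plain Scala, not the actual combineByKey API; object and method names are hypothetical): each "partition" first reduces its own pairs to one value per key, and only those small per-partition maps are merged, so far fewer records cross the shuffle.

```scala
// Local simulation of a map-side combiner. Each partition pre-reduces
// to one value per key; the small per-partition maps are then merged,
// keeping the first value seen for each key.
object CombinerSketch {
  def combineLocally(partition: List[(String, String)]): Map[String, String] =
    partition.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).head) }

  def merge(maps: Seq[Map[String, String]]): Map[String, String] =
    maps.foldLeft(Map.empty[String, String]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, v)) =>
        if (a.contains(k)) a else a + (k -> v) // keep first value per key
      }
    }
}
```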
==========
Besides a combiner, you can also change the shuffle implementation. The default shuffle for Spark 1.2+ is sort-based; you can switch it to hash-based shuffle, which avoids the cost of sorting keys.
Try setting this in your SparkConf:
spark.shuffle.manager = hash
spark.shuffle.consolidateFiles = true
But pay attention: too many map tasks may produce too many shuffle files, which hurts performance. spark.shuffle.consolidateFiles
is used to merge the mapper output files.
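As a sketch, those two properties can be set on the SparkConf like this (a config fragment assuming Spark 1.x on the classpath; the app name is a placeholder, and property names should be verified against your Spark version):

```scala
import org.apache.spark.SparkConf

// Config fragment (Spark 1.x): switch to hash-based shuffle and
// consolidate the per-mapper shuffle output files.
val conf = new SparkConf()
  .setAppName("dedupe-example") // placeholder name
  .set("spark.shuffle.manager", "hash")
  .set("spark.shuffle.consolidateFiles", "true")
```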
You can use dropDuplicates() on a DataFrame.
// requires: import spark.implicits._ (for .toDF)
val df = sc.parallelize(List(
  ("1", "apple"),
  ("1", "apple"),
  ("2", "orange"),
  ("2", "orange"),
  ("1", "apple"),
  ("1", "pear")
)).toDF("count", "name")
df.show()
+-----+------+
|count| name|
+-----+------+
| 1| apple|
| 1| apple|
| 2|orange|
| 2|orange|
| 1| apple|
| 1| pear|
+-----+------+
Drop duplicates by name:
val uniqueDf = df.dropDuplicates("name")
Now pick the top 2 unique rows:
uniqueDf.limit(2).show()
+-----+------+
|count| name|
+-----+------+
| 2|orange|
| 1| apple|
+-----+------+
Unique records without a limit:
uniqueDf.show()
+-----+------+
|count| name|
+-----+------+
| 2|orange|
| 1| apple|
| 1| pear|
+-----+------+
Edit:
You can use collect()
on the DataFrame to get the values into a List.