Hello all. To begin with: based on the title, someone might say this question has already been answered, but my point is to compare reduceByKey and groupByKey performance, specifically on the Dataset and RDD APIs. I have seen in many posts that reduceByKey is more efficient than groupByKey, and of course I agree with this. Nevertheless, I am a little confused and can't figure out how these methods behave if we use a Dataset or an RDD. Which one should be used in each case?
To be more specific, I will present my problem and my solution, along with working code, and I would appreciate any suggestions for improving it.
+---+------------------+-----+
|id |Text1 |Text2|
+---+------------------+-----+
|1 |one,two,three |one |
|2 |four,one,five |six |
|3 |seven,nine,one,two|eight|
|4 |two,three,five |five |
|5 |six,five,one |seven|
+---+------------------+-----+
The point here is to check whether the word in the third column (Text2) is contained in EACH row of the second column (Text1), and then collect the IDs of all matching rows. For example, the third-column word «one» appears in the second-column sentences with IDs 1, 5, 2, 3.
+-----+------------+
|Text2|Set |
+-----+------------+
|seven|[3] |
|one |[1, 5, 2, 3]|
|six |[5] |
|five |[5, 2, 4] |
+-----+------------+
Here is my working code:
List<Row> data = Arrays.asList(
    RowFactory.create(1, "one,two,three", "one"),
    RowFactory.create(2, "four,one,five", "six"),
    RowFactory.create(3, "seven,nine,one,two", "eight"),
    RowFactory.create(4, "two,three,five", "five"),
    RowFactory.create(5, "six,five,one", "seven")
);

StructType schema = new StructType(new StructField[]{
    new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
    new StructField("Text1", DataTypes.StringType, false, Metadata.empty()),
    new StructField("Text2", DataTypes.StringType, false, Metadata.empty())
});

Dataset<Row> df = spark.createDataFrame(data, schema);
df.show(false);

Dataset<Row> df1 = df.select("id", "Text1")
    .crossJoin(df.select("Text2"))
    .filter(col("Text1").contains(col("Text2")))
    .orderBy(col("Text2"));
df1.show(false);

Dataset<Row> df2 = df1
    .groupBy("Text2")
    .agg(collect_set(col("id")).as("Set"));
df2.show(false);
My question breaks down into three sub-questions:
TL;DR Both are bad, but if you're using Dataset, stay with Dataset.

Dataset.groupBy behaves like reduceByKey if used with a suitable function. Unfortunately, collect_set behaves pretty much like groupByKey if the number of duplicates is low. Rewriting it with reduceByKey won't change a thing.
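To see why a reduceByKey-style aggregation helps while collect_set barely does, here is a minimal plain-Python sketch (not Spark, all names hypothetical) of map-side combining. A count-like merge collapses to one value per key per partition before the shuffle, whereas a set-like merge still carries one value per distinct id, so with few duplicates the shuffled volume hardly shrinks:

```python
def shuffled_values(partitions, merge):
    """Total number of values crossing the 'shuffle' after map-side combining."""
    total = 0
    for part in partitions:
        local = {}
        for key, value in part:
            # combine locally within the partition, as a Spark combiner would
            local[key] = merge(local.get(key), value)
        for combined in local.values():
            # a scalar counts as one value; a set carries one value per element
            total += len(combined) if isinstance(combined, set) else 1
    return total

# two map-side partitions of (word, id) pairs, with few duplicate keys
parts = [[("one", 1), ("two", 1), ("one", 2)],
         [("one", 5), ("five", 4), ("five", 5)]]

count_like = shuffled_values(parts, lambda acc, v: (acc or 0) + 1)       # reduceByKey-style
set_like   = shuffled_values(parts, lambda acc, v: (acc or set()) | {v}) # collect_set-style
print(count_like, set_like)
```

The count-like merge shuffles fewer values than the original six records, while the set-like merge still shuffles all six ids, which is the sense in which collect_set behaves like groupByKey.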
I would be grateful if you could suggest a more efficient alternative to my approach, if one exists.
The best you can do is to remove the crossJoin:
val df = Seq(
  (1, "one,two,three", "one"),
  (2, "four,one,five", "six"),
  (3, "seven,nine,one,two", "eight"),
  (4, "two,three,five", "five"),
  (5, "six,five,one", "seven")).toDF("id", "text1", "text2")

df.select(col("id"), explode(split(col("text1"), ",")).alias("w"))
  .join(df.select(col("text2").alias("w")), Seq("w"))
  .groupBy("w")
  .agg(collect_set(col("id")).as("Set")).show
+-----+------------+
| w| Set|
+-----+------------+
|seven| [3]|
| one|[1, 5, 2, 3]|
| six| [5]|
| five| [5, 2, 4]|
+-----+------------+
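For intuition, the explode + join + collect_set pipeline above can be sketched in plain Python (not Spark) on the same sample rows; this is just an illustration of the logic, not of Spark's execution:

```python
rows = [(1, "one,two,three", "one"),
        (2, "four,one,five", "six"),
        (3, "seven,nine,one,two", "eight"),
        (4, "two,three,five", "five"),
        (5, "six,five,one", "seven")]

targets = {t2 for _, _, t2 in rows}   # the Text2 side of the join

result = {}
for rid, text1, _ in rows:            # explode(split(Text1, ","))
    for word in text1.split(","):
        if word in targets:           # equi-join on the word
            result.setdefault(word, set()).add(rid)   # collect_set(id)

print(result)
```

Note that a word like "eight" that never occurs in any Text1 simply produces no output row, matching the table above. Unlike the original contains() filter, this matches whole comma-separated words only.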