Hello all. To begin with: based on the title, someone might say this question has already been answered, but my point is to compare reduceByKey and groupByKey performance, specifically on the Dataset and RDD APIs. I have seen in many posts that reduceByKey is more efficient than groupByKey, and of course I agree with this. Nevertheless, I am a little confused and can't figure out how these methods behave if we use a Dataset or an RDD. Which one should be used in each case?
To be more specific, I will present my problem and my solution, along with working code, and I would appreciate any suggestions for improving it.
+---+------------------+-----+
|id |Text1 |Text2|
+---+------------------+-----+
|1 |one,two,three |one |
|2 |four,one,five |six |
|3 |seven,nine,one,two|eight|
|4 |two,three,five |five |
|5 |six,five,one |seven|
+---+------------------+-----+
The point here is to check whether the word in the third column (Text2) is contained in EACH row of the second column (Text1), and then collect the IDs of all matching rows. For example, the third-column word «one» appears in the second-column sentences with IDs 1, 5, 2, 3.
+-----+------------+
|Text2|Set |
+-----+------------+
|seven|[3] |
|one |[1, 5, 2, 3]|
|six |[5] |
|five |[5, 2, 4] |
+-----+------------+
Here is my working code:
List<Row> data = Arrays.asList(
    RowFactory.create(1, "one,two,three", "one"),
    RowFactory.create(2, "four,one,five", "six"),
    RowFactory.create(3, "seven,nine,one,two", "eight"),
    RowFactory.create(4, "two,three,five", "five"),
    RowFactory.create(5, "six,five,one", "seven")
);

StructType schema = new StructType(new StructField[]{
    new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
    new StructField("Text1", DataTypes.StringType, false, Metadata.empty()),
    new StructField("Text2", DataTypes.StringType, false, Metadata.empty())
});

Dataset<Row> df = spark.createDataFrame(data, schema);
df.show(false);

Dataset<Row> df1 = df.select("id", "Text1")
    .crossJoin(df.select("Text2"))
    .filter(col("Text1").contains(col("Text2")))
    .orderBy(col("Text2"));
df1.show(false);

Dataset<Row> df2 = df1
    .groupBy("Text2")
    .agg(collect_set(col("id")).as("Set"));
df2.show(false);
My question breaks down into three sub-questions:
TL;DR Both are bad, but if you're using Dataset, stay with Dataset.

Dataset.groupBy behaves like reduceByKey if used with a suitable function. Unfortunately, collect_set behaves pretty much like groupByKey if the number of duplicates is low. Rewriting it with reduceByKey won't change a thing.
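To see why a reduceByKey-style aggregation helps while collect_set barely does, here is a minimal plain-Python sketch (not Spark, all names hypothetical) of map-side combining. A count-like merge collapses to one value per key per partition before the shuffle, whereas a set-like merge still carries one value per distinct id, so with few duplicates the shuffled volume hardly shrinks:

```python
def shuffled_values(partitions, merge):
    """Total number of values crossing the 'shuffle' after map-side combining."""
    total = 0
    for part in partitions:
        local = {}
        for key, value in part:
            # combine locally within the partition, as a Spark combiner would
            local[key] = merge(local.get(key), value)
        for combined in local.values():
            # a scalar counts as one value; a set carries one value per element
            total += len(combined) if isinstance(combined, set) else 1
    return total

# two map-side partitions of (word, id) pairs, with few duplicate keys
parts = [[("one", 1), ("two", 1), ("one", 2)],
         [("one", 5), ("five", 4), ("five", 5)]]

count_like = shuffled_values(parts, lambda acc, v: (acc or 0) + 1)       # reduceByKey-style
set_like   = shuffled_values(parts, lambda acc, v: (acc or set()) | {v}) # collect_set-style
print(count_like, set_like)
```

The count-like merge shuffles fewer values than the original six records, while the set-like merge still shuffles all six ids, which is the sense in which collect_set behaves like groupByKey.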
I would be grateful if you could suggest a more efficient alternative to my approach, if one exists.
The best you can do is to remove the crossJoin:
val df = Seq(
  (1, "one,two,three", "one"),
  (2, "four,one,five", "six"),
  (3, "seven,nine,one,two", "eight"),
  (4, "two,three,five", "five"),
  (5, "six,five,one", "seven")).toDF("id", "text1", "text2")

df.select(col("id"), explode(split(col("text1"), ",")).alias("w"))
  .join(df.select(col("text2").alias("w")), Seq("w"))
  .groupBy("w")
  .agg(collect_set(col("id")).as("Set")).show
+-----+------------+
| w| Set|
+-----+------------+
|seven| [3]|
| one|[1, 5, 2, 3]|
| six| [5]|
| five| [5, 2, 4]|
+-----+------------+
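For intuition, the explode + join + collect_set pipeline above can be sketched in plain Python (not Spark) on the same sample rows; this is just an illustration of the logic, not of Spark's execution:

```python
rows = [(1, "one,two,three", "one"),
        (2, "four,one,five", "six"),
        (3, "seven,nine,one,two", "eight"),
        (4, "two,three,five", "five"),
        (5, "six,five,one", "seven")]

targets = {t2 for _, _, t2 in rows}   # the Text2 side of the join

result = {}
for rid, text1, _ in rows:            # explode(split(Text1, ","))
    for word in text1.split(","):
        if word in targets:           # equi-join on the word
            result.setdefault(word, set()).add(rid)   # collect_set(id)

print(result)
```

Note that a word like "eight" that never occurs in any Text1 simply produces no output row, matching the table above. Unlike the original contains() filter, this matches whole comma-separated words only.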