Spark 2.2.0 API: Should I prefer a Dataset with groupBy combined with an aggregate, or an RDD with reduceByKey?

Question

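Hi all. Given the title, some may say this has already been answered, but what I want to compare is the performance of reduceByKey versus groupByKey specifically across the Dataset and RDD APIs. Many posts say that reduceByKey is more efficient than groupByKey, and I agree. What confuses me is how these methods behave with a Dataset as opposed to an RDD. Which one should be used in each case?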

To be more specific, I will present my problem together with my current solution and working code, and I would welcome any suggestions for improving it.

+---+------------------+-----+
|id |Text1             |Text2|
+---+------------------+-----+
|1  |one,two,three     |one  |
|2  |four,one,five     |six  |
|3  |seven,nine,one,two|eight|
|4  |two,three,five    |five |
|5  |six,five,one      |seven|
+---+------------------+-----+

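The goal is to check, for each value in the third column (Text2), which rows of the second column (Text1) contain it, and then collect the ids of those rows. For example, the word «one» from the third column appears in the second-column sentences with ids 1, 5, 2 and 3.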

+-----+------------+
|Text2|Set         |
+-----+------------+
|seven|[3]         |
|one  |[1, 5, 2, 3]|
|six  |[5]         |
|five |[5, 2, 4]   |
+-----+------------+

Here is my working code:

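import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_set;

// `spark` is an existing SparkSession.
List<Row> data = Arrays.asList(
        RowFactory.create(1, "one,two,three", "one"),
        RowFactory.create(2, "four,one,five", "six"),
        RowFactory.create(3, "seven,nine,one,two", "eight"),
        RowFactory.create(4, "two,three,five", "five"),
        RowFactory.create(5, "six,five,one", "seven")
);

StructType schema = new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("Text1", DataTypes.StringType, false, Metadata.empty()),
        new StructField("Text2", DataTypes.StringType, false, Metadata.empty())
});

Dataset<Row> df = spark.createDataFrame(data, schema);
df.show(false);

// Pair every (id, Text1) row with every Text2 value, then keep the pairs
// where Text1 contains Text2 (a substring match).
Dataset<Row> df1 = df.select("id", "Text1")
        .crossJoin(df.select("Text2"))
        .filter(col("Text1").contains(col("Text2")))
        .orderBy(col("Text2"));

df1.show(false);

// Collect the distinct ids for each Text2 value.
Dataset<Row> df2 = df1
        .groupBy("Text2")
        .agg(collect_set(col("id")).as("Set"));

df2.show(false);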

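My question breaks down into three sub-questions:

1. To improve performance, do I need to convert the Dataset to an RDD and use reduceByKey instead of the Dataset groupBy?
2. Which one should I use, Dataset or RDD, and why?
3. I would be grateful if you could suggest an alternative solution that is more efficient than my approach, if one exists.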

Answer 1

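TL;DR: Neither is good, but if you are already using a Dataset, stay with the Dataset.

Dataset.groupBy behaves like reduceByKey when it is used with a suitable aggregate function. Unfortunately, collect_set behaves almost like groupByKey when the number of duplicates is low, since nearly every id still has to be shuffled. Rewriting it with reduceByKey will not change anything.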
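
To make the distinction concrete, here is a minimal plain-Java sketch (no Spark involved) that models partial aggregation on a single partition. It is only an illustration under simplifying assumptions, not Spark's actual execution path: the class and method names are hypothetical, and the input is the set of (word, id) pairs that the question's filter step produces. It counts how many values would leave the partition when each key is reduced to a single number, as with a count, versus when each key is reduced to a set of distinct ids, as with collect_set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical plain-Java model of map-side (partial) aggregation; not Spark code.
class PartialAggregationSketch {

    // One (word, id) record, as produced by the filter step in the question.
    static final class Pair {
        final String word;
        final int id;
        Pair(String word, int id) { this.word = word; this.id = id; }
    }

    // The nine matching (word, id) pairs from the sample table, treated as one partition.
    static List<Pair> samplePairs() {
        String[] words = {"one", "one", "one", "one", "six", "five", "five", "five", "seven"};
        int[] ids = {1, 2, 3, 5, 5, 2, 4, 5, 3};
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            pairs.add(new Pair(words[i], ids[i]));
        }
        return pairs;
    }

    // Count-style aggregation: each key collapses to one number before the shuffle.
    static int shuffledForCount(List<Pair> partition) {
        Map<String, Integer> partial = new HashMap<>();
        for (Pair p : partition) {
            partial.merge(p.word, 1, Integer::sum);
        }
        return partial.size(); // one value per distinct key
    }

    // Set-style aggregation: each key keeps every distinct id it has seen.
    static int shuffledForCollectSet(List<Pair> partition) {
        Map<String, Set<Integer>> partial = new HashMap<>();
        for (Pair p : partition) {
            partial.computeIfAbsent(p.word, k -> new HashSet<>()).add(p.id);
        }
        int total = 0;
        for (Set<Integer> ids : partial.values()) {
            total += ids.size();
        }
        return total; // one value per distinct (key, id) pair
    }

    public static void main(String[] args) {
        List<Pair> partition = samplePairs();
        System.out.println("input records:      " + partition.size());                 // 9
        System.out.println("count-style output: " + shuffledForCount(partition));      // 4
        System.out.println("set-style output:   " + shuffledForCollectSet(partition)); // 9
    }
}
```

In this sample every (word, id) pair is distinct, so building sets forwards all 9 values and saves nothing over plain grouping, while a count forwards only 4.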

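"I would be grateful if you could suggest an alternative solution that is more efficient than my approach, if one exists."

The best thing you can do is remove the crossJoin: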

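import org.apache.spark.sql.functions.{col, collect_set, explode, split}
import spark.implicits._ // for toDF; `spark` is an existing SparkSession

val df = Seq(
  (1, "one,two,three", "one"),
  (2, "four,one,five", "six"),
  (3, "seven,nine,one,two", "eight"),
  (4, "two,three,five", "five"),
  (5, "six,five,one", "seven")
).toDF("id", "Text1", "Text2")

// Split Text1 into one row per word, then equi-join on the word instead of cross-joining.
df.select(col("id"), explode(split(col("Text1"), ",")).alias("w"))
  .join(df.select(col("Text2").alias("w")), Seq("w"))
  .groupBy("w")
  .agg(collect_set(col("id")).as("Set"))
  .show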

+-----+------------+
|    w|         Set|
+-----+------------+
|seven|         [3]|
|  one|[1, 5, 2, 3]|
|  six|         [5]|
| five|   [5, 2, 4]|
+-----+------------+
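
For readers who want to check the logic without a cluster, the pipeline above can be modelled with plain Java collections. This is a hedged sketch rather than Spark code: the class and method names are hypothetical, and it assumes Text1 is always a comma-separated list without surrounding spaces. Splitting on commas plays the role of explode(split(...)), the membership test plays the role of the inner join on the word, and the per-word set plays the role of collect_set. Note that this matches whole words, whereas the contains filter in the question is a substring match, so the two can differ on data where one word is embedded in another.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical plain-Java rendering of the explode + join + collect_set pipeline; not Spark code.
class TokenJoinSketch {

    static Map<String, Set<Integer>> idsByWord(int[] ids, String[] text1, String[] text2) {
        // Words present in the Text2 column (the right side of the join).
        Set<String> wanted = new HashSet<>(Arrays.asList(text2));
        // Sorted containers only to make the printed result deterministic.
        Map<String, Set<Integer>> result = new TreeMap<>();
        for (int i = 0; i < ids.length; i++) {
            for (String word : text1[i].split(",")) {   // one entry per word in Text1
                if (wanted.contains(word)) {            // keep words that also occur in Text2
                    result.computeIfAbsent(word, k -> new TreeSet<>()).add(ids[i]);
                }
            }
        }
        return result;
    }

    // Runs the sketch on the sample table from the question.
    static Map<String, Set<Integer>> sample() {
        int[] ids = {1, 2, 3, 4, 5};
        String[] text1 = {"one,two,three", "four,one,five", "seven,nine,one,two",
                          "two,three,five", "six,five,one"};
        String[] text2 = {"one", "six", "eight", "five", "seven"};
        return idsByWord(ids, text1, text2);
    }

    public static void main(String[] args) {
        // prints {five=[2, 4, 5], one=[1, 2, 3, 5], seven=[3], six=[5]}
        System.out.println(sample());
    }
}
```

The sets agree with the table above; only the ordering differs, because the sketch sorts its output while Spark makes no ordering guarantee for collect_set or for grouped rows.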
