What is happening when I call rdd.join(rdd)?
I am developing an application where I need to perform a calculation for each pair of rows with the same key in an RDD. Here is the RDD structure:
List<Tuple2<String, Tuple2<Integer, Integer>>> dat2 = new ArrayList<>();
dat2.add(new Tuple2<>("Alice", new Tuple2<>(1, 1)));
dat2.add(new Tuple2<>("Alice", new Tuple2<>(2, 5)));
dat2.add(new Tuple2<>("Alice", new Tuple2<>(3, 78)));
dat2.add(new Tuple2<>("Bob", new Tuple2<>(1, 6)));
dat2.add(new Tuple2<>("Bob", new Tuple2<>(2, 11)));
JavaRDD<Tuple2<String, Tuple2<Integer, Integer>>> y2 = sc.parallelize(dat2);
Now, the data for every person can be viewed as (timestamp, value). I wish to know, for every row, the number of values occurring within ±1 timestamp. (I am aware this looks like a sliding window, but I want event-level granularity.)
JavaPairRDD<String, Tuple2<Integer, Integer>> pairs = JavaPairRDD.fromJavaRDD(y2);
JavaPairRDD<String, Tuple2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>> resultOfJoin =
    pairs.join(pairs);
resultOfJoin.filter(t -> t._2()._1()._1() - t._2()._2()._1() <= 1
                      && t._2()._1()._1() - t._2()._2()._1() >= -1);
The best solution I came up with in this case was to join the RDD with itself, creating k^2 rows for every person, where k is the number of rows associated with that person.
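As a rough, Spark-free sketch of what the self-join plus filter produces (plain Java, using Alice's three timestamps from the sample data; `SelfJoinSketch` and its helper names are illustrative, not part of any API):

```java
import java.util.ArrayList;
import java.util.List;

public class SelfJoinSketch {
    // The self-join emits every ordered pair of rows sharing a key: k^2 pairs.
    static List<int[]> selfJoin(int[] timestamps) {
        List<int[]> joined = new ArrayList<>();
        for (int a : timestamps) {
            for (int b : timestamps) {
                joined.add(new int[]{a, b});
            }
        }
        return joined;
    }

    // The filter keeps only pairs whose timestamps differ by at most 1.
    static long withinOne(List<int[]> pairs) {
        return pairs.stream().filter(p -> Math.abs(p[0] - p[1]) <= 1).count();
    }

    public static void main(String[] args) {
        int[] alice = {1, 2, 3};                       // Alice's timestamps, k = 3
        List<int[]> joined = selfJoin(alice);
        System.out.println("after join: " + joined.size());        // 9 = k^2
        System.out.println("after filter: " + withinOne(joined));  // 7
    }
}
```

For Alice alone the join materializes 9 pairs and the filter keeps 7, which is why the blow-up is quadratic per key.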
Now, I do know this is a complete disaster. I understand this will cause a shuffle (and shuffles are bad, m'kay), but I couldn't come up with anything better.
I have 3 questions:
Since I filter right after the join, will it affect the stress caused by the join (in other words, will there be any optimizations)?

No, there will be no optimizations.
What is the volume of rows passed over the network?

O(N); specifically, each record will be shuffled twice, once for each parent. You join by key, so each item goes to one, and only one, partition.
If I worked with Datasets I could join with a filter. I understand Datasets have additional optimizations for the computation graph. How much improvement, if any, should I expect if I switch to Datasets?

The shuffle process is better optimized, but otherwise you cannot expect any case-specific optimizations.
I wish to know for every row the number of values happening within ±1 timestamp.

Try window functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val w = Window.partitionBy("id").orderBy("timestamp")

rdd.toDF("id", "data")
  .select($"id", $"data._1" as "timestamp", $"data._2" as "value")
  .withColumn("lead", lead($"value", 1).over(w))
  .withColumn("lag", lag($"value", 1).over(w))
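Note that lead/lag look at the adjacent rows, not at a ±1 range of timestamp values. If the goal is to count all values within ±1 timestamp, a hedged sketch using a range frame might look like this (shown in the Java Dataset API to match the question's language; it assumes a DataFrame `df` with columns id, timestamp, value as selected above, and requires a running SparkSession):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

// Frame covering all rows whose timestamp lies in [current - 1, current + 1].
WindowSpec w = Window.partitionBy("id")
        .orderBy("timestamp")
        .rangeBetween(-1, 1);

// One count per row: how many events fall within the +-1 timestamp range.
// The count includes the row itself; subtract 1 to exclude it.
Dataset<Row> counted = df.withColumn("neighbors", count(col("value")).over(w));
```

Because the frame is defined on the ordering column's values rather than on row positions, this handles multiple events at the same timestamp and gaps in the sequence, which lead/lag alone do not.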