What is happening when I call rdd.join(rdd)?
I am developing an application where I need to perform a calculation for each pair of rows with the same key in an RDD. Here is the RDD structure:
List<Tuple2<String, Tuple2<Integer, Integer>>> dat2 = new ArrayList<>();
dat2.add(new Tuple2<>("Alice", new Tuple2<>(1, 1)));
dat2.add(new Tuple2<>("Alice", new Tuple2<>(2, 5)));
dat2.add(new Tuple2<>("Alice", new Tuple2<>(3, 78)));
dat2.add(new Tuple2<>("Bob", new Tuple2<>(1, 6)));
dat2.add(new Tuple2<>("Bob", new Tuple2<>(2, 11)));
JavaRDD<Tuple2<String, Tuple2<Integer, Integer>>> y2 = sc.parallelize(dat2);
Now, the data for every person can be viewed as (timestamp, value). I wish to know, for every row, the number of values occurring within ±1 timestamp. (I am aware this looks like a sliding window, but I want event-level granularity.)
JavaPairRDD<String, Tuple2<Integer, Integer>> pairs = JavaPairRDD.fromJavaRDD(y2);
JavaPairRDD<String, Tuple2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>> resultOfJoin =
    pairs.join(pairs);
resultOfJoin.filter(t -> t._2()._1()._1() - t._2()._2()._1() <= 1
                      && t._2()._1()._1() - t._2()._2()._1() >= -1);
The best solution I came up with in this case was to join the RDD with itself, creating k^2 rows for every person, where k is the number of rows associated with that person.
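As a rough, Spark-free sketch of what the self-join plus filter produces (plain Java, using Alice's three timestamps from the sample data; `SelfJoinSketch` and its helper names are illustrative, not part of any API):

```java
import java.util.ArrayList;
import java.util.List;

public class SelfJoinSketch {
    // The self-join emits every ordered pair of rows sharing a key: k^2 pairs.
    static List<int[]> selfJoin(int[] timestamps) {
        List<int[]> joined = new ArrayList<>();
        for (int a : timestamps) {
            for (int b : timestamps) {
                joined.add(new int[]{a, b});
            }
        }
        return joined;
    }

    // The filter keeps only pairs whose timestamps differ by at most 1.
    static long withinOne(List<int[]> pairs) {
        return pairs.stream().filter(p -> Math.abs(p[0] - p[1]) <= 1).count();
    }

    public static void main(String[] args) {
        int[] alice = {1, 2, 3};                       // Alice's timestamps, k = 3
        List<int[]> joined = selfJoin(alice);
        System.out.println("after join: " + joined.size());        // 9 = k^2
        System.out.println("after filter: " + withinOne(joined));  // 7
    }
}
```

For Alice alone the join materializes 9 pairs and the filter keeps 7, which is why the blow-up is quadratic per key.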
Now, I do know this is a complete disaster. I understand this will cause a shuffle (and shuffles are bad, m'kay), but I couldn't come up with anything better.
I have 3 questions:
Since I filter right after the join, will it affect the stress caused by the join (in other words, will there be any optimizations)?

No, there will be no optimizations.
What is the volume of rows passed over the network?

O(N); specifically, each record will be shuffled twice, once for each parent. You join by key, so each item goes to one, and only one, partition.
If I worked with Datasets I could join with a filter. I understand Datasets have additional optimizations for the computation graph. How much improvement, if any, should I expect if I switch to Datasets?

The shuffle process is better optimized, but otherwise you cannot expect any case-specific optimizations.
I wish to know for every row the number of values happening within ±1 timestamp.

Try window functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val w = Window.partitionBy("id").orderBy("timestamp")

rdd.toDF("id", "data")
  .select($"id", $"data._1" as "timestamp", $"data._2" as "value")
  .withColumn("lead", lead($"value", 1).over(w))
  .withColumn("lag", lag($"value", 1).over(w))
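Note that lead/lag look at the adjacent rows, not at a ±1 range of timestamp values. If the goal is to count all values within ±1 timestamp, a hedged sketch using a range frame might look like this (shown in the Java Dataset API to match the question's language; it assumes a DataFrame `df` with columns id, timestamp, value as selected above, and requires a running SparkSession):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

// Frame covering all rows whose timestamp lies in [current - 1, current + 1].
WindowSpec w = Window.partitionBy("id")
        .orderBy("timestamp")
        .rangeBetween(-1, 1);

// One count per row: how many events fall within the +-1 timestamp range.
// The count includes the row itself; subtract 1 to exclude it.
Dataset<Row> counted = df.withColumn("neighbors", count(col("value")).over(w));
```

Because the frame is defined on the ordering column's values rather than on row positions, this handles multiple events at the same timestamp and gaps in the sequence, which lead/lag alone do not.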