
What is happening when I call rdd.join(rdd)

I am developing an application where I need to perform calculations for each pair of rows with the same key in an RDD. Here is the RDD structure:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// (person, (timestamp, value)) records; sc is an existing JavaSparkContext
List<Tuple2<String, Tuple2<Integer, Integer>>> dat2 = new ArrayList<>();
dat2.add(new Tuple2<>("Alice", new Tuple2<>(1, 1)));
dat2.add(new Tuple2<>("Alice", new Tuple2<>(2, 5)));
dat2.add(new Tuple2<>("Alice", new Tuple2<>(3, 78)));
dat2.add(new Tuple2<>("Bob", new Tuple2<>(1, 6)));
dat2.add(new Tuple2<>("Bob", new Tuple2<>(2, 11)));
JavaRDD<Tuple2<String, Tuple2<Integer, Integer>>> y2 = sc.parallelize(dat2);

Now, the data for every person can be viewed as (timestamp, value) pairs. I wish to know for every row the number of values happening within ±1 timestamp. (I am aware this looks like a sliding window, but I want event-level granularity.)

import org.apache.spark.api.java.JavaPairRDD;
JavaPairRDD<String, Tuple2<Integer, Integer>> pairs = JavaPairRDD.fromJavaRDD(y2); // join needs a pair RDD
JavaPairRDD<String, Tuple2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>> resultOfJoin = pairs.join(pairs);
resultOfJoin.filter(t -> Math.abs(t._2()._1()._1() - t._2()._2()._1()) <= 1); // keep pairs within ±1 timestamp

The best solution I came up with in this case was to join the RDD with itself, creating k^2 rows for every person, where k is the number of rows associated with that person. For example, Alice has k = 3 rows above, so the self-join produces 3^2 = 9 Alice pairs before the filter.

Now, I do know this is a complete disaster. I understand this will cause a shuffle (and shuffles are bad, m'kay), but I couldn't come up with anything better.

I have 3 questions:

  1. Since I filter right after the join, will it affect the stress caused by the join (in other words, will there be any optimizations)?
  2. What is the volume of rows passed over the network? (I am aware that in the worst case the result RDD will have n^2 rows.) Will the number of rows sent over the network be #workers × n (sending only one copy of each row and duplicating it on the workers) or #workers × n^2 (sending a row for every two-row combination to the result worker)?
  3. If I worked with Datasets I could join with a filter. I understand Datasets have additional optimizations for the computation graph. How much improvement, if any, should I expect if I switch to Datasets?

Since I filter right after the join, will it affect the stress caused by the join (in other words, will there be any optimizations)?

No, there will be no optimizations. The RDD API has no query optimizer: the functions you pass to transformations are opaque to Spark, so the filter cannot be pushed into the join.

What is the volume of rows passed over the network?

O(N): specifically, each record will be shuffled twice, once for each parent. You join by key, so each item goes to one, and only one, partition.
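
As an aside (my addition, not from the original answer): the second shuffle can be avoided by pre-partitioning and caching the input before the self-join. A minimal Scala sketch, assuming a pair RDD pairs: RDD[(String, (Int, Int))] equivalent to the one in the question:

import org.apache.spark.HashPartitioner

// Partition by key once and cache; the self-join then sees both sides
// already co-partitioned and does not shuffle the data again.
val partitioned = pairs.partitionBy(new HashPartitioner(pairs.getNumPartitions)).cache()
val selfJoined = partitioned.join(partitioned)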

If I worked with Datasets I could join with a filter. I understand Datasets have additional optimizations for the computation graph. How much improvement, if any, should I expect if I switch to Datasets?

The shuffle process is better optimized, but otherwise you cannot expect any case-specific optimizations.
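
For reference, such a join-with-filter could look like the sketch below; this assumes a DataFrame df with columns (id, timestamp, value), like the one constructed in the answer's snippet further down, and the alias names a and b are illustrative:

import org.apache.spark.sql.functions.abs
import spark.implicits._ // assumes a SparkSession named `spark`

// Self-join on the key, with the ±1-timestamp condition folded into the join condition.
val joined = df.as("a").join(df.as("b"),
  $"a.id" === $"b.id" && abs($"a.timestamp" - $"b.timestamp") <= 1)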

I wish to know for every row the number of values happening within ±1 timestamp.

Try window functions:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._ // for $ and toDF; assumes a SparkSession named `spark`

val w = Window.partitionBy("id").orderBy("timestamp")

rdd.toDF("id", "data")
  .select($"id", $"data._1" as "timestamp", $"data._2" as "value")
  .withColumn("lead", lead($"value", 1).over(w))
  .withColumn("lag", lag($"value", 1).over(w))
