
Spark scala reduce multiple filtering possible on RDD?

def isSmallerScore(value: Int): Boolean = {
  val const = 200
  value < const
}
val rdd = sc.parallelize(Seq(("Java", 100), ("Python", 200), ("Scala", 300)))
val result1: RDD[(String, Int)] = rdd.filter(x => isSmallerScore(x._2))
val result2: RDD[(String, Int)] = rdd.filter(x => !isSmallerScore(x._2))

In the code above I have created two RDDs using filters: one with the smaller scores and another with the higher scores. To separate them, I have applied the filter transformation twice.

Is it possible to do this in a single filter pass? How can I avoid the second filter and still obtain both results (result1 and result2)?

Spark is not an ETL tool like Informatica BDM, Talend, Pentaho et al., where you can graphically create multiple pipelines (branches) running in parallel.

You need to cache the RDD and filter it twice to get the two RDDs.
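As a local illustration of why two filters are needed: plain Scala collections do offer a one-pass split via `partition`, but `RDD` has no equivalent single-pass split into two RDDs, so the usual approach is to cache and filter twice. A minimal sketch (the Spark part is shown in comments and assumes a `SparkContext` named `sc`, as in the question):

```scala
// With Spark, the common pattern is to cache once, then filter twice,
// so the second filter does not recompute the source (sketch only):
//
//   val cached = rdd.cache()
//   val result1 = cached.filter(x => isSmallerScore(x._2))
//   val result2 = cached.filter(x => !isSmallerScore(x._2))

def isSmallerScore(value: Int): Boolean = value < 200

// Local collections, by contrast, can split in a single pass:
val data = Seq(("Java", 100), ("Python", 200), ("Scala", 300))
val (smaller, larger) = data.partition { case (_, score) => isSmallerScore(score) }

println(smaller) // List((Java,100))
println(larger)  // List((Python,200), (Scala,300))
```

Note that `partition` traverses the collection once, whereas each `RDD.filter` is a separate transformation; caching simply avoids recomputing the lineage for the second one.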
