
Spark scala reduce multiple filtering possible on RDD?

def isSmallerScore(value: Int): Boolean = {
  val const = 200
  value < const
}
val rdd = sc.parallelize(Seq(("Java", 100), ("Python", 200), ("Scala", 300)))
val result1: RDD[(String, Int)] = rdd.filter(x => isSmallerScore(x._2))
val result2: RDD[(String, Int)] = rdd.filter(x => !isSmallerScore(x._2))

In the code above I have created two RDDs using filters: one with the smaller scores and another with the higher scores. To separate them, I have applied the filter transformation twice.

Is it possible to do this in a single filter pass? How can I avoid the second filter and still obtain both results (result1 and result2)?

Spark is not an ETL tool like Informatica BDM, Talend, Pentaho et al., where you can graphically create multiple pipelines (branches) running in parallel.

You need to cache the RDD and filter it twice to get the two RDDs.
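As a local illustration of why two filters are needed: plain Scala collections do offer a one-pass split via `partition`, but `RDD` has no equivalent single-pass split into two RDDs, so the usual approach is to cache and filter twice. A minimal sketch (the Spark part is shown in comments and assumes a `SparkContext` named `sc`, as in the question):

```scala
// With Spark, the common pattern is to cache once, then filter twice,
// so the second filter does not recompute the source (sketch only):
//
//   val cached = rdd.cache()
//   val result1 = cached.filter(x => isSmallerScore(x._2))
//   val result2 = cached.filter(x => !isSmallerScore(x._2))

def isSmallerScore(value: Int): Boolean = value < 200

// Local collections, by contrast, can split in a single pass:
val data = Seq(("Java", 100), ("Python", 200), ("Scala", 300))
val (smaller, larger) = data.partition { case (_, score) => isSmallerScore(score) }

println(smaller) // List((Java,100))
println(larger)  // List((Python,200), (Scala,300))
```

Note that `partition` traverses the collection once, whereas each `RDD.filter` is a separate transformation; caching simply avoids recomputing the lineage for the second one.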
