Filter RDD of tuples from another list of tuples

I am currently working with Spark and ScalaCheck, and I am trying to filter an RDD[(A, Long)] (where A is a record read from an Avro file and the Long is obtained from the zipWithUniqueId() function) so that it excludes a sample taken from that same RDD and stored in a buffer.

My intention is to test some properties on that sample and, once a property fails, test it again on a new sample from that RDD that does not contain any of the values sampled before. I am storing the RDD in a var so I can reassign it once I filter it. My code goes like this:

import java.util.{Queue => JQueue}
import java.util.concurrent.{ConcurrentLinkedQueue => JConcurrentLinkedQueue}
import scala.util.Random

val samplingSeed = new Random(System.currentTimeMillis()).nextLong()
val sampled = rdd.takeSample(withReplacement = false, bufferSize, samplingSeed)
val buffer: JQueue[(A, Long)] = new JConcurrentLinkedQueue[(A, Long)]

// Copy the sampled Array into the queue
sampled.foreach(t => buffer.add(t))

// rdd is declared as a var so it can be reassigned.
// Filter here and leave out all the tuples in buffer, based on the
// Long value in each tuple (foo is the predicate I am missing)
rdd = rdd.filter { foo }

How can I achieve this?

In general, filtering against a set can be done using a broadcast variable:

val rdd = sc.parallelize((1 to 10).toSeq)
val ids = sc.broadcast(Set(1, 2, 3))
rdd.filter(v => !ids.value.contains(v)).collect()
res1: Array[Int] = Array(4, 5, 6, 7, 8, 9, 10)
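
Applied to the tuples in the question, the same idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the names FilterSampled and dropSampled are made up for this example, plain strings stand in for the Avro records of type A, and a local master is used so that it runs on its own. The set of sampled ids is broadcast and each tuple is kept only if its Long id is not in that set.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of the original question or answer.
object FilterSampled {

  // Returns the tuples of `rdd` whose unique id does not appear in `sampled`.
  def dropSampled[A](sc: SparkContext,
                     rdd: RDD[(A, Long)],
                     sampled: Array[(A, Long)]): RDD[(A, Long)] = {
    // Only the Long ids are broadcast, not the full records.
    val sampledIds = sc.broadcast(sampled.map(_._2).toSet)
    rdd.filter { case (_, id) => !sampledIds.value.contains(id) }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to make the sketch self-contained.
    val conf = new SparkConf().setMaster("local[2]").setAppName("filter-sampled")
    val sc = new SparkContext(conf)
    try {
      // Strings stand in for the Avro records of type A.
      var rdd: RDD[(String, Long)] =
        sc.parallelize(Seq("a", "b", "c", "d", "e")).zipWithUniqueId()
      val total = rdd.count()

      val sampled = rdd.takeSample(withReplacement = false, num = 2, seed = 42L)

      // Reassign the var, as in the question.
      rdd = dropSampled(sc, rdd, sampled)

      // Ids are unique, so exactly sampled.length tuples were removed.
      assert(rdd.count() == total - sampled.length)
    } finally {
      sc.stop()
    }
  }
}
```

Because the ids produced by zipWithUniqueId() are unique, comparing on the Long alone is enough to identify a sampled tuple. This approach assumes the set of sampled ids is small enough to fit in memory on every executor.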
