
Spark: Fastest way to look up an element in an RDD

I have a custom class E which has, among others, a field word. I have a large es: RDD[E] with several hundred thousand elements and a doc: Seq[String] with typically a few hundred entries. In es, every element's word field value is unique.

My task is to look up the element in es for each of the strings in doc. It is, however, not guaranteed that such an element exists. My naive Scala/Spark implementation is:

def word2E(words: Seq[String]): Seq[E] = {
  words.map(lookupWord(_, es))
    .filter(_.isDefined)
    .map(_.get)
}

The method lookupWord() is defined as follows:

def lookupWord(w: String, es: RDD[E]): Option[E] = {
  val lookup = es.filter(_.word.equals(w))

  if (lookup.isEmpty) None
  else Some(lookup.first)
}

When I look at the Spark stages overview, lookupWord() appears to be the bottleneck. In particular, the isEmpty() calls in lookupWord can take relatively long (up to 2 s in some cases).

I have already persisted the es RDD. Is there any other way to optimize such a task, or is this as good as it gets when operating on such a dataset?

I have noticed the lookup() method in PairRDDFunctions and considered constructing a PairRDD in which the word field would serve as the key. Might that help? Drawing conclusions experimentally here is quite hard because there are so many factors involved.
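For intuition, keying the data by word once and then probing it is essentially the difference between a linear scan per query and a hash lookup. A plain-Scala sketch of that idea (hypothetical, not Spark code; the E here carries only the fields needed for the illustration):

```scala
// Hypothetical minimal stand-in for the custom class E.
case class E(word: String, value: Int)

val es = List(E("a", 1), E("b", 2), E("c", 3))

// Key once by word (the analogue of es.keyBy(_.word) on an RDD).
// Each subsequent query is then a hash lookup instead of a full scan.
val byWord: Map[String, E] = es.map(e => e.word -> e).toMap

def lookupWord(w: String): Option[E] = byWord.get(w)
```

On an actual PairRDD, lookup() can similarly avoid touching all data: if the RDD has a hash partitioner, only the partition that could contain the key is scanned.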

The problem with your implementation is that each word in words triggers a complete traversal of your RDD, followed by a collect of the matching elements. One way to solve your problem is to join the sequence of words with your RDD:

case class E(word: String, value: Int)

import org.apache.spark.{SparkConf, SparkContext}

object App {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(sparkConf)

    val entries = sc.parallelize(List(E("a", 1), E("b", 2), E("c", 3), E("c", 3)))

    val words = Seq("a", "a", "c")

    // Key each word by itself so it can take part in the join below.
    val wordsRDD = sc.parallelize(words).map(x => (x, x))

    val matchingEntries = entries
      .map(x => (x.word, x)) // key the entries by their word field
      .join(wordsRDD)        // keeps only words that occur in entries
      .map {
        case (_, (entry, _)) => entry // drop the key and the joined word
      }
      .collect

    println(matchingEntries.mkString("\n"))
  }
}

The output is

E(a,1)
E(a,1)
E(c,3)
E(c,3)
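The duplicates in the output come from the duplicated word "a" and the duplicated entry E("c", 3) in the example data. The join's semantics can be modeled on plain Scala collections (a hypothetical sketch mirroring the example data, not Spark code):

```scala
case class E(word: String, value: Int)

val entries = List(E("a", 1), E("b", 2), E("c", 3), E("c", 3))
val words = Seq("a", "a", "c")

// Group the entries by word once (what keying the RDD provides),
// then expand each word to all matching entries, as the join does.
val byWord = entries.groupBy(_.word)
val matchingEntries = words.flatMap(w => byWord.getOrElse(w, Nil))
// matchingEntries: E(a,1), E(a,1), E(c,3), E(c,3)
```

Words with no matching entry simply contribute nothing, which replicates the inner-join behavior and makes the Option/filter/get dance from the question unnecessary.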
