
Spark: Fastest way to look up an element in an RDD

I have a custom class E which has, among others, a field word. I have a large es: RDD[E] with several hundred thousand elements and a doc: Seq[String] with typically a few hundred entries. In es, every element's word field value is unique.

My task is to look up the element in es for each of the strings in doc. It is, however, not guaranteed that such an element exists. My naive Scala/Spark implementation is:

def word2E(words: Seq[String]): Seq[E] = {
  words.map(lookupWord(_, es))
    .filter(_.isDefined)
    .map(_.get)
}

The method lookupWord() is defined as follows:

def lookupWord(w: String, es: RDD[E]): Option[E] = {
  val lookup = es.filter(_.word.equals(w))

  if (lookup.isEmpty) None
  else Some(lookup.first)
}

When I look at the Spark stages overview, lookupWord() appears to be the bottleneck. In particular, the isEmpty() calls in lookupWord can take relatively long (up to 2 s in some cases).

I have already persisted the es RDD. Is there any other way to optimize such a task, or is this as good as it gets when operating on such a dataset?

I have noticed the lookup() method in PairRDDFunctions and considered constructing a PairRDD in which the word field would serve as the key. Might that help? Drawing conclusions experimentally here is quite hard because there are so many factors involved.
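For intuition, keying the data by word once and then probing it is essentially the difference between a linear scan per query and a hash lookup. A plain-Scala sketch of that idea (hypothetical, not Spark code; the E here carries only the fields needed for the illustration):

```scala
// Hypothetical minimal stand-in for the custom class E.
case class E(word: String, value: Int)

val es = List(E("a", 1), E("b", 2), E("c", 3))

// Key once by word (the analogue of es.keyBy(_.word) on an RDD).
// Each subsequent query is then a hash lookup instead of a full scan.
val byWord: Map[String, E] = es.map(e => e.word -> e).toMap

def lookupWord(w: String): Option[E] = byWord.get(w)
```

On an actual PairRDD, lookup() can similarly avoid touching all data: if the RDD has a hash partitioner, only the partition that could contain the key is scanned.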

The problem with your implementation is that each word in words triggers a complete traversal of your RDD, followed by a collect of the matching elements. One way to solve your problem is to join the sequence of words with your RDD:

case class E(word: String, value: Int)

import org.apache.spark.{SparkConf, SparkContext}

object App {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(sparkConf)

    val entries = sc.parallelize(List(E("a", 1), E("b", 2), E("c", 3), E("c", 3)))

    val words = Seq("a", "a", "c")

    // Key each word by itself so it can take part in the join below.
    val wordsRDD = sc.parallelize(words).map(x => (x, x))

    val matchingEntries = entries
      .map(x => (x.word, x)) // key the entries by their word field
      .join(wordsRDD)        // keeps only words that occur in entries
      .map {
        case (_, (entry, _)) => entry // drop the key and the joined word
      }
      .collect

    println(matchingEntries.mkString("\n"))
  }
}

The output is

E(a,1)
E(a,1)
E(c,3)
E(c,3)
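The duplicates in the output come from the duplicated word "a" and the duplicated entry E("c", 3) in the example data. The join's semantics can be modeled on plain Scala collections (a hypothetical sketch mirroring the example data, not Spark code):

```scala
case class E(word: String, value: Int)

val entries = List(E("a", 1), E("b", 2), E("c", 3), E("c", 3))
val words = Seq("a", "a", "c")

// Group the entries by word once (what keying the RDD provides),
// then expand each word to all matching entries, as the join does.
val byWord = entries.groupBy(_.word)
val matchingEntries = words.flatMap(w => byWord.getOrElse(w, Nil))
// matchingEntries: E(a,1), E(a,1), E(c,3), E(c,3)
```

Words with no matching entry simply contribute nothing, which replicates the inner-join behavior and makes the Option/filter/get dance from the question unnecessary.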
