
Spark/Scala iterator unable to assign variables defined outside of foreach loop

Please note: although this question mentions Spark (2.1), I think this is really a Scala (2.11) question at its heart, and any well-versed Scala dev will be able to answer it!


I have the following code that creates a Spark Dataset (basically a 2D table) and iterates it row by row. If a particular row's username column has a value of "fizzbuzz", then I want to set a variable defined outside of the iterator and use that variable after the row iteration has finished:

val myDataset = sqlContext
     .read
     .format("org.apache.spark.sql.cassandra")
     .options(Map("table" -> "mytable", "keyspace" -> "mykeyspace"))
     .load()

var foobar : String
myDataset.collect().foreach(rec =>
  if(rec.getAs("username") == "fizzbuzz") {
    foobar = rec.getAs("foobarval")
  }
)

if(foobar == null) {
  throw new Exception("The fizzbuzz user was not found.")
}

When I run this I get the following exception:

error: class $iw needs to be abstract, since:
it has 2 unimplemented members.
/** As seen from class $iw, the missing signatures are as follows.
 *  For convenience, these are usable as stub implementations.
 */
  def foobar=(x$1: String): Unit = ???

class $iw extends Serializable {
      ^

Any particular reason why I'm getting this?

Within a method or a non-abstract class, you must define a value for each variable; here, you leave foobar undefined. (In the spark-shell/REPL, your top-level code is wrapped in the $iw class you see in the error, and an uninitialized var becomes an unimplemented member of that class, hence "needs to be abstract".) Things would work as expected if you gave it an initial value of null:

var foobar: String = null
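As an aside (my addition, not part of the answer): when the definition is a field rather than a local variable inside a method, which includes top-level definitions in the spark-shell, Scala 2 also accepts the underscore default, which for a reference type such as String is again null:

var foobar: String = _  // default initialization: null for reference types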

BUT: note that your code is both non-idiomatic (it doesn't follow Scala's and Spark's best practices) and potentially risky or slow:

  • You should avoid mutable variables such as foobar; immutable code is easier to reason about and really lets you take advantage of Scala's power.
  • You should avoid calling collect on a DataFrame unless you're sure it's very small, since collect brings all the data from the (potentially many) worker nodes onto the single driver node, which can be slow and may cause an OutOfMemoryError.
  • Use of null is discouraged, as it often leads to unexpected NullPointerExceptions.

A more idiomatic version of this code would use DataFrame.filter to filter the relevant records, and probably Option to properly represent the potentially-empty value, something like:

import spark.implicits._

val foobar: Option[String] = myDataset
  .filter($"username" === "fizzbuzz") // filter only relevant records
  .take(1) // get first 1 record (if it exists) as an Array[Row]
  .headOption // get the first item in the array, or None
  .map(r => r.getAs[String]("foobarval")) // get the value of the column "foobarval", or None

if (foobar.isEmpty) {
  throw new Exception("The fizzbuzz user was not found.")
}
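If you prefer, the emptiness check can also be folded into the lookup with getOrElse (a stylistic variation, not part of the original answer), which fails fast with the same error when no matching row exists:

val foobarValue: String = foobar.getOrElse(
  throw new Exception("The fizzbuzz user was not found."))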

The foobar variable should be initialized:

var foobar: String = null

Also this doesn't look right:

foobar = rec.getAs("foobarval")

and should be:

foobar = rec.getAs[String]("foobarval")
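Putting both fixes together, a minimal corrected version of the original loop (still using the mutable variable, purely to illustrate the two changes) would look like:

var foobar: String = null  // initialized, so the REPL wrapper class is no longer abstract
myDataset.collect().foreach { rec =>
  if (rec.getAs[String]("username") == "fizzbuzz") {
    foobar = rec.getAs[String]("foobarval")  // explicit type parameter
  }
}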

Overall, this is not the way to go: it doesn't benefit from Spark's execution model at all. I'd filter and take instead:

myDataset.filter($"username" === "fizzbuzz").select("foobarval").take(1)
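take(1) returns an Array[Row], so one way to finish this off (a sketch, assuming the same column names as above) is to read the single selected column and handle the missing-user case explicitly:

val rows = myDataset.filter($"username" === "fizzbuzz").select("foobarval").take(1)

val foobar: String = rows.headOption
  .map(_.getString(0))  // the first (and only) selected column
  .getOrElse(throw new Exception("The fizzbuzz user was not found."))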

You should probably be using filters and selects on your dataframe:

import spark.sqlContext.implicits._

val data = spark.sparkContext.parallelize(List(
  """{ "username": "none", "foobarval":"none" }""",
  """{ "username": "fizzbuzz", "foobarval":"expectedval" }"""))

val df = spark.read.json(data)
val foobar = df.filter($"username" === "fizzbuzz").select($"foobarval").collect.head.getString(0)
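Note that collect.head will throw if the filter matches nothing; if that case is possible, a small defensive variation (my addition, not the original answer's) is to use headOption:

val foobarOpt: Option[String] = df
  .filter($"username" === "fizzbuzz")
  .select($"foobarval")
  .collect()
  .headOption
  .map(_.getString(0))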
