Please note: Although this question mentions Spark (2.1) I think this is really a Scala (2.11) question at its heart, and that any well-versed Scala dev will be able to answer it!
I have the following code that creates a Spark Dataset (basically a 2D table) and iterates it row by row. If a particular row's username
column has a value of "fizzbuzz", then I want to set a variable defined outside of the iterator and use that variable after the row iteration has finished:
val myDataset = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "mytable", "keyspace" -> "mykeyspace"))
  .load()

var foobar : String
myDataset.collect().foreach(rec =>
  if(rec.getAs("username") == "fizzbuzz") {
    foobar = rec.getAs("foobarval")
  }
)

if(foobar == null) {
  throw new Exception("The fizzbuzz user was not found.")
}
When I run this I get the following exception:
error: class $iw needs to be abstract, since:
it has 2 unimplemented members.
/** As seen from class $iw, the missing signatures are as follows.
* For convenience, these are usable as stub implementations.
*/
def foobar_=(x$1: String): Unit = ???
def foobar: String = ???
class $iw extends Serializable {
^
Any particular reason why I'm getting this?
Within a method or a non-abstract class, you must define a value for each variable; here, you leave foobar undefined, which is why the $iw wrapper class the shell generates around your code is reported as needing to be abstract. Things would work as expected if you define it with the preliminary value of null:

var foobar: String = null
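As an aside, the same rule applies in any ordinary (non-abstract) class, not just the shell wrapper; a minimal sketch (the class name is just illustrative):

class Example {
  // var foobar: String          // would not compile: the class would need to be abstract
  var foobar: String = null      // compiles: the var now has an initial value
}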
BUT: note that your code is both non-idiomatic (not following Scala's and Spark's best practices) and potentially risky/slow:

- Avoid mutable variables like foobar: immutable code is easier to reason about and really lets you take advantage of Scala's power.
- Avoid calling collect on a DataFrame unless you're sure it's very small, as collect pulls all the data from the worker nodes (of which there are many, potentially) into the single driver node, which would be slow and might cause an OutOfMemoryError.
- Using null is discouraged (as it often leads to unexpected NullPointerExceptions).

A more idiomatic version of this code would use DataFrame.filter to keep only the relevant records, and probably Option to properly represent the potentially-empty value, something like:
import spark.implicits._
val foobar: Option[String] = myDataset
  .filter($"username" === "fizzbuzz")      // keep only the relevant records
  .take(1)                                 // get the first record (if it exists) as an Array[Row]
  .headOption                              // the first item in the array, or None
  .map(r => r.getAs[String]("foobarval"))  // the value of the column "foobarval", if a row was found

if (foobar.isEmpty) {
  throw new Exception("The fizzbuzz user was not found.")
}
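If you prefer, you can also unwrap the Option directly instead of testing isEmpty; a minimal sketch (the name foobarVal is just illustrative):

val foobarVal: String = foobar.getOrElse {
  throw new Exception("The fizzbuzz user was not found.")
}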
The foobar variable should be initialized:

var foobar: String = null
Also this doesn't look right:
foobar = rec.getAs("foobarval")
and should be:
foobar = rec.getAs[String]("foobarval")
Overall this is not the way to go. It doesn't benefit from Spark's execution model at all. I'd filter and take instead:
myDataset.filter($"username" === "fizzbuzz").select("foobarval").take(1)
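A sketch of how the result of take(1) could then be consumed, assuming a SparkSession named spark for the implicits import (as in the other answers) and that foobarval is a string column:

import spark.implicits._

val foobar: Option[String] = myDataset
  .filter($"username" === "fizzbuzz")
  .select("foobarval")
  .take(1)              // Array[Row] with at most one element
  .headOption           // Option[Row]
  .map(_.getString(0))  // the single selected column, if a row was found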
You should probably be using filters and selects on your dataframe:
import spark.sqlContext.implicits._
val data = spark.sparkContext.parallelize(List(
  """{ "username": "none", "foobarval":"none" }""",
  """{ "username": "fizzbuzz", "foobarval":"expectedval" }"""))
val df = spark.read.json(data)
val foobar = df.filter($"username" === "fizzbuzz").select($"foobarval").collect.head.getString(0)
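Note that collect.head will throw if no row matches; if you want to handle the missing-user case explicitly, a variant of the same query (same assumptions as above) could be:

val foobarOpt: Option[String] = df
  .filter($"username" === "fizzbuzz")
  .select($"foobarval")
  .collect()
  .headOption             // None if no row matched
  .map(_.getString(0))    // the foobarval column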