
Task not serializable - Regex

I have a movie which has a title. The title contains the year of the movie, like "Movie (Year)". I want to extract the year, and I'm using a regex for this.

case class MovieRaw(movieid: Long, genres: String, title: String)
case class Movie(movieid: Long, genres: Set[String], title: String, year: Int)
val regexYear = ".*\\((\\d*)\\)".r
// genres are assumed to be pipe-separated, e.g. "Action|Comedy";
// note that titles without a "(Year)" suffix will throw a MatchError here
moviesRaw.map { case MovieRaw(i, g, t) =>
  Movie(i, g.split("\\|").toSet, t.trim() match { case regexYear(y) => y.toInt })
}

When executing the last command, I get the following error:

java.io.NotSerializableException: org.apache.spark.SparkConf

Running in the Spark/Scala REPL, with this SparkContext:

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)

As Dean explained, the cause of the problem is that the REPL creates a class out of the code added to it, so in this case the other variables defined in the same context get "pulled" into the closure along with the regex declaration.
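To make the capture mechanics concrete, here is a minimal sketch that reproduces the same failure with plain Java serialization (no Spark required; all names here are hypothetical, with a non-serializable class standing in for SparkConf):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

class NotSerializableConf  // stands in for org.apache.spark.SparkConf

class ReplLine extends Serializable {  // stands in for the REPL's generated wrapper class
  val conf = new NotSerializableConf   // sibling definition from the same REPL context
  val regexYear = ".*\\((\\d*)\\)".r   // the value the closure actually needs
  val extract = (t: String) => regexYear.findFirstIn(t)  // refers to a field, so it captures `this`
}

object Demo extends App {
  val out = new ObjectOutputStream(new ByteArrayOutputStream)
  // Serializing the closure drags in the whole ReplLine instance
  // and fails on its non-serializable `conf` field:
  out.writeObject(new ReplLine().extract)  // java.io.NotSerializableException
}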

Given the way you're creating the context, a simple way to avoid the serialization issue is to declare the SparkConf and SparkContext @transient:

@transient val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
@transient val sc = new SparkContext(conf)

You don't even need to recreate the Spark context in the REPL just to connect to Cassandra:

spark-shell --conf spark.cassandra.connection.host=localhost

You probably have this code in a larger Scala class or object (a type), right? If so, in order to serialize regexYear, the whole enclosing type gets serialized, and you probably have the SparkConf defined in that type.

This is a very common and confusing problem, and efforts are underway to prevent it, given the constraints of the JVM and the languages on top of it, like Java.

The solution (for now) is to put regexYear inside a method or another object:

object MyJob {
  def main(...) = {
    case class MovieRaw(movieid: Long, genres: String, title: String)
    case class Movie(movieid: Long, genres: Set[String], title: String, year: Int)
    // regexYear is now local to main, so the closure captures just the regex,
    // not an enclosing instance
    val regexYear = ".*\\((\\d*)\\)".r
    moviesRaw.map { case MovieRaw(i, g, t) =>
      Movie(i, g.split("\\|").toSet, t.trim() match { case regexYear(y) => y.toInt })
    }
    ...
  }
}

or

...
object small {
  case class MovieRaw(movieid: Long, genres: String, title: String)
  case class Movie(movieid: Long, genres: Set[String], title: String, year: Int)
  val regexYear = ".*\\((\\d*)\\)".r
  moviesRaw.map { case MovieRaw(i, g, t) =>
    Movie(i, g.split("\\|").toSet, t.trim() match { case regexYear(y) => y.toInt })
  }
}
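A third variant (a sketch under the same assumptions, not from the original answer): even if regexYear stays a field of the enclosing type, copying it into a local val before the map breaks the reference to `this`. scala.util.matching.Regex is itself serializable, so only the regex travels with the task. Here addYears is a hypothetical helper, and the case classes and regexYear from the snippet above are assumed to be in scope:

import org.apache.spark.rdd.RDD

def addYears(moviesRaw: RDD[MovieRaw]): RDD[Movie] = {
  val localRegex = regexYear  // local copy: the closure captures this val, not the enclosing instance
  moviesRaw.map { case MovieRaw(i, g, t) =>
    Movie(i, g.split("\\|").toSet, t.trim() match { case localRegex(y) => y.toInt })
  }
}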

Hope this helps.

Try passing the Cassandra option on the command line to spark-shell, like this:

spark-shell [other options] --conf spark.cassandra.connection.host=localhost

That way you won't have to recreate the SparkContext -- you can use the SparkContext (sc) that spark-shell instantiates automatically.
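For completeness, a minimal sketch of reading the movies through the automatically created sc with the spark-cassandra-connector (the keyspace and table names here are placeholders, not from the original question):

import com.datastax.spark.connector._  // enables sc.cassandraTable

case class MovieRaw(movieid: Long, genres: String, title: String)

// "films" / "movies_raw" stand in for your actual keyspace and table
val moviesRaw = sc.cassandraTable[MovieRaw]("films", "movies_raw")
moviesRaw.take(5).foreach(println)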
