
Cannot make Spark run inside a Scala worksheet in IntelliJ IDEA

The following code runs with no problems if I put it inside an object that extends the App trait and run it using IDEA's run command.

However, when I try running it from a worksheet, I encounter one of these scenarios:

1- If the first line is present, I get:

Task not serializable: java.io.NotSerializableException:A$A34$A$A34

2- If the first line is commented out, I get:

Unable to generate an encoder for inner class A$A35$A$A35$A12 without access to the scope that this class was defined in.

//First line!
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

case class AClass(id: Int, f1: Int, f2: Int)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Test App")
  .getOrCreate()
import spark.implicits._

val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("f1", IntegerType),
  StructField("f2", IntegerType)))

val df = spark.read.schema(schema)
  .option("header", "true")
  .csv("dataset.csv")

// Displays the content of the DataFrame to stdout
df.show()
val ads = df.as[AClass]

//This is the line that causes serialization error
ads.foreach(x => println(x))

The project was created using IDEA's Scala plugin, and this is my build.sbt:

   ...
   scalaVersion := "2.10.6"
   scalacOptions += "-unchecked"
   libraryDependencies ++= Seq(
       "org.apache.spark" % "spark-core_2.10" % "2.1.0",
       "org.apache.spark" % "spark-sql_2.10" % "2.1.0",
       "org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
       )

I tried the solution in this answer, but it does not work in IDEA Ultimate 2017.1, which I am using. Also, when I use worksheets, I prefer not to add an extra object to the worksheet if at all possible.

If I use the collect() method on the Dataset object and get an Array of AClass instances, there are no more errors either. It is working with the Dataset directly that causes the error.
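The collect() observation above can be sketched roughly as follows. This is only an illustration under assumptions: it is written as a standalone program rather than a worksheet so that it is self-contained, the object name CollectSketch is made up, and the in-memory rows stand in for dataset.csv. It needs the same Spark dependencies as the build.sbt above.

```scala
import org.apache.spark.sql.SparkSession

// Top-level case class, so Spark can derive an encoder without an outer scope.
case class AClass(id: Int, f1: Int, f2: Int)

// Hypothetical object name, used only for this sketch.
object CollectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Collect Sketch")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for dataset.csv (illustrative values only).
    val ads = Seq(AClass(1, 10, 20), AClass(2, 30, 40)).toDS()

    // collect() returns an Array[AClass] on the driver, so the println
    // below is an ordinary local loop and is not shipped inside a Spark task.
    ads.collect().foreach(println)

    spark.stop()
  }
}
```

Note that collect() pulls every row into driver memory, so this only suits datasets small enough to fit there.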

Use Eclipse compatibility mode: open Preferences, type scala, then under Languages & Frameworks choose Scala, choose Worksheet, and select only Eclipse compatibility mode. See https://gist.github.com/RAbraham/585939e5390d46a7d6f8
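Both error messages in the question name a class nested inside a generated wrapper (A$A34$A$A34, A$A35$A$A35$A12), which suggests that the case class declared in the worksheet ends up as an inner class. The following is a minimal sketch without Spark of why that can matter for serialization. WorksheetWrapper is a hypothetical stand-in, not the class the worksheet actually generates, and the names TopLevelRow, InnerRow and OuterReferenceDemo are invented for the example.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Top-level case class: instances hold only their own fields.
case class TopLevelRow(id: Int, f1: Int, f2: Int)

// Hypothetical stand-in for a wrapper class generated around worksheet code.
// It deliberately does not extend Serializable.
class WorksheetWrapper {
  // Inner case class: each instance keeps a hidden reference to the
  // WorksheetWrapper instance it was created from.
  case class InnerRow(id: Int, f1: Int, f2: Int)
}

object OuterReferenceDemo {
  // Returns true if plain Java serialization of `obj` succeeds.
  def serializes(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val wrapper = new WorksheetWrapper
    println(serializes(TopLevelRow(1, 2, 3)))      // top-level instance
    println(serializes(wrapper.InnerRow(1, 2, 3))) // inner instance drags the wrapper along
  }
}
```

In this sketch the top-level instance should serialize, while the inner instance should fail because serializing it also tries to serialize the non-serializable wrapper. Whether this is exactly what happens inside the worksheet is an assumption, but it is consistent with the NotSerializableException naming the wrapper class.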


 