
Is the SparkSession variable staged by spark-shell (scala) a val or a var?

I am trying to convert my Spark Scala scripts (written in spark-shell) into Scala classes, objects, methods (def), etc., so that I can build JARs for spark-submit. I make many Spark SQL calls that perform timezone-sensitive timestamp computations. Because every distributed node may have a different default timezone configured, I have to set the following configuration explicitly, to make sure the timezone is always UTC for any subsequent Spark SQL timestamp manipulation performed by the Spark SQL function calls (code block) within that method.

spark.conf.set("spark.sql.session.timeZone", "UTC")

Should that method's signature include (spark: org.apache.spark.sql.SparkSession) as a parameter, so that I can always begin with an explicit statement setting the SparkSession timezone to UTC, rather than taking the chance that the distributed Spark nodes do or do not share exactly the same timezone configuration?

The next question is: how do I find out whether the spark variable set up by spark-shell is a val or a var? While searching for an answer, I found the code snippet below, hoping it would tell me whether this Scala variable is immutable or mutable. But it did not tell me whether spark is a var or a val. Do I need to return spark to the method's caller after setting spark.sql.session.timeZone to UTC, since I modified it inside my method? Currently my method signature expects two input parameters, (org.apache.spark.sql.SparkSession, org.apache.spark.sql.DataFrame), and the output is a tuple (org.apache.spark.sql.SparkSession, org.apache.spark.sql.DataFrame).

scala> def manOf[T: Manifest](t: T): Manifest[T] = manifest[T]
manOf: [T](t: T)(implicit evidence$1: Manifest[T])Manifest[T]

scala> manOf(List(1))
res3: Manifest[List[Int]] = scala.collection.immutable.List[Int]

scala> manOf(spark)
res2: Manifest[org.apache.spark.sql.SparkSession] = org.apache.spark.sql.SparkSession

Extra context: as part of launching spark-shell, the variable spark is initialized as follows:

Spark context available as 'sc' (master = yarn, app id = application_1234567890_111111).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_REDACTED)
Type in expressions to have them evaluated.
Type :help for more information.

Thanks to @Luis Miguel Mejía Suárez for providing answers and hints in the comments on my question. I ran the following experiment, which shows that spark is a mutable object: spark inside the method is simply the same reference as spark outside it. Although this side effect is not a pure functional implementation, it does save me the trouble of returning the spark object to the caller for subsequent processing. If someone has a better solution, please do share.

def x(spark: SparkSession, inputDF: DataFrame) = {
  import spark.implicits._
  spark.conf.set("spark.sql.session.timeZone", "UTC") // mutation of the object inside method

  //...spark.sql.functions...
  finalDF
}

I launched spark-shell and executed the following:

Spark context available as 'sc' (master = yarn, app id = application_1234567890_222222).
Spark session available as 'spark'.

scala> spark.conf.get("spark.sql.session.timeZone")
res1: String = America/New_York

scala> :load x.scala
x: (spark: org.apache.spark.sql.SparkSession, inputDF: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame

scala> val timeConvertedDF = x(spark, inputDF)
timeConvertedDF: org.apache.spark.sql.DataFrame = [att1: timestamp, att2: string ... 25 more fields]

scala> spark.conf.get("spark.sql.session.timeZone")
res4: String = UTC
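A possible alternative (a sketch, not tested here) is to fix the timezone once, when the session is created, so that no downstream method needs to mutate the session at all. This only applies when you control session creation (e.g. in a spark-submit driver), not in spark-shell, where the session is pre-built:

```scala
import org.apache.spark.sql.SparkSession

// Set the session timezone at creation time via the builder,
// instead of mutating an existing session inside each method.
val spark = SparkSession.builder()
  .appName("utc-session") // hypothetical app name
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()
```

With this approach, methods can simply accept the SparkSession parameter and trust that the timezone is already UTC, keeping them free of configuration side effects.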

Disclaimer: technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.