
Databricks Spark notebook re-using Scala objects between runs?

I have written an Azure Databricks Scala notebook (based on a JAR library), and I run it using a Databricks job once every hour.

In the code, I use the Application Insights Java SDK for log tracing, and initialize a GUID that marks the "RunId". I do this in a Scala 'object' constructor:

import com.microsoft.applicationinsights.{TelemetryClient, TelemetryConfiguration}

object AppInsightsTracer
{
  // Object initializer: configure Application Insights and tag all events with a RunId
  TelemetryConfiguration.getActive().setInstrumentationKey("...")
  val tracer = new TelemetryClient()
  val properties = new java.util.HashMap[String, String]()
  properties.put("RunId", java.util.UUID.randomUUID.toString)

  def trackEvent(name: String): Unit =
  {
    tracer.trackEvent(name, properties, null)
  }
}

The notebook itself simply calls the code in the JAR:

import com.mypackage._
Flow.go()

I expect to have a different "RunId" every hour. The weird behavior I am seeing is that for all runs, I get exactly the same "RunId" in the logs! As if the Scala object constructor code is run exactly once, and is re-used between notebook runs...

Do Spark/Databricks notebooks retain context between runs? If so, how can this be avoided?

A Databricks notebook, much like a Jupyter notebook, is attached to a Spark session (think of it as a long-running JVM process) that stays alive until it either dies or you restart it explicitly. A Scala object is a singleton: its constructor body runs once per JVM, and that same instance is reused for every execution of the notebook, which is why the RunId generated in the initializer never changes.
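
For illustration, here is a minimal sketch (the RunIdDemo and Main names are made up) of the difference: a val in an object body is evaluated once per JVM, while a def is re-evaluated on every call.

object RunIdDemo
{
  // Evaluated once, the first time RunIdDemo is referenced in this JVM
  val fixedRunId: String = java.util.UUID.randomUUID.toString

  // Evaluated on every call
  def freshRunId(): String = java.util.UUID.randomUUID.toString
}

object Main
{
  def main(args: Array[String]): Unit =
  {
    println(RunIdDemo.fixedRunId)   // same value for the lifetime of the JVM
    println(RunIdDemo.fixedRunId)   // identical to the previous line
    println(RunIdDemo.freshRunId()) // a new UUID
    println(RunIdDemo.freshRunId()) // another new UUID
  }
}

As long as the notebook keeps running against the same cluster JVM, fixedRunId never changes.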

You start with a new context only when you detach and re-attach the notebook (or restart the cluster); otherwise the attached JVM, and any singletons in it, persists between runs.

I would recommend saving the RunId to a file on disk, reading that file at the start of every notebook run, and then incrementing the RunId and writing it back so each run gets a new value.
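
A minimal sketch of that approach, assuming a hypothetical DBFS location (/dbfs/tmp/myapp/run_id.txt) and a numeric RunId. The key point is to call nextRunId() from the run's entry point (for example inside Flow.go()), not from an object initializer, so it is re-evaluated on every run.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

object RunIdStore
{
  // Hypothetical path; on Databricks, /dbfs/... is backed by DBFS and survives between runs
  private val path = Paths.get("/dbfs/tmp/myapp/run_id.txt")

  // Read the previous RunId (0 if the file does not exist), increment it, persist it, return it
  def nextRunId(): Long =
  {
    val previous =
      if (Files.exists(path))
        new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim.toLong
      else
        0L
    val next = previous + 1
    Files.createDirectories(path.getParent)
    Files.write(path, next.toString.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
    next
  }
}

Alternatively, if a fresh GUID per run is enough, generating it inside the method that starts the run (a def, not a val in the object body) avoids the file entirely.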
