
How to package an application in a docker image capable of running on a Spark cluster in Kubernetes?

I am new to running Spark on Kubernetes, and I have a very simple app I am trying to package to run on the Spark on K8s cluster that I have set up. The challenge I am facing is how to package my app to run on Spark. I installed Spark on K8s using the Spark Operator, and I notice that the examples use the image gcr.io/spark-operator/spark:v3.0.0 .

I noticed the Spark documentation mentions this docker-image-tool.sh script for building images, but this looks like it would be for custom-tailoring the environment. I would just like a lightweight image my app can use to run on the Spark cluster. I'm not sure how to tie it all together.

So I see two options from the documentation:

  1. Use this Dockerfile as a base image.
  2. Use the docker-image-tool.sh script.

But I'm not sure which option to go for, or when to use each one. Why use one over the other? Is there another option? Is there a prebuilt image that I can just copy my application into and use that to run?

Spark's docker-image-tool.sh is a helper script for building Spark's docker images. If you want a lightweight docker image, you can just tweak the Dockerfile that comes with the project or write your own - don't forget to edit the entrypoint.sh script as well.
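
If what you want is really just "a prebuilt image I can copy my application into", a minimal Dockerfile sketch could look like the following - the base image and the jar path are assumptions, so substitute the Spark image you built (or the operator's image you mentioned) and whatever artifact your build produces:

# Assumed base image - use the Spark image you built, or the operator's image you mentioned
FROM gcr.io/spark-operator/spark:v3.0.0
# Hypothetical jar name - copy in whatever your build actually produces
COPY target/scala-2.12/sparkle_2.12-0.1.jar /opt/spark/work-dir/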

Generally the steps for getting your Spark app to Kubernetes go like this:

  1. Create a base spark image for you to use.
  2. Push it to a docker registry - if you are using minikube, then you can use the -m flag to build it directly into the minikube environment instead (see the example after this list).
  3. Write a Spark application, package it, and then run the spark-submit command.
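
For example, on minikube something like this should work (a sketch assuming you run it from the root of an unpacked Spark distribution; -m makes the script use minikube's Docker daemon, so there is nothing to push afterwards):

./bin/docker-image-tool.sh -m -t v3.0.0 build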

Note: If you are not that invested in Kubernetes and just want to quickly try this whole platform out, you can just proxy the kube-apiserver by running the following command:

kubectl proxy

It would start serving the API server on localhost:8001; then you can submit your Spark app by running a command like:

# If you don't specify the protocol in the master URL, it'd default to https
# client mode is used here - if you want to go for `cluster mode`, you'd have to set up a service account
# spark.executor.instances = number of executor pods
# deleteOnTermination=false keeps executor pods around so you can inspect them (see below)
bin/spark-submit \
  --master k8s://http://localhost:8001 \
  --deploy-mode client \
  --name sparkle \
  --class edu.girdharshubham.Solution \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.driver.pod.name="sparkle" \
  --conf spark.kubernetes.hadoop.configMapName="HADOOP_CONF_DIR" \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  path/to/jar

Considerations for running in client mode:

  • Make sure that the node - the VM - you are running the aforementioned command from is network addressable from the pods.
  • Each executor (pod) will run a Java process.
  • The default Dockerfile uses tini - a tiny but valid init for containers.
  • spark.kubernetes.executor.deleteOnTermination - this should be your go-to conf if you are just starting out; by default pods are deleted on failure or normal termination, so setting it to false helps you debug rather quickly what's going on with your executor pods - whether they are failing or not.
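
If you later want cluster mode instead - which, as noted in the command above, needs a service account with the right permissions - the RBAC setup from the Spark documentation is roughly the following sketch (assuming the default namespace and a service account named spark):

# create a service account for the driver and give it edit rights on the namespace
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit \
  --serviceaccount=default:spark --namespace=default
# then point spark-submit at it:
#   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark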

I think I confused you a little bit. Readying a Spark distribution's image and packaging your app are two separate things. Here's how you'd deploy your app using Kubernetes as a scheduler.

Step 1: Build Spark's image

./bin/docker-image-tool.sh -r asia.gcr.io/menace -t v3.0.0 build

Step 2: Push the image to your Docker registry

./bin/docker-image-tool.sh -r asia.gcr.io/menace -t v3.0.0 push
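
Note that pushing assumes your local Docker is already authenticated against the registry; for GCR, as in this example, that would typically mean something like the following (an assumption about your setup - skip it if your registry auth is already configured):

gcloud auth configure-docker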

Step 3: Configure your Kubernetes cluster to be able to pull your image. In most cases, this just requires setting up imagePullSecrets - see the link and the sketch below.

Pull Images from a private registry
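
A sketch of what that typically looks like - the secret name regcred is a placeholder, and the registry credentials are your own:

kubectl create secret docker-registry regcred \
  --docker-server=asia.gcr.io \
  --docker-username=<username> \
  --docker-password=<password>

You can then hand the secret to Spark with --conf spark.kubernetes.container.image.pullSecrets=regcred .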

Step 4: Write your Spark app

package edu.girdharshubham

import scala.math.random

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession


object Solution {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .master("k8s://http://localhost:8001")
      .config("spark.submit.deployMode","client")
      .config("spark.executor.instances", "2")
      .config("spark.kubernetes.container.image", "asia.gcr.io/menace/spark:v3.0.0")
      .appName("sparkle")
      .getOrCreate()

    import spark.implicits._
    val someDF = Seq(
      (8, "bat"),
      (64, "mouse"),
      (-27, "horse")
    ).toDF("number", "word")


    println("========================================================")
    println(someDF.take(1).foreach(println))
    println("========================================================")

    spark.stop()
  }
}

Step 5: Run your application

sbt run

This would result in executor pods being spawned on your cluster.
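
To see them, watch the pods in your cluster with plain kubectl:

kubectl get pods -w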

Step 6: Package your application

sbt package

Step 7: Use the spark-submit command to run your app - refer to my initial answer above.
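
Tying it together, the submit would look roughly like this - the jar name is hypothetical (use whatever sbt package produced under target/), and the master URL and image are the ones from the examples above:

# jar name below is a placeholder - substitute the artifact sbt package produced
bin/spark-submit \
  --master k8s://http://localhost:8001 \
  --deploy-mode client \
  --name sparkle \
  --class edu.girdharshubham.Solution \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=asia.gcr.io/menace/spark:v3.0.0 \
  target/scala-2.12/sparkle_2.12-0.1.jar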

Now, coming to your question about packaging Spark's distribution: be careful about the version that you package and the dependencies that you use. Spark is a little iffy about versions.
