
Java - Apache Spark communication

I'm quite new to Spark and was looking for some guidance :-)

What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via a GET request to my server.

My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core, which is a problem in itself.

Maybe the "provided" scope in Maven would help me, but I'm using Ant.

Should my application, as suggested in the docs, already have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request, roughly as sketched below? I guess it would leave the results somewhere.
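
Roughly, each request would then end up launching something like this from my server (the class name, jar, and file path are placeholders):

     spark-submit --class com.example.WordCount \
       --master local[*] \
       word-count.jar /path/to/input.txt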

Am I missing any middle-of-the-road approach?

Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
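
To make the "long-running context" idea concrete, here is a minimal sketch of a Java MVC application holding one JavaSparkContext for its whole lifetime and reusing it on every GET request; the class, master URL, and word-count logic are illustrative only (Spark 1.x Java API):

     import java.util.Arrays;
     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaSparkContext;

     public class SparkService {
         // One context for the whole application, created at startup and reused per request.
         private static final JavaSparkContext sc = new JavaSparkContext(
                 new SparkConf().setAppName("mvc-word-count").setMaster("local[*]"));

         // Called from the controller that handles GET /wordcount?file=...
         public static long countWords(String fileName) {
             return sc.textFile(fileName)
                      // Spark 1.x Java API: FlatMapFunction returns an Iterable
                      .flatMap(line -> Arrays.asList(line.split("\\s+")))
                      .count();
         }
     }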

It is good practice to use a middleware service deployed on top of Spark that manages its contexts, job failures, Spark versions, and a number of other concerns.

I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.

Mist supports Scala and Python job execution.

The quick start is as follows:

  1. Add the Mist wrapper to your Spark job:
    Scala example:

     object SimpleContext extends MistJob {
       override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
         val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
         val rdd = context.parallelize(numbers)
         Map("result" -> rdd.map(x => x * 2).collect())
       }
     }

    Python example:

     import mist

     class MyJob:
         def __init__(self, job):
             job.sendResult(self.doStuff(job))

         def doStuff(self, job):
             val = job.parameters.values()
             list = val.head()
             size = list.size()
             pylist = []
             count = 0
             while count < size:
                 pylist.append(list.head())
                 count = count + 1
                 list = list.tail()
             rdd = job.sc.parallelize(pylist)
             result = rdd.map(lambda s: 2 * s).collect()
             return result

     if __name__ == "__main__":
         job = MyJob(mist.Job())
  2. Run the Mist service:

    Build Mist

     git clone https://github.com/hydrospheredata/mist.git
     cd mist
     ./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark

    Create configuration file

     mist.spark.master = "local[*]"
     mist.settings.threadNumber = 16
     mist.http.on = true
     mist.http.host = "0.0.0.0"
     mist.http.port = 2003
     mist.mqtt.on = false
     mist.recovery.on = false
     mist.contextDefaults.timeout = 100 days
     mist.contextDefaults.disposable = false
     mist.contextDefaults.sparkConf = {
       spark.default.parallelism = 128
       spark.driver.memory = "10g"
       spark.scheduler.mode = "FAIR"
     }

    Run

     spark-submit --class io.hydrosphere.mist.Mist \
       --driver-java-options "-Dconfig.file=/path/to/application.conf" \
       target/scala-2.10/mist-assembly-0.2.0.jar
  3. Try curl from the terminal:

     curl --header "Content-Type: application/json" \
       -X POST http://192.168.10.33:2003/jobs \
       --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$", "parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678", "name":"foo"}'
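
From a Java MVC application, the same request can be issued with plain HttpURLConnection; the URL and JSON payload below just mirror the curl example above, so adjust them to your own deployment:

     import java.io.OutputStream;
     import java.net.HttpURLConnection;
     import java.net.URL;
     import java.nio.charset.StandardCharsets;

     public class MistClient {
         // Posts a job request to Mist's HTTP API, mirroring the curl call above.
         public static int submitJob(String json) throws Exception {
             URL url = new URL("http://192.168.10.33:2003/jobs");
             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
             conn.setRequestMethod("POST");
             conn.setRequestProperty("Content-Type", "application/json");
             conn.setDoOutput(true);
             try (OutputStream os = conn.getOutputStream()) {
                 os.write(json.getBytes(StandardCharsets.UTF_8));
             }
             return conn.getResponseCode();  // the job result is returned in the response body
         }
     }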
