
Java - Apache Spark communication

I'm quite new to Spark and was looking for some guidance :-)

What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via a GET request to my server.

My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core, which is a problem in itself.

Maybe the "provided" scope in Maven would help me, but I'm using Ant.

Should my application, as suggested in the docs, already have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request, roughly as sketched below? I guess it would leave the results somewhere.
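
Roughly, each request would then end up launching something like this from my server (the class name, jar, and file path are placeholders):

     spark-submit --class com.example.WordCount \
       --master local[*] \
       word-count.jar /path/to/input.txt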

Am I missing any middle-of-the-road approach?

Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
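
To make the "long-running context" idea concrete, here is a minimal sketch of a Java MVC application holding one JavaSparkContext for its whole lifetime and reusing it on every GET request; the class, master URL, and word-count logic are illustrative only (Spark 1.x Java API):

     import java.util.Arrays;
     import org.apache.spark.SparkConf;
     import org.apache.spark.api.java.JavaSparkContext;

     public class SparkService {
         // One context for the whole application, created at startup and reused per request.
         private static final JavaSparkContext sc = new JavaSparkContext(
                 new SparkConf().setAppName("mvc-word-count").setMaster("local[*]"));

         // Called from the controller that handles GET /wordcount?file=...
         public static long countWords(String fileName) {
             return sc.textFile(fileName)
                      // Spark 1.x Java API: FlatMapFunction returns an Iterable
                      .flatMap(line -> Arrays.asList(line.split("\\s+")))
                      .count();
         }
     }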

It is good practice to use a middleware service deployed on top of Spark that manages its contexts, job failures, Spark versions, and a number of other concerns.

I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.

Mist supports Scala and Python job execution.

The quick start is as follows:

  1. Add the Mist wrapper to your Spark job:
    Scala example:

     object SimpleContext extends MistJob {
       override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
         val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
         val rdd = context.parallelize(numbers)
         Map("result" -> rdd.map(x => x * 2).collect())
       }
     }

    Python example:

     import mist

     class MyJob:
         def __init__(self, job):
             job.sendResult(self.doStuff(job))

         def doStuff(self, job):
             val = job.parameters.values()
             list = val.head()
             size = list.size()
             pylist = []
             count = 0
             while count < size:
                 pylist.append(list.head())
                 count = count + 1
                 list = list.tail()
             rdd = job.sc.parallelize(pylist)
             result = rdd.map(lambda s: 2 * s).collect()
             return result

     if __name__ == "__main__":
         job = MyJob(mist.Job())
  2. Run the Mist service:

    Build Mist

     git clone https://github.com/hydrospheredata/mist.git
     cd mist
     ./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark

    Create configuration file

     mist.spark.master = "local[*]"
     mist.settings.threadNumber = 16
     mist.http.on = true
     mist.http.host = "0.0.0.0"
     mist.http.port = 2003
     mist.mqtt.on = false
     mist.recovery.on = false
     mist.contextDefaults.timeout = 100 days
     mist.contextDefaults.disposable = false
     mist.contextDefaults.sparkConf = {
       spark.default.parallelism = 128
       spark.driver.memory = "10g"
       spark.scheduler.mode = "FAIR"
     }

    Run

     spark-submit --class io.hydrosphere.mist.Mist \
       --driver-java-options "-Dconfig.file=/path/to/application.conf" \
       target/scala-2.10/mist-assembly-0.2.0.jar
  3. Try curl from the terminal:

     curl --header "Content-Type: application/json" \
       -X POST http://192.168.10.33:2003/jobs \
       --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$", "parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678", "name":"foo"}'
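
From a Java MVC application, the same request can be issued with plain HttpURLConnection; the URL and JSON payload below just mirror the curl example above, so adjust them to your own deployment:

     import java.io.OutputStream;
     import java.net.HttpURLConnection;
     import java.net.URL;
     import java.nio.charset.StandardCharsets;

     public class MistClient {
         // Posts a job request to Mist's HTTP API, mirroring the curl call above.
         public static int submitJob(String json) throws Exception {
             URL url = new URL("http://192.168.10.33:2003/jobs");
             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
             conn.setRequestMethod("POST");
             conn.setRequestProperty("Content-Type", "application/json");
             conn.setDoOutput(true);
             try (OutputStream os = conn.getOutputStream()) {
                 os.write(json.getBytes(StandardCharsets.UTF_8));
             }
             return conn.getResponseCode();  // the job result is returned in the response body
         }
     }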
