
Java - Apache Spark communication

I'm quite new to Spark and was looking for some guidance :-)

What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via GET request to my server.
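For reference, the computation itself is trivial; a plain-Java sketch of the word count (the logic you would eventually express as Spark transformations such as `flatMap` + `reduceByKey`) might look like this. The class and method names are illustrative, not from any Spark API:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {
    // Count occurrences of each whitespace-separated word.
    // In Spark the same idea becomes flatMap + mapToPair + reduceByKey.
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be"));
    }
}
```

The interesting question is not the counting itself but where this code runs and how the web application hands it the file name.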

My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar that includes spark-core. The problem is that:

Maybe the "provided" scope in Maven would help me, but I'm using Ant.

Should my application - as suggested in the page - already have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would leave the results somewhere.
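This spark-submit-per-request approach can be sketched from Java with a `ProcessBuilder`. This is a minimal illustration, assuming `spark-submit` is on the server's `PATH`; the jar path, main class, and input file below are placeholders for your own deployment:

```java
import java.util.ArrayList;
import java.util.List;

public class SparkSubmitLauncher {
    // Build the spark-submit command line. All arguments below are
    // placeholders -- adjust them to your deployment.
    public static List<String> buildCommand(String jarPath, String mainClass, String inputFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add(jarPath);
        cmd.add(inputFile);
        return cmd;
    }

    public static int run(String jarPath, String mainClass, String inputFile) throws Exception {
        Process p = new ProcessBuilder(buildCommand(jarPath, mainClass, inputFile))
                .inheritIO()  // forward stdout/stderr to the container's log
                .start();
        return p.waitFor(); // blocks until the job finishes -- heavyweight per request
    }
}
```

Note the blocking `waitFor()`: each GET request would pay the full JVM + Spark startup cost, which is exactly the weight the answers below try to avoid.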

Am I missing any middle-of-the-road approach?

Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.

It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions, and a lot of other things to consider.

I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.

Mist supports Scala and Python jobs execution.

The quick start is as follows:

  1. Add the Mist wrapper into your Spark job:
    Scala example:

     object SimpleContext extends MistJob {
       override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
         val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
         val rdd = context.parallelize(numbers)
         Map("result" -> rdd.map(x => x * 2).collect())
       }
     }

    Python example:

     import mist

     class MyJob:
         def __init__(self, job):
             job.sendResult(self.doStuff(job))

         def doStuff(self, job):
             val = job.parameters.values()
             list = val.head()
             size = list.size()
             pylist = []
             count = 0
             while count < size:
                 pylist.append(list.head())
                 count = count + 1
                 list = list.tail()
             rdd = job.sc.parallelize(pylist)
             result = rdd.map(lambda s: 2 * s).collect()
             return result

     if __name__ == "__main__":
         job = MyJob(mist.Job())
  2. Run the Mist service:

    Build Mist

     git clone https://github.com/hydrospheredata/mist.git
     cd mist
     ./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed Spark

    Create a configuration file

     mist.spark.master = "local[*]"
     mist.settings.threadNumber = 16
     mist.http.on = true
     mist.http.host = "0.0.0.0"
     mist.http.port = 2003
     mist.mqtt.on = false
     mist.recovery.on = false
     mist.contextDefaults.timeout = 100 days
     mist.contextDefaults.disposable = false
     mist.contextDefaults.sparkConf = {
       spark.default.parallelism = 128
       spark.driver.memory = "10g"
       spark.scheduler.mode = "FAIR"
     }

    Run

     spark-submit --class io.hydrosphere.mist.Mist \
       --driver-java-options "-Dconfig.file=/path/to/application.conf" \
       target/scala-2.10/mist-assembly-0.2.0.jar
  3. Try curl from the terminal:

     curl --header "Content-Type: application/json" \
       -X POST http://192.168.10.33:2003/jobs \
       --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$", "parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678", "name":"foo"}'
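From a Java MVC application, the same request can be issued with plain JDK networking and no Spark dependency at all. This is a minimal sketch mirroring the curl example above; the endpoint, jar path, and class name must match your own Mist deployment, and a real application would use a proper JSON library instead of string concatenation:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MistClient {
    // Build the JSON body expected by Mist's /jobs endpoint (same shape as the
    // curl example above). "digits" is passed as a pre-serialized JSON array.
    public static String buildPayload(String jarPath, String className, String digits) {
        return "{\"jarPath\":\"" + jarPath + "\","
             + "\"className\":\"" + className + "\","
             + "\"parameters\":{\"digits\":" + digits + "},"
             + "\"external_id\":\"12345678\",\"name\":\"foo\"}";
    }

    // POST the payload and return the HTTP status code as a string.
    public static String submit(String endpoint, String payload) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        return String.valueOf(conn.getResponseCode());
    }
}
```

This keeps the web application's classpath free of Spark: the only coupling to the cluster is an HTTP call, which is exactly the decoupling the middleware approach is meant to buy you.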


 