
Deploy Apache Spark application from another application in Java, best practice

I am a new user of Spark. I have a web service that allows a user to request that the server perform a complex data analysis by reading from a database and pushing the results back to the database. I have moved those analyses into various Spark applications. Currently I use spark-submit to deploy these applications.

However, I am curious: when my web server (written in Java) receives a user request, what is considered the "best practice" way to initiate the corresponding Spark application? Spark's documentation seems to recommend using spark-submit, but I would rather not pipe the command out to a terminal to perform this action. I saw an alternative, Spark-JobServer, which provides a RESTful interface to do exactly this, but my Spark applications are written in either Java or R, which does not seem to interface well with Spark-JobServer.

Is there another best practice for kicking off a Spark application from a web server (in Java) and waiting for a status result indicating whether the job succeeded or failed?

Any ideas about what other people are doing to accomplish this would be very helpful! Thanks!

I've had a similar requirement. Here's what I did:

  1. To submit apps, I use the hidden Spark REST Submission API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api

  2. Using this same API you can query the status of a driver, or you can kill your job later.

  3. There's also another hidden UI JSON API: http://[master-node]:[master-ui-port]/json/ which exposes all the information available on the master UI in JSON format.

Using "Submission API" I submit a driver and using the "Master UI API" I wait until my Driver and App state are RUNNING 使用“提交API”提交驱动程序,并使用“主UI API”等待驱动程序和应用程序状态为RUNNING

The web server can also act as the Spark driver. So it would have a SparkContext instance and contain the code for working with RDDs.

The advantage of this is that the Spark executors are long-lived. You save time by not having to start/stop them all the time. You can cache RDDs between operations.

A disadvantage is that since the executors are running all the time, they take up memory that other processes in the cluster could possibly use. Another one is that you cannot have more than one instance of the web server, since you cannot have more than one SparkContext for the same Spark application.
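If you go this route, the driver side can be as small as a single long-lived JavaSparkContext shared by the request handlers. The sketch below is only illustrative: the app name, master URL, and the toy computation are placeholders, and it assumes the Spark jars are on the web server's classpath.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class EmbeddedDriver {

        // One long-lived context for the whole web server. Spark allows only
        // one active SparkContext per JVM, which is why this setup cannot be
        // scaled to several web server instances behind a load balancer.
        private static final JavaSparkContext SC = new JavaSparkContext(
                new SparkConf()
                        .setAppName("web-server-driver")
                        .setMaster("spark://spark-master:7077"));

        // Called from a request handler: the executors are already running,
        // so there is no per-request spark-submit startup cost, and cached
        // RDDs can be reused across requests.
        public static long handleAnalysisRequest(List<Integer> values) {
            JavaRDD<Integer> rdd = SC.parallelize(values).cache();
            return rdd.filter(v -> v % 2 == 0).count();
        }

        public static void main(String[] args) {
            System.out.println(handleAnalysisRequest(Arrays.asList(1, 2, 3, 4, 5)));
        }
    }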

We are using Spark Job-server, and it works well with Java too: we just build a jar of the Java code and wrap it in Scala so it can be used with Spark Job-Server.
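For completeness, here is a hedged sketch of how the Java web server could then trigger such a job through Job-server's REST interface. The host, port 8090, app name, and class path are placeholders, and the exact endpoints should be checked against the spark-jobserver version in use (the jar must have been uploaded beforehand, e.g. via POST /jars/<appName>).

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Scanner;

    public class JobServerClient {

        // spark-jobserver listens on port 8090 by default; appName must match
        // a jar uploaded earlier, and classPath points at the Scala wrapper
        // around the Java code.
        private static final String JOBSERVER = "http://jobserver-host:8090";

        public static String startJob() throws Exception {
            URL url = new URL(JOBSERVER
                    + "/jobs?appName=my-analysis&classPath=com.example.MyAnalysisWrapper");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8")) {
                // The JSON response carries a jobId; poll GET /jobs/<jobId>
                // to see whether the job finished or failed.
                return s.useDelimiter("\\A").next();
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(startJob());
        }
    }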
