
How to run a Spark Java program

I have written a Java program for Spark. But how do I compile and run it from the Unix command line? Do I have to include any jar when compiling or running it?

Combining steps from the official Quick Start Guide and Launching Spark on YARN, we get:

We'll create a very simple Spark application, SimpleApp.java:

/*** SimpleApp.java ***/
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "$YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
      "$YOUR_SPARK_HOME", new String[]{"target/simple-project-1.0.jar"});
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala example, we initialize a SparkContext, though we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend spark.api.java.function.Function. The Java programming guide describes these differences in more detail.
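As a side note: on later Spark releases (1.x and newer), where Function became a functional interface, and with Java 8 or newer, the same filters can be written as lambdas. This is only a sketch for those versions; it does not apply to the 0.9.0 API shown above:

// Equivalent filters using Java 8 lambdas (assumes a Spark 1.x+ dependency on the classpath)
long numAs = logData.filter(s -> s.contains("a")).count();
long numBs = logData.filter(s -> s.contains("b")).count();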

To build the program, we also write a Maven pom.xml file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.0-incubating</version>
    </dependency>
  </dependencies>
</project>

If you also wish to read data from Hadoop's HDFS, you will need to add a dependency on hadoop-client for your version of HDFS:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>...</version>
</dependency>
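With hadoop-client on the classpath, the same textFile call can point at an HDFS URI instead of a local path. A minimal sketch, continuing from the SimpleApp example above; the namenode host, port and path are placeholders for your own cluster:

// hdfs://namenode:8020/user/me/README.md is a hypothetical HDFS location
JavaRDD<String> hdfsData = sc.textFile("hdfs://namenode:8020/user/me/README.md");
System.out.println("Lines in HDFS file: " + hdfsData.count());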

We lay out these files according to the canonical Maven directory structure:

$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now, we can execute the application using Maven:

$ mvn package
$ mvn exec:java -Dexec.mainClass="SimpleApp"
...
Lines with a: 46, Lines with b: 23

And then follow the steps from Launching Spark on YARN:

Building a YARN-Enabled Assembly JAR

We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster. This can be built by setting the Hadoop version and SPARK_YARN environment variable, as follows:

SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

The assembled JAR will be something like this: ./assembly/target/scala-2.10/spark-assembly_0.9.0-incubating-hadoop2.0.5.jar.

The build process now also supports new YARN versions (2.2.x). See below.

Preparations

  • Building a YARN-enabled assembly (see above).
  • The assembled jar can be installed into HDFS or used locally.
  • Your application code must be packaged into a separate JAR file.

If you want to test out the YARN deployment mode, you can use the current Spark examples. A spark-examples_2.10-0.9.0-incubating file can be generated by running:

sbt/sbt assembly 

NOTE: since the documentation you're reading is for Spark version 0.9.0-incubating, we are assuming here that you have downloaded Spark 0.9.0-incubating or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.

Configuration

Most of the configs are the same for Spark on YARN as for other deploys. See the Configuration page for more information on those. The following configs are specific to Spark on YARN.

Environment variables:

  • SPARK_YARN_USER_ENV, to add environment variables to the Spark processes launched on YARN. This can be a comma-separated list of environment variables, e.g.
SPARK_YARN_USER_ENV="JAVA_HOME=/jdk64,FOO=bar"

System Properties (one way to set these from application code is sketched after this list):

  • spark.yarn.applicationMaster.waitTries, property to set the number of times the ApplicationMaster waits for the Spark master, and then also the number of tries it waits for the SparkContext to be initialized. Default is 10.
  • spark.yarn.submit.file.replication, the HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark jar, the app jar, and any distributed cache files/archives.
  • spark.yarn.preserve.staging.files, set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
  • spark.yarn.scheduler.heartbeat.interval-ms, the interval in ms at which the Spark application master heartbeats into the YARN ResourceManager. Default is 5 seconds.
  • spark.yarn.max.worker.failures, the maximum number of worker failures before failing the application. Default is the number of workers requested times 2, with a minimum of 3.
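One way to try these settings out from application code is to put them on the driver's SparkConf before creating the context. This is only a sketch under that assumption; on a real cluster you would more commonly pass them as JVM system properties or through your deployment's configuration, and the property values below are hypothetical:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnConfiguredApp {  // hypothetical class name
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("Simple App")
        // Hypothetical values -- tune them for your cluster.
        .set("spark.yarn.submit.file.replication", "3")
        .set("spark.yarn.preserve.staging.files", "true")
        .set("spark.yarn.max.worker.failures", "6");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... job logic goes here ...
    sc.stop();
  }
}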

Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. This is used to connect to the cluster, write to the DFS, and submit jobs to the ResourceManager.

There are two scheduler modes that can be used to launch a Spark application on YARN.

Launch the Spark application by YARN Client with yarn-standalone mode.

The command to launch the YARN Client is as follows:

SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR_FILE> \
  --class <APP_MAIN_CLASS> \
  --args <APP_MAIN_ARGUMENTS> \
  --num-workers <NUMBER_OF_WORKER_MACHINES> \
  --master-class <ApplicationMaster_CLASS> \
  --master-memory <MEMORY_FOR_MASTER> \
  --worker-memory <MEMORY_PER_WORKER> \
  --worker-cores <CORES_PER_WORKER> \
  --name <application_name> \
  --queue <queue_name> \
  --addJars <any_local_files_used_in_SparkContext.addJar> \
  --files <files_for_distributed_cache> \
  --archives <archives_for_distributed_cache>

For example:

# Build the Spark assembly JAR and the Spark examples JAR
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Configure logging
$ cp conf/log4j.properties.template conf/log4j.properties

# Submit Spark's ApplicationMaster to YARN's ResourceManager, and instruct Spark to run the SparkPi example
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar \
    ./bin/spark-class org.apache.spark.deploy.yarn.Client \
      --jar examples/target/scala-2.10/spark-examples-assembly-0.9.0-incubating.jar \
      --class org.apache.spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g \
      --worker-cores 1

# Examine the output (replace $YARN_APP_ID in the following with the "application identifier" output by the previous command)
# (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.)
$ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_000001/stdout
Pi is roughly 3.13794

The above starts a YARN Client program which starts the default Application Master. SparkPi will then be run as a child thread of the Application Master, and the YARN Client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running.

With this mode, your application actually runs on the remote machine where the Application Master runs. Thus applications that involve local interaction will not work well, e.g. spark-shell.

I had the same question a few days ago and yesterday managed to solve it.
That's what I've done:

  1. Download sbt, unzip and untar it: http://www.scala-sbt.org/download.html
  2. I downloaded the Spark prebuilt package for Hadoop 2, unzipped and untarred it: http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
  3. I've created the standalone application SimpleApp.scala as described at http://spark.apache.org/docs/latest/quick-start.html#standalone-applications, with a proper simple.sbt file (just copied from the description) and the proper directory layout.
  4. Make sure you have sbt in your PATH. Go to the directory with your application and build your package using sbt package.
  5. Start the Spark server using SPARK_HOME_DIR/sbin/spark_master.sh
  6. Go to localhost:8080 and make sure your server is running. Copy the master link shown in the server description (not localhost; it should be something with port 7077 or similar).
  7. Start workers using SPARK_HOME_DIR/bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT, where IP:PORT is the URL copied in step 6.
  8. Deploy your application to the server: SPARK_HOME_DIR/bin/spark-submit --class "SimpleApp" --master URL target/scala-2.10/simple-project_2.10-1.0.jar

That worked for me and I hope it will help you.
Pawel

In addition to the selected answer, if you want to connect to an external standalone Spark instance:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
     .setAppName("Simple Application")
     .setMaster("spark://10.3.50.139:7077");

JavaSparkContext sc = new JavaSparkContext(conf);
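As a quick sanity check that the connection to that master actually works (the data here is made up), you can run a trivial job right after creating the context:

// Continues from the snippet above; distributes a tiny in-memory list and counts it.
long n = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4)).count();
System.out.println("count = " + n);   // expect 4 if the master and workers are reachable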

Here you can find more "master" configurations, depending on where Spark is running: http://spark.apache.org/docs/latest/submitting-applications.html#master-urls

This answer is for Spark 2.3. If you want to test your Spark application locally, i.e. without the prerequisite of a Hadoop cluster, and even without having to start any of the standalone Spark services, you could do this:

JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("Simple App"));

And then, to run your application locally:

$SPARK_HOME/bin/spark-submit --class SimpleApp --master local target/scala-2.10/simple-project_2.10-1.0.jar

For this to work, you just need to extract the Spark tar file into $SPARK_HOME, and set $SPARK_HOME in the Spark user's .profile.
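Alternatively, if you want to skip spark-submit entirely while testing (for example from an IDE or mvn exec:java), you can set the master programmatically. This is just a sketch, assuming a Spark 2.x dependency on the classpath; the class name is made up:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalTestApp {  // hypothetical class name
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("Simple App")
        .setMaster("local[*]");  // run in-process, using all local cores
    JavaSparkContext jsc = new JavaSparkContext(conf);
    System.out.println("Default parallelism: " + jsc.defaultParallelism());
    jsc.stop();
  }
}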
