
How do you use the Google DataProc Java Client to submit spark jobs using jar files and classes in associated GS bucket?

I need to trigger Spark jobs to aggregate data from a JSON file using an API call. I use Spring Boot to create the resources. Thus, the steps for the solution are the following:

  1. The user makes a POST request with a JSON file as the input.
  2. The JSON file is stored in a Google bucket associated with the Dataproc cluster.
  3. An aggregating Spark job is triggered from within the REST method with the specified jars and classes, and the argument is the JSON file link.

I want the job to be triggered using Dataproc's Java Client instead of the console or command line. How do you do it?

We're hoping to have a more thorough guide on the official documentation shortly, but to get started, visit the following API overview: https://developers.google.com/api-client-library/java/apis/dataproc/v1

It includes links to the Dataproc javadocs; if your server is making calls on behalf of your own project and not on behalf of your end-users' Google projects, then you probably want the keyfile-based service-account auth explained here to create the Credential object you use to initialize the Dataproc client stub.
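As a rough sketch of that keyfile-based setup (the keyfile path here is a hypothetical placeholder; download the JSON key for your service account from the Google Cloud console), creating the credential with the google-api-client library might look like:

```java
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import java.io.FileInputStream;
import java.util.Collections;

// Load the service-account keyfile; the path is an assumption for illustration.
GoogleCredential credential =
    GoogleCredential.fromStream(new FileInputStream("/path/to/keyfile.json"));
// Scope the credential for Google Cloud APIs if it has no built-in scopes.
if (credential.createScopedRequired()) {
  credential = credential.createScoped(
      Collections.singletonList("https://www.googleapis.com/auth/cloud-platform"));
}
```

The resulting `credential` is what you pass into the `Dataproc.Builder` below.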

As for the Dataproc-specific parts, this just means you add the following dependency to your Maven pom file if using Maven:

<project>
  <dependencies>
    <dependency>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-dataproc</artifactId>
      <version>v1-rev4-1.21.0</version>
    </dependency>
  </dependencies>
</project>

And then you'll have code like:

import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataproc.Dataproc;
import com.google.api.services.dataproc.model.*;
import com.google.common.collect.ImmutableList;

...

Dataproc dataproc = new Dataproc.Builder(new NetHttpTransport(), new JacksonFactory(), credential)
    .setApplicationName("my-webapp/1.0")
    .build();
dataproc.projects().regions().jobs().submit(
    projectId, "global", new SubmitJobRequest()
        .setJob(new Job()
            .setPlacement(new JobPlacement()
                .setClusterName("my-spark-cluster"))
            .setSparkJob(new SparkJob()
                .setMainClass("FooSparkJobMain")
                .setJarFileUris(ImmutableList.of("gs://bucket/path/to/your/spark-job.jar"))
                .setArgs(ImmutableList.of(
                    "arg1", "arg2", "arg3")))))
    .execute();

Since different intermediary servers may do low-level retries, or your request may throw an IOException where you don't know whether the job submission succeeded or not, an additional step you may want to take is to generate your own jobId; then you know which jobId to poll to figure out whether it got submitted, even if your request times out or throws some unknown exception:

import java.io.IOException;
import java.util.UUID;

...

Dataproc dataproc = new Dataproc.Builder(new NetHttpTransport(), new JacksonFactory(), credential)
    .setApplicationName("my-webapp/1.0")
    .build();

String curJobId = "json-agg-job-" + UUID.randomUUID().toString();
Job jobSnapshot = null;
try {
  jobSnapshot = dataproc.projects().regions().jobs().submit(
      projectId, "global", new SubmitJobRequest()
          .setJob(new Job()
              .setReference(new JobReference()
                  .setJobId(curJobId))
              .setPlacement(new JobPlacement()
                  .setClusterName("my-spark-cluster"))
              .setSparkJob(new SparkJob()
                  .setMainClass("FooSparkJobMain")
                  .setJarFileUris(ImmutableList.of("gs://bucket/path/to/your/spark-job.jar"))
                  .setArgs(ImmutableList.of(
                      "arg1", "arg2", "arg3")))))
      .execute();
} catch (IOException ioe) {
  try {
    jobSnapshot = dataproc.projects().regions().jobs().get(
        projectId, "global", curJobId).execute();
    // e.g. an org.slf4j.Logger, which accepts a Throwable as the last argument.
    logger.info("Despite the exception, the job was verified as submitted", ioe);
  } catch (IOException ioe2) {
    // Handle differently; if ioe2 is a GoogleJsonResponseException you can inspect the
    // error code, and if it's a 404, the job didn't get submitted; you can add retry
    // logic in that case.
  }
}

// We can now poll dataproc.projects().regions().jobs().get(...) until the job
// reports being completed or failed.
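That polling step can be sketched as a small helper. `JobPoller`, `waitForTerminalState`, and its parameters are illustrative names (not part of the Dataproc API); in real code, `fetchState` would wrap `dataproc.projects().regions().jobs().get(projectId, "global", curJobId).execute().getStatus().getState()`. DONE, ERROR, and CANCELLED are the terminal states in the Dataproc job lifecycle.

```java
import java.util.function.Supplier;

public class JobPoller {
  /**
   * Polls fetchState until it returns a terminal Dataproc job state
   * (DONE, ERROR, or CANCELLED), sleeping between attempts.
   */
  public static String waitForTerminalState(
      Supplier<String> fetchState, int maxAttempts, long sleepMillis) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      String state = fetchState.get();
      if ("DONE".equals(state) || "ERROR".equals(state) || "CANCELLED".equals(state)) {
        return state;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while polling job state", e);
      }
    }
    throw new IllegalStateException(
        "Job did not reach a terminal state after " + maxAttempts + " attempts");
  }
}
```

In practice you would use a longer sleep (a few seconds) and possibly exponential backoff, since each poll is a billable API call.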
