Submit Spark job in Azure Synapse from Java

Azure Synapse provides a managed Spark pool to which Spark jobs can be submitted.

  1. How do I submit a Spark job (as JARs) along with its dependencies to the pool using Java?
  2. If multiple jobs are submitted (each with its own set of dependencies), are the dependencies shared across the jobs, or are they agnostic of each other?

For (1):

Add the following dependencies:

    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-analytics-synapse-spark</artifactId>
        <version>1.0.0-beta.4</version>
    </dependency>
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-identity</artifactId>
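        <!-- no version element: assumes dependency versions are managed by the azure-sdk-bom; pin a version explicitly otherwise -->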
    </dependency>

With the sample code below:

import com.azure.analytics.synapse.spark.SparkBatchClient;
import com.azure.analytics.synapse.spark.SparkClientBuilder;
import com.azure.analytics.synapse.spark.models.SparkBatchJob;
import com.azure.analytics.synapse.spark.models.SparkBatchJobOptions;
import com.azure.identity.DefaultAzureCredentialBuilder;

import java.util.*;

public class SynapseService {
    private final SparkBatchClient batchClient;

    public SynapseService() {
        batchClient = new SparkClientBuilder()
                .endpoint("https://xxxx.dev.azuresynapse.net/")
                .sparkPoolName("TestPool")
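                // DefaultAzureCredential tries environment variables, managed identity, Azure CLI login, etc. in order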
                .credential(new DefaultAzureCredentialBuilder().build())
                .buildSparkBatchClient();
    }

    public SparkBatchJob submitSparkJob(String name, String mainFile, String mainClass, List<String> arguments, List<String> jars) {
        SparkBatchJobOptions options = new SparkBatchJobOptions()
                .setName(name)
                .setFile(mainFile)
                .setClassName(mainClass)
                .setArguments(arguments)
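                // JARs listed here are distributed with this job only; jobs do not share dependencies (see (2) below)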
                .setJars(jars)
                .setExecutorCount(3)
                .setExecutorCores(4)
                .setDriverCores(4)
                .setDriverMemory("6G")
                .setExecutorMemory("6G");
        return batchClient.createSparkBatchJob(options);
    }

    /**
     * All possible Livy states: https://docs.microsoft.com/en-us/rest/api/synapse/data-plane/spark-batch/get-spark-batch-jobs#livystates
     *
     * Some of the values: busy, dead, error, idle, killed, not_started, recovering, running, shutting_down, starting, success
     * @param id id of the Synapse Spark batch job
     * @param detailed whether to return detailed information about the job
     * @return the job, including its current Livy state
     */
    public SparkBatchJob getSparkJob(int id, boolean detailed) {
        return batchClient.getSparkBatchJob(id, detailed);
    }


    /**
     * Cancels an ongoing Synapse Spark job.
     * @param jobId id of the Synapse job
     */
    public void cancelSparkJob(int jobId) {
        batchClient.cancelSparkBatchJob(jobId);
    }

}

And finally, submit the Spark job:

SynapseService synapse = new SynapseService();
SparkBatchJob job = synapse.submitSparkJob("TestJob",
        "abfss://builds@xxxx.dfs.core.windows.net/core/jars/main-module_2.12-1.0.jar",
        "com.xx.Main",
        Collections.emptyList(),
        Arrays.asList("abfss://builds@xxxx.dfs.core.windows.net/core/jars/*"));
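
Submission is asynchronous: createSparkBatchJob returns as soon as the batch is created, not when it finishes. If you need to wait for the outcome, you can poll the Livy state documented above until it reaches a terminal value. A minimal sketch built on the SynapseService class from this answer (JobWaiter is a hypothetical helper; the terminal-state names come from the Livy states list, and String.valueOf is used because the return type of getState() can differ between SDK versions):

import com.azure.analytics.synapse.spark.models.SparkBatchJob;

import java.util.Set;

// Hypothetical helper: polls a submitted batch job until it reaches a terminal Livy state.
public final class JobWaiter {
    // Terminal states taken from the Livy states list linked in the Javadoc above
    private static final Set<String> TERMINAL_STATES = Set.of("success", "dead", "error", "killed");

    public static String waitForCompletion(SynapseService synapse, int jobId) throws InterruptedException {
        while (true) {
            SparkBatchJob job = synapse.getSparkJob(jobId, false);
            String state = String.valueOf(job.getState());
            if (TERMINAL_STATES.contains(state)) {
                return state;
            }
            Thread.sleep(10_000); // re-check every 10 seconds
        }
    }
}

For example, after the submission above: String finalState = JobWaiter.waitForCompletion(synapse, job.getId());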

Finally, you will need to grant the necessary role:

  1. Open Synapse Analytics Studio
  2. Manage -> Access Control
  3. Grant the Synapse Compute Operator role to the caller

To answer question (2):

When jobs are submitted in Synapse via JARs, they are equivalent to a spark-submit. So all the jobs are agnostic of each other and do not share each other's dependencies; each submission must carry its own set of JARs.
