
Spark run multiple operations in parallel

I have a Spark application that picks a subset of a dataset and performs some operations on it. There is no dependency or interaction between the subsets and their operations, so I tried to use multiple threads to run them in parallel and improve performance. The code looks like this:

Dataset<Row> fullData = sparkSession.read().json("some_path");
ExecutorService executor = Executors.newFixedThreadPool(10);
List<Runnable> tasks = Lists.newArrayList();
for (int i = 1; i <= 50; i++) {
    final int x = i;
    tasks.add(() -> {
        Dataset<Row> subset_1 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> subset_2 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> result = subset_1.join(subset_2, ...);
        log.info("Res size is " + result.count()); // force Spark do the join operation
    });
}
CompletableFuture<?>[] futures = tasks.stream()
    .map(task -> CompletableFuture.runAsync(task, executor))
    .toArray(CompletableFuture[]::new);
CompletableFuture.allOf(futures).join();
executor.shutdown();

From the Spark job management UI, I noticed that those 50 tasks are submitted in parallel, but the processing is still blocking: one task does not start running until another completes. How can I make the tasks run in parallel instead of one after another?

This is not how you control parallelism in Spark. It's all controlled declaratively via configuration.
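For example, jobs submitted concurrently from separate driver threads (as in the code above) are scheduled FIFO by default; letting them share executors fairly is a configuration change rather than a threading change. A minimal sketch (the allocation-file path is illustrative):

```properties
# spark-defaults.conf (or pass via --conf on spark-submit):
# FAIR scheduling lets concurrently submitted jobs share task slots
# instead of queuing behind each other FIFO-style.
spark.scheduler.mode             FAIR
# Optional: define named pools with weights/minShare in an XML file.
spark.scheduler.allocation.file  /path/to/fairscheduler.xml
```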

Spark is a distributed computing framework, and it's meant to be used in a distributed environment where each worker runs single-threaded. Tasks are usually scheduled by Yarn, which has metadata about the nodes and may start multiple tasks on a single node (depending on memory and CPU constraints), but in separate JVMs.
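On Yarn, the degree of parallelism therefore comes from the resources you request at submit time, not from threads in the driver. A sketch (the class and jar names are hypothetical, the numbers illustrative):

```shell
# Each executor is a separate JVM; the application gets roughly
# num-executors * executor-cores task slots running in parallel.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4g \
  --class com.example.MyApp \
  my-app.jar
```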

In local mode you can have multiple workers realized as separate threads, so if you specify master("local[8]") you get 8 workers, each running as a thread in a single JVM.
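The same thing can be set at submit time instead of in code (hypothetical class and jar names):

```shell
# One JVM with 8 worker threads, so up to 8 tasks run concurrently.
# This flag applies when the application does not hard-code a master,
# since a master set on SparkConf in code takes precedence.
spark-submit --master "local[8]" --class com.example.MyApp my-app.jar
```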

How are you running your application?
