
Spark run multiple operations in parallel

I have a Spark application that picks a subset of a dataset and performs some operations on it. There is no dependency or interaction between the subsets and their operations, so I tried to use multiple threads to run them in parallel and improve performance. The code looks like this:

Dataset<Row> fullData = sparkSession.read().json("some_path");
ExecutorService executor = Executors.newFixedThreadPool(10);
List<Runnable> tasks = Lists.newArrayList();
for (int i = 1; i <= 50; i++) {
    final int x = i;
    tasks.add(() -> {
        Dataset<Row> subset_1 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> subset_2 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> result = subset_1.join(subset_2, ...);
        log.info("Res size is " + result.count()); // force Spark do the join operation
    });
}
CompletableFuture<?>[] futures = tasks.stream()
    .map(task -> CompletableFuture.runAsync(task, executor))
    .toArray(CompletableFuture[]::new);
CompletableFuture.allOf(futures).join();
executor.shutdown();

From the Spark job management UI, I noticed that those 50 tasks are submitted in parallel, but the processing is still blocking: one task does not start running until another completes. How can I make the tasks run in parallel instead of one after another?

This is not how you control parallelism in Spark. It's all controlled declaratively via configuration.
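For example, jobs submitted concurrently from separate driver threads (as in the code above) are scheduled FIFO by default; letting them share executors fairly is a configuration change rather than a threading change. A minimal sketch (the allocation-file path is illustrative):

```properties
# spark-defaults.conf (or pass via --conf on spark-submit):
# FAIR scheduling lets concurrently submitted jobs share task slots
# instead of queuing behind each other FIFO-style.
spark.scheduler.mode             FAIR
# Optional: define named pools with weights/minShare in an XML file.
spark.scheduler.allocation.file  /path/to/fairscheduler.xml
```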

Spark is a distributed computing framework, and it's meant to be used in a distributed environment where each worker runs single-threaded. Tasks are usually scheduled by Yarn, which has metadata about the nodes and may start multiple tasks on a single node (depending on memory and CPU constraints), but in separate JVMs.
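On Yarn, the degree of parallelism therefore comes from the resources you request at submit time, not from threads in the driver. A sketch (the class and jar names are hypothetical, the numbers illustrative):

```shell
# Each executor is a separate JVM; the application gets roughly
# num-executors * executor-cores task slots running in parallel.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4g \
  --class com.example.MyApp \
  my-app.jar
```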

In local mode you can have multiple workers realized as separate threads, so if you specify master("local[8]") you get 8 workers, each running as a thread in a single JVM.
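The same thing can be set at submit time instead of in code (hypothetical class and jar names):

```shell
# One JVM with 8 worker threads, so up to 8 tasks run concurrently.
# This flag applies when the application does not hard-code a master,
# since a master set on SparkConf in code takes precedence.
spark-submit --master "local[8]" --class com.example.MyApp my-app.jar
```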

How are you running your application?
