
Run multiple Spark operations in parallel

I have a Spark application that picks a subset of the data and performs some operation on it. There is no dependency or interaction between the subsets and their operations, so I tried to use multiple threads to run them in parallel and improve performance. The code looks like this:

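import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.length;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.google.common.collect.Lists; // assumption: Lists is Guava's helper
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// sparkSession and log are assumed to be defined elsewhere in the application.
Dataset<Row> fullData = sparkSession.read().json("some_path");
ExecutorService executor = Executors.newFixedThreadPool(10);
List<Runnable> tasks = Lists.newArrayList();
for (int i = 1; i <= 50; i++) {
    final int x = i; // effectively final copy for use inside the lambda
    tasks.add(() -> {
        Dataset<Row> subset_1 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> subset_2 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> result = subset_1.join(subset_2, ...);
        log.info("Res size is " + result.count()); // count() is an action, so it forces Spark to execute the join
    });
}
// Submit every task to the pool, then block until all of them have finished.
CompletableFuture<?>[] futures = tasks.stream()
    .map(task -> CompletableFuture.runAsync(task, executor))
    .toArray(CompletableFuture[]::new);
CompletableFuture.allOf(futures).join();
executor.shutdown();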

In the Spark job management UI, I noticed that those 50 tasks are submitted in parallel, but the processing is still blocking: one task starts running only after another completes. How can I make the tasks run in parallel instead of one after another?
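One way to check that the thread pool itself is not what serializes the work is to run the same submission pattern with a sleep in place of the Spark job. The sketch below is only an illustration: the class name PoolConcurrencyProbe and the peakConcurrency helper are hypothetical, and Thread.sleep stands in for result.count(). It uses nothing but the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the original application.
class PoolConcurrencyProbe {

    // Runs taskCount sleeping tasks on a fixed pool and returns the
    // highest number of tasks that were observed running at the same time.
    static int peakConcurrency(int poolSize, int taskCount, long sleepMillis) {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                futures.add(CompletableFuture.runAsync(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max); // remember the high-water mark
                    try {
                        Thread.sleep(sleepMillis); // stand-in for the Spark action
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }, executor));
            }
            // Same waiting pattern as in the question: block until every task is done.
            CompletableFuture<?>[] all = futures.toArray(new CompletableFuture<?>[0]);
            CompletableFuture.allOf(all).join();
        } finally {
            executor.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // Pool size and task count mirror the question; the sleep length is arbitrary.
        System.out.println("Peak concurrent tasks: " + peakConcurrency(10, 50, 100));
    }
}
```

If the reported peak is greater than 1, the tasks do run concurrently on the Java side, which would suggest that any one-after-another behaviour comes from how Spark schedules the submitted jobs rather than from the executor.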

This is not how you control parallelism in Spark. It's all controlled declaratively via configuration.

Spark is a distributed computing framework, meant to be used in a distributed environment where each worker is run single-threaded. Tasks are usually scheduled using YARN, which has metadata about the nodes and may start multiple tasks on a single node (depending on memory and CPU constraints), but in separate JVMs.

In local mode you can have multiple workers realized as separate threads, so if you specify master("local[8]") you will get 8 workers, each running as a thread in a single JVM.
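As a rough illustration of the local-mode setting described above, a minimal sketch might look like the following. The class name LocalModeSketch and the application name are made up for the example, and it assumes the Spark SQL library is on the classpath.

```java
import org.apache.spark.sql.SparkSession;

// Hypothetical class, for illustration only.
class LocalModeSketch {
    public static void main(String[] args) {
        // "local[8]" asks Spark to run inside this JVM with 8 worker threads.
        SparkSession spark = SparkSession.builder()
            .appName("local-mode-sketch") // hypothetical application name
            .master("local[8]")
            .getOrCreate();

        // Prints the default parallelism Spark derived from the master setting.
        System.out.println("Default parallelism: "
            + spark.sparkContext().defaultParallelism());

        spark.stop();
    }
}
```

When the application is launched through spark-submit, the master is usually supplied on the command line instead of being hard-coded like this.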

How are you running your application?
