
Writing multiple parquet files in parallel

I have a big Spark Dataset (Java) and I need to apply filters to get multiple datasets and write each dataset to a Parquet file.

Does Java Spark provide any feature that can write all the Parquet files in parallel? I am trying to avoid doing it sequentially.

Another option is to use a Java Thread; is there any other way to do it?

Spark will automatically write Parquet files in parallel. How parallel the write is also depends on how many executor cores you provide and on the number of partitions of the DataFrame. You can try df.write().parquet("/location/to/hdfs") and measure how long the write takes.
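For illustration, here is a minimal sketch of that approach (the input path and the partition count of 8 are placeholder assumptions, not values from the question); each partition becomes one part-file, and the partitions are written in parallel across the available executor cores:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetWriteExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParquetWriteExample")
                .getOrCreate();

        // Hypothetical input path
        Dataset<Row> df = spark.read().json("input/path");

        // Control how many part-files are produced (8 is an arbitrary example value);
        // the write itself runs in parallel across the executor cores.
        df.repartition(8)
          .write()
          .parquet("/location/to/hdfs");

        spark.stop();
    }
}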

Yes, by default Spark provides parallelism through its executors, but if you also want to achieve parallelism on the driver, you can do something like:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;

public class ParallelSparkWrite {

    public static void main(String[] args) {
        // Obtain a SparkSession (the original snippet used a Constant.getSparkSess() helper)
        SparkSession spark = SparkSession.builder()
                .appName("ParallelSparkWrite")
                .getOrCreate();

        Dataset<Row> ds = spark.read().json("input/path");

        // Populate this list with the values you want to split the dataset by
        List<String> filterValue = new ArrayList<>();

        // Process the filter values on a parallel stream so each write is
        // submitted from its own driver thread
        filterValue.parallelStream()
                .forEach(filter -> {
                    // Filter the Dataset and write each subset as Parquet in parallel
                    ds.filter(ds.col("col1").equalTo(filter))
                            .write()
                            .parquet("/output/path/" + filter);
                });
    }
}
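This driver-side approach works because Spark's scheduler is thread-safe: each filtered write submitted from the parallel stream becomes its own Spark job, and jobs submitted from different threads can run concurrently, limited by the executor resources available to the application.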
