
Writing multiple parquet files in parallel

I have a large Spark Dataset (Java) and I need to apply filters to produce multiple datasets, then write each dataset out as Parquet.

Does Spark's Java API provide any feature to write all the Parquet files in parallel? I am trying to avoid doing it sequentially.

One option is to use Java threads; is there any other way to do it?

Spark writes Parquet files in parallel automatically. The degree of parallelism depends on how many executor cores you provide as well as the number of partitions of the DataFrame. You can try df.write.parquet("/location/to/hdfs") and check how long the write takes.
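For illustration, here is a minimal sketch of that idea (the paths, input format, and partition count are placeholders, not from the original post): each partition of the DataFrame becomes one write task, so repartitioning controls how many Parquet part-files can be written concurrently, given enough executor cores.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetWriteParallelism {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParquetWriteParallelism")
                .getOrCreate();

        // Placeholder input path and format
        Dataset<Row> df = spark.read().json("input/path");

        // Repartitioning changes the number of write tasks (and part-files);
        // with enough executor cores, these tasks run concurrently.
        df.repartition(8)
                .write()
                .mode("overwrite")
                .parquet("/location/to/hdfs");

        spark.stop();
    }
}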

Yes, by default Spark provides parallelism via its executors, but if you also want to achieve parallelism on the driver side, you can do something like:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;

public class ParallelSparkWrite {

    public static void main(String[] args) {
        SparkSession spark = Constant.getSparkSess();

        Dataset<Row> ds = spark.read().json("input/path");

        // Populate this list with the values you want to filter on
        List<String> filterValue = new ArrayList<>();

        // Process the filter values on a parallel stream so that several
        // Spark jobs are submitted from the driver concurrently
        filterValue.parallelStream()
                .forEach(filter ->
                        // Filter the Dataset and write each subset to its own output path
                        ds.filter(ds.col("col1").equalTo(filter))
                                .write()
                                .json("/output/path/" + filter + ".json"));
    }
}
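Since the question asks for Parquet output, the same pattern works with the Parquet writer; a minimal adaptation, reusing ds and filterValue from the snippet above (the output path layout is just an assumption):

// Same driver-side parallelism, but writing each filtered subset as Parquet.
filterValue.parallelStream()
        .forEach(filter ->
                ds.filter(ds.col("col1").equalTo(filter))
                        .write()
                        .mode("overwrite")
                        .parquet("/output/path/" + filter));

Each write still runs as a distributed Spark job across the executors; the parallel stream only lets several such jobs be submitted from the driver at the same time instead of one after another.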
