
Writing multiple parquet files in parallel

I have a large Spark Dataset (Java) and I need to apply filters to produce multiple datasets, then write each dataset out as Parquet.

Does Spark's Java API provide any feature to write all the Parquet files in parallel? I am trying to avoid doing it sequentially.

One option is to use Java threads; is there any other way to do it?

Spark writes Parquet files in parallel automatically. The degree of parallelism depends on how many executor cores you provide as well as the number of partitions of the DataFrame. You can try df.write.parquet("/location/to/hdfs") and check how long the write takes.
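For illustration, here is a minimal sketch of that idea (the paths, input format, and partition count are placeholders, not from the original post): each partition of the DataFrame becomes one write task, so repartitioning controls how many Parquet part-files can be written concurrently, given enough executor cores.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetWriteParallelism {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParquetWriteParallelism")
                .getOrCreate();

        // Placeholder input path and format
        Dataset<Row> df = spark.read().json("input/path");

        // Repartitioning changes the number of write tasks (and part-files);
        // with enough executor cores, these tasks run concurrently.
        df.repartition(8)
                .write()
                .mode("overwrite")
                .parquet("/location/to/hdfs");

        spark.stop();
    }
}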

Yes, by default Spark provides parallelism via its executors, but if you also want to achieve parallelism on the driver side, you can do something like:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;

public class ParallelSparkWrite {

    public static void main(String[] args) {
        SparkSession spark = Constant.getSparkSess();

        Dataset<Row> ds = spark.read().json("input/path");

        // Populate this list with the values you want to filter on
        List<String> filterValue = new ArrayList<>();

        // Process the filter values on a parallel stream so that several
        // Spark jobs are submitted from the driver concurrently
        filterValue.parallelStream()
                .forEach(filter ->
                        // Filter the Dataset and write each subset to its own output path
                        ds.filter(ds.col("col1").equalTo(filter))
                                .write()
                                .json("/output/path/" + filter + ".json"));
    }
}
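Since the question asks for Parquet output, the same pattern works with the Parquet writer; a minimal adaptation, reusing ds and filterValue from the snippet above (the output path layout is just an assumption):

// Same driver-side parallelism, but writing each filtered subset as Parquet.
filterValue.parallelStream()
        .forEach(filter ->
                ds.filter(ds.col("col1").equalTo(filter))
                        .write()
                        .mode("overwrite")
                        .parquet("/output/path/" + filter));

Each write still runs as a distributed Spark job across the executors; the parallel stream only lets several such jobs be submitted from the driver at the same time instead of one after another.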
