
How to process multiple files in different locations in parallel using Spark with Java?

Can anybody help me with how to process multiple files from different locations in parallel using Spark with Java?

Can anybody share sample code?

  • Create a Callable class
  • Each Callable loads a DataFrame from its file and caches it
  • The driver class creates an ExecutorService with n threads, where n > 1
  • Create a list of Callable tasks
  • Submit the list of tasks using executor.invokeAll
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class LoadDataFrameInParallel {

    public static void main(String[] args) throws InterruptedException {
        List<String> filePaths = new ArrayList<>(); // populate with your input paths
        SparkSession spark = Constant.getSparkSess(); // your own session factory
        // One Callable per file; each loads and caches its Dataset.
        List<LoadDataFrame> tasks = filePaths.stream()
                .map(filePath -> new LoadDataFrame(spark, filePath))
                .collect(Collectors.toList());
        ExecutorService executor = Executors.newFixedThreadPool(filePaths.size());
        // invokeAll blocks until every task has completed.
        List<Future<Dataset<String>>> output = executor.invokeAll(tasks);
        output.forEach(future -> {
            try {
                future.get().count(); // action that materializes the cache
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        });
        executor.shutdown();
    }

}

class LoadDataFrame implements Callable<Dataset<String>> {

    private final SparkSession spark;
    private final String filePath;

    public LoadDataFrame(SparkSession spark, String filePath) {
        this.spark = spark;
        this.filePath = filePath;
    }

    @Override
    public Dataset<String> call() {
        return spark.read().textFile(filePath).cache();
    }
}
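One caveat with the thread-pool approach: Spark jobs submitted concurrently from multiple threads are scheduled FIFO by default, so the parallel loads may still queue up behind each other on cluster resources. Enabling the fair scheduler lets concurrently submitted jobs share executors (a configuration sketch; the file location depends on your deployment):

```
# conf/spark-defaults.conf
spark.scheduler.mode  FAIR
```

The same setting can also be passed as a --conf flag or on the SparkSession builder.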

Please note the code below is in Scala.

I have kept sample files in different folders; each file has one column.

// Files are available in different directories.

scala> "tree /tmp/data".!
/tmp/data
├── dira
│   └── part-00000-bc205041-1c5e-4a49-a55f-50df2a223bea-c000.snappy.orc
├── dirb
│   ├── dirba
│   │   └── part-00000-76b89d6f-9874-4b7e-828e-542e0f133af8-c000.snappy.orc
│   └── part-00000-5368c87e-f09e-4ef4-ab4a-687476ebf128-c000.snappy.orc
├── dirc
│   └── part-00000-c1d17495-e9c7-46c4-a11e-9c02ffa9c170-c000.snappy.orc
└── dird
    ├── dire
    │   └── part-00000-e78d344a-c0c8-4ec3-bef5-2bf556239e90-c000.snappy.orc
    └── part-00000-d89271e0-8689-4926-af10-579dff74b752-c000.snappy.orc

6 directories, 12 files

scala> val paths = Seq("/tmp/data/dira","/tmp/data/dirb","/tmp/data/dirb/dirba","tmp/data/dirc","/tmp/data/dird","/tmp/data/dird/dire") // Get this list dynamically; for now I am adding a static list
paths: Seq[String] = List(/tmp/data/dira, /tmp/data/dirb, /tmp/data/dirb/dirba, tmp/data/dirc, /tmp/data/dird, /tmp/data/dird/dire)

scala> val dfr = spark.read.format("orc") // Creating DataFrameReader object once.
dfr: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@23d34ed

scala> :paste
// Entering paste mode (ctrl-D to finish)

spark.time {
  // using .par to process the paths in parallel.
  paths.zipWithIndex.par.map{path => {
        println(s" Processing file - ${path._1} & Index - ${path._2}") // printed to show parallel execution
        dfr.load(path._1)
    }
  }.reduce(_ union _).show(10,false)
}

// Exiting paste mode, now interpreting.

 Processing file - /tmp/data/dira & Index - 0
 Processing file - tmp/data/dirc & Index - 3
 Processing file - /tmp/data/dirb/dirba & Index - 2
 Processing file - /tmp/data/dirb & Index - 1
 Processing file - /tmp/data/dird & Index - 4
 Processing file - /tmp/data/dird/dire & Index - 5
+----+
|id  |
+----+
|1001|
|1002|
|1003|
|1004|
|1005|
|1006|
|1007|
|1008|
|1009|
|1010|
+----+
only showing top 10 rows

Time taken: 446 ms

scala>
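For readers who want to stay in Java, the fan-out idea behind the Scala `.par` example can be sketched with `parallelStream`. The snippet below uses a stand-in `load` method that just tags each path, so it runs without a cluster; in a real Spark job you would replace it with `dfr.load(path)` and combine the resulting Datasets with `union` (that adaptation is an assumption, not tested against a cluster):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelPaths {

    // Stand-in for dfr.load(path): tags the path so we can see it was processed.
    static String load(String path) {
        return "loaded:" + path;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/tmp/data/dira", "/tmp/data/dirb", "/tmp/data/dirc");

        // parallelStream fans the load calls out across the common ForkJoinPool,
        // mirroring the Scala .par behaviour; collect preserves encounter order.
        List<String> results = paths.parallelStream()
                .map(ParallelPaths::load)
                .collect(Collectors.toList());

        System.out.println(results);
    }
}
```

As with the ExecutorService version, the parallelism here is only in submitting the load jobs from the driver; the actual reads are still distributed by Spark.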
