
How to process multiple files in different locations in parallel using Spark with Java?

Can anybody help me with how to process multiple files from different locations in parallel using Spark with Java?

Can anybody share sample code?

  • Create a Callable class
  • Each Callable loads a DataFrame from its file and caches it
  • The driver class creates an ExecutorService with n threads, where n > 1
  • Create a list of Callable tasks
  • Submit the list of tasks using executor.invokeAll
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class LoadDataFrameInParallel {

    public static void main(String[] args) throws InterruptedException {
        List<String> filePaths = new ArrayList<>(); // populate with your input paths
        SparkSession spark = Constant.getSparkSess(); // your own session factory
        // One Callable per file; each loads and caches its Dataset.
        List<LoadDataFrame> tasks = filePaths.stream()
                .map(filePath -> new LoadDataFrame(spark, filePath))
                .collect(Collectors.toList());
        ExecutorService executor = Executors.newFixedThreadPool(filePaths.size());
        // invokeAll blocks until every task has completed.
        List<Future<Dataset<String>>> output = executor.invokeAll(tasks);
        output.forEach(future -> {
            try {
                future.get().count(); // action that materializes the cache
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        });
        executor.shutdown();
    }

}

class LoadDataFrame implements Callable<Dataset<String>> {

    private final SparkSession spark;
    private final String filePath;

    public LoadDataFrame(SparkSession spark, String filePath) {
        this.spark = spark;
        this.filePath = filePath;
    }

    @Override
    public Dataset<String> call() {
        return spark.read().textFile(filePath).cache();
    }
}
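One caveat with the thread-pool approach: Spark jobs submitted concurrently from multiple threads are scheduled FIFO by default, so the parallel loads may still queue up behind each other on cluster resources. Enabling the fair scheduler lets concurrently submitted jobs share executors (a configuration sketch; the file location depends on your deployment):

```
# conf/spark-defaults.conf
spark.scheduler.mode  FAIR
```

The same setting can also be passed as a --conf flag or on the SparkSession builder.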

Please note the code below is in Scala.

I have kept sample files in different folders; each file has one column.

// Files are available in different directories.

scala> "tree /tmp/data".!
/tmp/data
├── dira
│   └── part-00000-bc205041-1c5e-4a49-a55f-50df2a223bea-c000.snappy.orc
├── dirb
│   ├── dirba
│   │   └── part-00000-76b89d6f-9874-4b7e-828e-542e0f133af8-c000.snappy.orc
│   └── part-00000-5368c87e-f09e-4ef4-ab4a-687476ebf128-c000.snappy.orc
├── dirc
│   └── part-00000-c1d17495-e9c7-46c4-a11e-9c02ffa9c170-c000.snappy.orc
└── dird
    ├── dire
    │   └── part-00000-e78d344a-c0c8-4ec3-bef5-2bf556239e90-c000.snappy.orc
    └── part-00000-d89271e0-8689-4926-af10-579dff74b752-c000.snappy.orc

6 directories, 12 files

scala> val paths = Seq("/tmp/data/dira","/tmp/data/dirb","/tmp/data/dirb/dirba","tmp/data/dirc","/tmp/data/dird","/tmp/data/dird/dire") // Get this list dynamically; for now I am adding a static list
paths: Seq[String] = List(/tmp/data/dira, /tmp/data/dirb, /tmp/data/dirb/dirba, tmp/data/dirc, /tmp/data/dird, /tmp/data/dird/dire)

scala> val dfr = spark.read.format("orc") // Creating DataFrameReader object once.
dfr: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@23d34ed

scala> :paste
// Entering paste mode (ctrl-D to finish)

spark.time {
  // using .par to process the paths in parallel.
  paths.zipWithIndex.par.map{path => {
        println(s" Processing file - ${path._1} & Index - ${path._2}") // printed to show parallel execution
        dfr.load(path._1)
    }
  }.reduce(_ union _).show(10,false)
}

// Exiting paste mode, now interpreting.

 Processing file - /tmp/data/dira & Index - 0
 Processing file - tmp/data/dirc & Index - 3
 Processing file - /tmp/data/dirb/dirba & Index - 2
 Processing file - /tmp/data/dirb & Index - 1
 Processing file - /tmp/data/dird & Index - 4
 Processing file - /tmp/data/dird/dire & Index - 5
+----+
|id  |
+----+
|1001|
|1002|
|1003|
|1004|
|1005|
|1006|
|1007|
|1008|
|1009|
|1010|
+----+
only showing top 10 rows

Time taken: 446 ms

scala>
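For readers who want to stay in Java, the fan-out idea behind the Scala `.par` example can be sketched with `parallelStream`. The snippet below uses a stand-in `load` method that just tags each path, so it runs without a cluster; in a real Spark job you would replace it with `dfr.load(path)` and combine the resulting Datasets with `union` (that adaptation is an assumption, not tested against a cluster):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelPaths {

    // Stand-in for dfr.load(path): tags the path so we can see it was processed.
    static String load(String path) {
        return "loaded:" + path;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/tmp/data/dira", "/tmp/data/dirb", "/tmp/data/dirc");

        // parallelStream fans the load calls out across the common ForkJoinPool,
        // mirroring the Scala .par behaviour; collect preserves encounter order.
        List<String> results = paths.parallelStream()
                .map(ParallelPaths::load)
                .collect(Collectors.toList());

        System.out.println(results);
    }
}
```

As with the ExecutorService version, the parallelism here is only in submitting the load jobs from the driver; the actual reads are still distributed by Spark.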
