How to process multiple files in different locations in parallel using Spark with Java?
Can anyone help me with how to process multiple files in different locations in parallel using Spark with Java?
Can anyone share sample code?
Submit the list of Callable tasks with executor.invokeAll:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class LoadDataFrameInParallel {

    public static void main(String[] args) throws InterruptedException {
        List<String> filePaths = new ArrayList<>(); // add your file paths here
        SparkSession spark = Constant.getSparkSess();
        List<LoadDataFrame> tasks = filePaths.stream()
                .map(filePath -> new LoadDataFrame(spark, filePath))
                .collect(Collectors.toList());
        ExecutorService executor = Executors.newFixedThreadPool(filePaths.size());
        // invokeAll blocks until every task has completed
        List<Future<Dataset<String>>> output = executor.invokeAll(tasks);
        output.forEach(future -> {
            try {
                future.get().count(); // trigger an action on each cached Dataset
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        });
        executor.shutdown(); // release the pool threads
    }
}

class LoadDataFrame implements Callable<Dataset<String>> {

    private final SparkSession spark;
    private final String filePath;

    public LoadDataFrame(SparkSession spark, String filePath) {
        this.spark = spark;
        this.filePath = filePath;
    }

    @Override
    public Dataset<String> call() {
        return spark.read().textFile(filePath).cache();
    }
}
Note that the code below is in Scala.
I kept the sample files in different folders; each file has a single column.
// Files are available in different directories.
scala> "tree /tmp/data".!
/tmp/data
├── dira
│ └── part-00000-bc205041-1c5e-4a49-a55f-50df2a223bea-c000.snappy.orc
├── dirb
│ ├── dirba
│ │ └── part-00000-76b89d6f-9874-4b7e-828e-542e0f133af8-c000.snappy.orc
│ └── part-00000-5368c87e-f09e-4ef4-ab4a-687476ebf128-c000.snappy.orc
├── dirc
│ └── part-00000-c1d17495-e9c7-46c4-a11e-9c02ffa9c170-c000.snappy.orc
└── dird
├── dire
│ └── part-00000-e78d344a-c0c8-4ec3-bef5-2bf556239e90-c000.snappy.orc
└── part-00000-d89271e0-8689-4926-af10-579dff74b752-c000.snappy.orc
6 directories, 12 files
scala> val paths = Seq("/tmp/data/dira","/tmp/data/dirb","/tmp/data/dirb/dirba","tmp/data/dirc","/tmp/data/dird","/tmp/data/dird/dire") // Build this list dynamically; it is static here for the example
paths: Seq[String] = List(/tmp/data/dira, /tmp/data/dirb, /tmp/data/dirb/dirba, tmp/data/dirc, /tmp/data/dird, /tmp/data/dird/dire)
scala> val dfr = spark.read.format("orc") // Creating DataFrameReader object once.
dfr: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@23d34ed
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
  // using .par to process the paths in parallel
  paths.zipWithIndex.par.map{ path => {
    println(s" Processing file - ${path._1} & Index - ${path._2}") // shows the parallel execution order
    dfr.load(path._1)
  }
  }.reduce(_ union _).show(10,false)
}
// Exiting paste mode, now interpreting.
Processing file - /tmp/data/dira & Index - 0
Processing file - tmp/data/dirc & Index - 3
Processing file - /tmp/data/dirb/dirba & Index - 2
Processing file - /tmp/data/dirb & Index - 1
Processing file - /tmp/data/dird & Index - 4
Processing file - /tmp/data/dird/dire & Index - 5
+----+
|id |
+----+
|1001|
|1002|
|1003|
|1004|
|1005|
|1006|
|1007|
|1008|
|1009|
|1010|
+----+
only showing top 10 rows
Time taken: 446 ms
scala>
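The Scala `.par` trick maps naturally to Java's parallel streams. Below is a hypothetical, Spark-free sketch of the same map-then-reduce shape; the class name is invented and the string building (`"df(" + path + ")"`, `" union "`) merely stands in for `dfr.load(path)` and `Dataset.union`:

```java
import java.util.List;

public class ParallelUnionDemo {

    // Parallel-stream analogue of paths.par.map(dfr.load).reduce(_ union _):
    // map runs on the common fork-join pool, then reduce combines pairwise.
    // Because the source list is ordered and the operation is associative,
    // the final result is the same as the sequential one.
    static String unionAll(List<String> paths) {
        return paths.parallelStream()
                .map(p -> "df(" + p + ")")           // stand-in for dfr.load(path)
                .reduce((a, b) -> a + " union " + b) // stand-in for Dataset.union
                .orElse("<empty>");
    }

    public static void main(String[] args) {
        System.out.println(unionAll(List.of("/tmp/data/dira", "/tmp/data/dirb")));
    }
}
```

As in the Scala version, parallelism changes only when each element is mapped, not the order of the combined result.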