

Is there any way to read multiple parquet paths from s3 in parallel using spark?

My data is stored in s3 (parquet format) under different paths, and I'm using spark.read.parquet(pathes:_*) to read all the paths into one dataframe. Unfortunately, Spark reads the parquet metadata sequentially (path after path) rather than in parallel. Once the metadata has been read, the data itself is read in parallel, but the metadata step is very slow and the machines are underutilized.
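For context, the current single-call read looks roughly like this (a minimal sketch; the bucket and path names are placeholders and spark is assumed to be an existing SparkSession):

val paths = Seq("s3://bucket/path-a", "s3://bucket/path-b", "s3://bucket/path-c")
// Spark resolves the parquet metadata for these paths one after another before the scan starts
val df = spark.read.parquet(paths: _*)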

Is there any way to read multiple parquet paths from s3 in parallel using spark?

I would appreciate hearing your opinion on this.

So after some time I've figured out that the way to achieve this is by reading each path on a different thread and unioning the results. For example:

import scala.collection.parallel.ForkJoinTaskSupport

val paths = List[String]("a", "b", "c")
val parallelPaths = paths.par
// Size the thread pool to the number of paths so every path gets its own thread
parallelPaths.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(paths.length))
// Read each path on its own thread (metadata included) and union the per-path DataFrames
parallelPaths.map(path => spark.read.parquet(path)).reduce(_ union _)
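One thing to keep in mind with this approach: it only parallelizes the driver-side metadata reading; the data scan itself is still scheduled by Spark as before. Unioning the per-path DataFrames also assumes they all share a compatible schema, since union matches columns by position.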
