How to combine parquet files in S3 into a single parquet file using Spark

Question

I have 12 parquet files. Each file holds one month of New York taxi pick-up and drop-off data and contains more than 500K rows. I want to combine all 12 files row-wise into a single parquet file and save it in S3 so I can train a machine learning model on it. How can I do that using PySpark? I will upload these 12 files to AWS S3.

Answer 1

If all the files are in the same directory, you can do something like this:

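// Read every parquet file under /path, then collapse the result to one partition
val ds = spark.read.parquet("/path/*").coalesce(1)
// With a single partition, Spark writes a single part file into the output directory
ds.write.parquet("/path/single")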

or

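// Read files that live in different locations
val ds1 = spark.read.parquet("/path1/file")
val ds2 = spark.read.parquet("/path2/anotherlocation/file")
// union appends the rows of ds2 to ds1; columns are matched by position
val ds = ds1.union(ds2)
// Collapse to one partition so the output contains a single part file
ds.coalesce(1).write.parquet("/path/single")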

This example uses Scala; you can do the same in Java or Python.
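Since the question asks about PySpark, here is a rough Python sketch of the same idea. It is not a tested recipe for your setup. The helper name `combine_parquet`, the bucket name, and the file names in the usage comment are made up for illustration. It also assumes your Spark installation is already configured to reach S3 through `s3a://` paths.

```python
def combine_parquet(spark, input_paths, output_path):
    """Read several parquet paths as one DataFrame and write a single part file.

    `spark` is an existing SparkSession; `input_paths` is a list of path strings.
    """
    # DataFrameReader.parquet accepts several paths and stacks their rows.
    combined = spark.read.parquet(*input_paths)
    # coalesce(1) leaves one partition, so the write produces one part file.
    combined.coalesce(1).write.parquet(output_path)


# Example usage (needs pyspark and S3 access; bucket and file names are hypothetical):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("combine-parquet").getOrCreate()
#   paths = [f"s3a://my-bucket/taxi/month-{m:02d}.parquet" for m in range(1, 13)]
#   combine_parquet(spark, paths, "s3a://my-bucket/taxi/combined")
```

Note that the output path is a directory: Spark writes one `part-*.parquet` file inside it rather than a file with the exact name you pass. Because `coalesce(1)` funnels all rows through a single task, the write can be slow for large inputs.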

Solution 1
0 2022-07-12 19:25:17