
Reading multiple files from different AWS S3 locations in Spark in parallel

I have a scenario where I need to read many files (in CSV or Parquet format) from S3 buckets in different locations and with different schemas.

My purpose is to extract the metadata from all these S3 locations, keep it as a DataFrame, and save it as a CSV file back in S3. The problem is that I have a lot of S3 locations to read the (partitioned) files from. My sample S3 locations look like:

s3://myRawbucket/source1/filename1/year/month/day/16/f1.parquet
s3://myRawbucket/source2/filename2/year/month/day/16/f2.parquet
s3://myRawbucket/source3/filename3/year/month/day/16/f3.parquet
s3://myRawbucket/source100/filename100/year/month/day/16/f100.parquet
s3://myRawbucket/source150/filename150/year/month/day/16/f150.parquet
... and so on

All I need to do is use Spark code to read these many files (around 200), apply some transformations if required, and extract the header information, row count, S3 location, and data types.
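For a single file, the information I am after looks roughly like this (just a sketch with one hypothetical path; doing this sequentially in a loop over ~200 locations is what I would like to avoid):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "s3://myRawbucket/source1/filename1/year/month/day/16/f1.parquet"
df = spark.read.parquet(path)   # or spark.read.csv(path, header=True) for CSV sources

metadata = {
    "s3_location": path,
    "header": df.columns,                                                # column names
    "datatypes": [f.dataType.simpleString() for f in df.schema.fields],  # column types
    "count": df.count(),                                                 # row count
}
print(metadata)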

What is an efficient way to read all these files (with different schemas), process them with Spark (DataFrame) code, and save the result as CSV in an S3 bucket? Please bear with me, as I am new to the Spark world. I am using Python (PySpark).

I think what you want to do is use some Python/Pandas logic and parallelize the jobs with Spark. Fugue is a good fit for that: you can port your logic to Spark with very minimal code changes. Let's first worry about defining the logic with Python and Pandas, and then we can bring it to Spark.

First, the setup:

import pandas as pd

df = pd.DataFrame({"x": [1,2,3]})
df.to_parquet("/tmp/1.parquet")
df.to_parquet("/tmp/2.parquet")
df.to_parquet("/tmp/3.parquet")

We need a small DataFrame with all the file paths to orchestrate the jobs with Spark. For example:

file_paths = pd.DataFrame({"path": ["/tmp/1.parquet",
                                    "/tmp/2.parquet",
                                    "/tmp/3.parquet"]})

Now we can create a function that holds the logic for each file. Note that when we bring it to Spark, we will create one "job" per file path, so our function only needs to handle one file at a time.

def process(df:pd.DataFrame) -> pd.DataFrame:
    path = df.iloc[0]['path']
    
    tmp = pd.read_parquet(path)
    
    # transformation
    tmp['y'] = tmp['x'] + 1
    
    # save
    tmp.to_parquet(path)
    
    # summary stats
    return pd.DataFrame({"path": [path],
                         'count': [tmp.shape[0]]})

We can test the code:

process(file_paths)

Which gives us:

path    count
/tmp/1.parquet  3

Now we can bring it to Spark using Fugue. We only need the transform() function to bring the logic to Spark; the schema is a requirement for Spark.

import fugue.api as fa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

out = fa.transform(file_paths, process, schema="path:str,count:int", engine=spark)

# out is a Spark DataFrame
out.show()

The output will be:

+--------------+-----+
|          path|count|
+--------------+-----+
|/tmp/1.parquet|    3|
|/tmp/2.parquet|    3|
|/tmp/3.parquet|    3|
+--------------+-----+
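Since out is a regular Spark DataFrame, the summary can be written back to S3 as CSV with the normal DataFrame writer (a sketch; the output path is a placeholder, and coalesce(1) is only used to produce a single CSV file):

(out.coalesce(1)           # collapse to one partition so a single CSV file is written
    .write
    .mode("overwrite")
    .option("header", True)
    .csv("s3://myRawbucket/metadata_summary/"))

The same pattern covers the rest of the metadata in the question: inside process() you can also return the column names and dtypes of tmp (for example joined into strings) as extra columns, as long as the schema string passed to transform() is updated to match.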
