
How to retrieve the output file from Amazon S3 generated from EMR Pyspark back into Flask

I am currently trying to connect my Flask application to Amazon EMR using PySpark. I am using the PySpark example from AWS ( https://docs.aws.amazon.com/code-samples/latest/catalog/code-catalog-python-example_code-emr.html ). I use the following code to output the files:

df.write.mode('overwrite').csv('s3://my-bucket/output')

The output files from Amazon EMR are stored in Amazon S3 with the following names:

  1. part-00003-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv
  2. part-00007-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv
  3. part-00011-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv

I would like to read the CSV files into my Flask application. How am I supposed to read these files since the filenames are different every time? Is there any smarter way to do it?

I assume that you are trying to read them into one dataframe (and, as per your comment, the 'part' prefix is common to all of the files):

import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

# Match every object whose key starts with output/part, regardless of
# the random suffix Spark appends to each part file
prefix_objs = bucket.objects.filter(Prefix="output/part")

prefix_df = []

for obj in prefix_objs:
    try:
        body = obj.get()['Body'].read()
        temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
        prefix_df.append(temp)
    except pd.errors.EmptyDataError:
        # Spark can emit zero-byte part files; skip them
        continue

This will read every file in your bucket whose key starts with 'part' under the output folder and append each one to the list.

Afterwards, you can concatenate them into a single dataframe:

pd.concat(prefix_df)
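
If the goal is to serve the combined data from a Flask view, here is a minimal sketch of how the pieces could fit together (the route name /output and the helper load_output_df are assumptions for illustration, not part of your code):

import io

import boto3
import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

def load_output_df(bucket_name='my-bucket', prefix='output/part'):
    # Hypothetical helper: wraps the boto3/pandas loop above and
    # returns the concatenated dataframe
    s3 = boto3.resource('s3')
    frames = []
    for obj in s3.Bucket(bucket_name).objects.filter(Prefix=prefix):
        body = obj.get()['Body'].read()
        frames.append(pd.read_csv(io.BytesIO(body), header=None))
    return pd.concat(frames, ignore_index=True)

@app.route('/output')
def output():
    # Read the part files fresh on each request and return them as JSON
    df = load_output_df()
    return jsonify(df.to_dict(orient='records'))

Since the part filenames change on every run, filtering by the stable prefix rather than by exact filename is what makes this work.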
