
How to retrieve the output file from Amazon S3 generated from EMR Pyspark back into Flask

I am currently trying to connect my Flask application to Amazon EMR using PySpark. I am using the PySpark example from AWS ( https://docs.aws.amazon.com/code-samples/latest/catalog/code-catalog-python-example_code-emr.html ). I use the following code to output the files:

df.write.mode('overwrite').csv('s3://my-bucket/output')

The output files from Amazon EMR are stored in Amazon S3 with the following names:

  1. part-00003-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv
  2. part-00007-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv
  3. part-00011-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv

I would like to read the CSV files into my Flask application. How am I supposed to read these files since the filenames are different every time? Is there any smarter way to do it?

I assume that you are trying to read them into one dataframe (and, as per your comment, the 'part' prefix is common to all of the files):

import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

# Match every object whose key starts with output/part, regardless of
# the random suffix Spark appends to each part file
prefix_objs = bucket.objects.filter(Prefix="output/part")

prefix_df = []

for obj in prefix_objs:
    try:
        body = obj.get()['Body'].read()
        temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
        prefix_df.append(temp)
    except pd.errors.EmptyDataError:
        # Spark can emit zero-byte part files; skip them
        continue

This will read every file in your bucket whose key starts with 'part' under the output folder and append each one to the list.

Afterwards, you can concatenate them into a single dataframe:

pd.concat(prefix_df)
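
If the goal is to serve the combined data from a Flask view, here is a minimal sketch of how the pieces could fit together (the route name /output and the helper load_output_df are assumptions for illustration, not part of your code):

import io

import boto3
import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

def load_output_df(bucket_name='my-bucket', prefix='output/part'):
    # Hypothetical helper: wraps the boto3/pandas loop above and
    # returns the concatenated dataframe
    s3 = boto3.resource('s3')
    frames = []
    for obj in s3.Bucket(bucket_name).objects.filter(Prefix=prefix):
        body = obj.get()['Body'].read()
        frames.append(pd.read_csv(io.BytesIO(body), header=None))
    return pd.concat(frames, ignore_index=True)

@app.route('/output')
def output():
    # Read the part files fresh on each request and return them as JSON
    df = load_output_df()
    return jsonify(df.to_dict(orient='records'))

Since the part filenames change on every run, filtering by the stable prefix rather than by exact filename is what makes this work.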
