
How to retrieve the output file from Amazon S3 generated from EMR Pyspark back into Flask

I am currently trying to connect my Flask application to Amazon EMR using PySpark. I am following the AWS example ( https://docs.aws.amazon.com/code-samples/latest/catalog/code-catalog-python-example_code-emr.html ) for the PySpark side. I use the following code to write the output files:

df.write.mode('overwrite').csv('s3://my-bucket/output')

The output files from Amazon EMR are stored inside Amazon S3 with the following names:

  1. part-00003-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv
  2. part-00007-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv
  3. part-00011-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv

I would like to read the CSV files into my Flask application. How am I supposed to read these files since the filenames are different every time? Is there any smarter way to do it?

I assume that you are trying to read them into one dataframe (and, as per your comment, the 'part' prefix is common to all of the files).

import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

# List every object whose key starts with "output/part", i.e. all of the
# part-*.csv files Spark wrote, regardless of the random suffix in their names.
prefix_objs = bucket.objects.filter(Prefix="output/part")

prefix_df = []

for obj in prefix_objs:
    try:
        body = obj.get()['Body'].read()
        # Spark wrote the CSVs without a header row, hence header=None.
        temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
        prefix_df.append(temp)
    except pd.errors.EmptyDataError:
        # Spark can emit zero-row partitions; skip those empty part files.
        continue

This reads every file in your bucket whose name starts with 'part' in the output folder and appends each one to the list as a dataframe.

Afterwards, you can concatenate them into a single dataframe (ignore_index=True resets the row index, which otherwise restarts at 0 for each part file):

df = pd.concat(prefix_df, ignore_index=True)
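
To get the data back into Flask, here is a minimal sketch (the helper name, the route name, and the idea of serving the combined result as one CSV are my assumptions, not part of the original answer) that wraps the steps above in a function and exposes it from a route:

import io

import boto3
import pandas as pd
from flask import Flask, Response

app = Flask(__name__)

def load_emr_output(bucket_name='my-bucket', prefix='output/part'):
    # Hypothetical helper: read all part-*.csv files under the prefix
    # into one dataframe, using the same logic as above.
    bucket = boto3.resource('s3').Bucket(bucket_name)
    frames = []
    for obj in bucket.objects.filter(Prefix=prefix):
        try:
            body = obj.get()['Body'].read()
            frames.append(pd.read_csv(io.BytesIO(body), header=None))
        except pd.errors.EmptyDataError:
            continue  # skip zero-row part files
    return pd.concat(frames, ignore_index=True)

@app.route('/emr-output')  # hypothetical route name
def emr_output():
    df = load_emr_output()
    # Serve the combined output as a single CSV response.
    return Response(df.to_csv(index=False, header=False), mimetype='text/csv')

This way the varying part-file names never matter to the Flask side; only the stable bucket name and prefix do.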
