
How to retrieve the output file from Amazon S3 generated from EMR Pyspark back into Flask

I am currently trying to connect my Flask application to Amazon EMR using PySpark. I am following the AWS example ( https://docs.aws.amazon.com/code-samples/latest/catalog/code-catalog-python-example_code-emr.html ) for the PySpark side. I use the following code to write the output files:

df.write.mode('overwrite').csv('s3://my-bucket/output')

The output files from Amazon EMR are stored inside Amazon S3 with the following names:

  1. part-00003-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv
  2. part-00007-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv
  3. part-00011-2e96c921-8459-4dc9-93e7-3c71eccd442f-c000.csv

I would like to read the CSV files into my Flask application. How am I supposed to read these files since the filenames are different every time? Is there any smarter way to do it?

I assume that you are trying to read them into one dataframe (and, as per your comment, the 'part' prefix is common to all of the files).

import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')

# List every object whose key starts with "output/part", i.e. all of the
# part-*.csv files Spark wrote, regardless of the random suffix in their names.
prefix_objs = bucket.objects.filter(Prefix="output/part")

prefix_df = []

for obj in prefix_objs:
    try:
        body = obj.get()['Body'].read()
        # Spark wrote the CSVs without a header row, hence header=None.
        temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
        prefix_df.append(temp)
    except pd.errors.EmptyDataError:
        # Spark can emit zero-row partitions; skip those empty part files.
        continue

This reads every file in your bucket whose name starts with 'part' in the output folder and appends each one to the list as a dataframe.

Afterwards, you can concatenate them into a single dataframe (ignore_index=True resets the row index, which otherwise restarts at 0 for each part file):

df = pd.concat(prefix_df, ignore_index=True)
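
To get the data back into Flask, here is a minimal sketch (the helper name, the route name, and the idea of serving the combined result as one CSV are my assumptions, not part of the original answer) that wraps the steps above in a function and exposes it from a route:

import io

import boto3
import pandas as pd
from flask import Flask, Response

app = Flask(__name__)

def load_emr_output(bucket_name='my-bucket', prefix='output/part'):
    # Hypothetical helper: read all part-*.csv files under the prefix
    # into one dataframe, using the same logic as above.
    bucket = boto3.resource('s3').Bucket(bucket_name)
    frames = []
    for obj in bucket.objects.filter(Prefix=prefix):
        try:
            body = obj.get()['Body'].read()
            frames.append(pd.read_csv(io.BytesIO(body), header=None))
        except pd.errors.EmptyDataError:
            continue  # skip zero-row part files
    return pd.concat(frames, ignore_index=True)

@app.route('/emr-output')  # hypothetical route name
def emr_output():
    df = load_emr_output()
    # Serve the combined output as a single CSV response.
    return Response(df.to_csv(index=False, header=False), mimetype='text/csv')

This way the varying part-file names never matter to the Flask side; only the stable bucket name and prefix do.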
